Source author record

Cong Yao

Cong Yao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Vision, Machine Learning, math.CV, Computation and Language, Artificial Intelligence, eess.IV, math.AP

Catalog footprint

What is connected

26 works

7 topics

4 close collaborators


preprint 2026 arXiv

Diffeomorphic solutions of Ahlfors-Hopf equations

Here we advance the study of the boundary value problem for extremal functions of mean distortion and the associated Teichmüller spaces, interpolating between the classical examples of extremal quasiconformal mappings and the more recent approach through harmonic mappings (of extreme Dirichlet energy). In this paper we focus on the Ahlfors-Hopf differential \[ \Phi=\mathcal{A}(\mathbb{K}(w,h))h_w\,\overline{h_{\overline{w}}}\, \eta(h), \] where $h=f^{-1}$ is the pseudo-inverse of an extremal mapping $f$ for the problem \[ \inf_{f:\mathbb{D}\to\mathbb{D}}\int_\mathbb{D} \mathcal{A}(\mathbb{K}(z,f)) \; dz, \quad\quad \mathbb{K}(z,f) = \frac{|f_z|^2+|f_{\overline{z}}|^2}{|f_z|^2-|f_{\overline{z}}|^2}, \] where the infimum is taken over those homeomorphisms of finite distortion $f:\overline{\mathbb{D}}\to\overline{\mathbb{D}}$ with $f|\mathbb{S}=f_0$, typically a quasisymmetric barrier function. The inner-variational equations, an analogue of the Euler-Lagrange equations, show $\Phi$ is holomorphic at an extremal. Exploiting this Ahlfors-Hopf differential, we prove that an extreme point $f$ is a local diffeomorphism in $\mathbb{D}$, resolving some conjectures in [16].

preprint 2026 arXiv

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

preprint 2024 arXiv

LORE++: Logical Location Regression Network for Table Structure Recognition with Pre-training

Table structure recognition (TSR) aims at extracting tables in images into machine-understandable formats. Recent methods solve this problem by predicting the adjacency relations of detected cell boxes or learning to directly generate the corresponding markup sequences from the table images. However, existing approaches either count on additional heuristic rules to recover the table structures, or face challenges in capturing long-range dependencies within tables, resulting in increased complexity. In this paper, we propose an alternative paradigm. We model TSR as a logical location regression problem and propose a new TSR framework called LORE, standing for LOgical location REgression network, which for the first time regresses logical location as well as spatial location of table cells in a unified network. Our proposed LORE is conceptually simpler, easier to train, and more accurate than other paradigms of TSR. Moreover, inspired by the persuasive success of pre-trained models on a number of computer vision and natural language processing tasks, we propose two pre-training tasks to enrich the spatial and logical representations at the feature level of LORE, resulting in an upgraded version called LORE++. The incorporation of pre-training in LORE++ has proven to enjoy significant advantages, leading to a substantial enhancement in terms of accuracy, generalization, and few-shot capability compared to its predecessor. Experiments on standard benchmarks against methods of previous paradigms demonstrate the superiority of LORE++, which highlights the potential and promising prospect of the logical location regression paradigm for TSR.

preprint 2022 arXiv

On the uniqueness of extremal mappings of finite distortion

For an arbitrary convex function $\Psi:[1,\infty) \to [1,\infty)$, we consider uniqueness in the following two related extremal problems. Problem A (boundary value problem): Establish the existence of, and describe the mapping $f$, achieving \[ \inf_f \Big\{ \int_{\mathbb{D}} \Psi(\mathbb{K}(z,f))\; dz : f:\overline{\mathbb{D}} \to \overline{\mathbb{D}} \; \mbox{a homeomorphism in $W^{1,1}_{0}(\mathbb{D})+f_0$} \Big\}. \] Here the data $f_0:\overline{\mathbb{D}} \to \overline{\mathbb{D}}$ is a homeomorphism of finite distortion with $\int_{\mathbb{D}} \Psi(\mathbb{K}(z,f_0))\; dz<\infty$, a barrier. Next, suppose we are given two homeomorphic Riemann surfaces $R$ and $S$ and data $f_0:R \to S$, a diffeomorphism. Problem B (extremal in homotopy class): Establish the existence of, and describe the mapping $f$, achieving \[ \inf_f \Big\{ \int_R \Psi(\mathbb{K}(z,f))\; d\sigma(z) : \mbox{$f$ a homeomorphism homotopic to $f_0$} \Big\}. \] There are two basic obstructions to existence and regularity: first, the existence of an Ahlfors-Hopf differential, and second, that the minimiser is a homeomorphism. When these restrictions are met (as they often can be) we show uniqueness is assured. These results are established through a generalisation of the classical Reich-Strebel inequalities to this variational setting.

preprint 2022 arXiv

Revisiting Document Image Dewarping by Grid Regularization

This paper addresses the problem of document image dewarping, which aims at eliminating the geometric distortion in document images for document digitization. Instead of designing a better neural network to approximate the optical flow fields between the inputs and outputs, we pursue the best readability by taking the text lines and the document boundaries into account from a constrained optimization perspective. Specifically, our proposed method first learns the boundary points and the pixels in the text lines. It then introduces a novel grid regularization scheme based on a simple observation: the boundaries and text lines in both horizontal and vertical directions should be preserved after dewarping. To obtain the final forward mapping for dewarping, we solve an optimization problem with our proposed grid regularization. The experiments comprehensively demonstrate that our proposed approach outperforms prior art by large margins in terms of readability (measured by Character Error Rate and Edit Distance) while maintaining the best image quality on the publicly available DocUNet benchmark.
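
The two readability metrics named in this abstract can be sketched as follows. This is a minimal illustration that assumes Edit Distance means the Levenshtein distance between a reference transcription and the recognized text, and that Character Error Rate is that distance divided by the reference length. The function names are hypothetical, and the benchmark's own evaluation protocol (OCR engine, text normalization) is not specified here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the reference length (assumed definition)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under these assumptions, a lower value of either metric means the OCR output of the dewarped image is closer to the ground-truth text.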

preprint 2022 arXiv

Vision-Language Pre-Training for Boosting Scene Text Detectors

Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities: vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training, for the sake of enhancing the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM) and word-in-image prediction (WIP). The pre-trained model is able to produce more informative representations with richer semantics, which could readily benefit existing scene text detectors (such as EAST and PSENet) in the down-stream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm can significantly improve the performance of various representative text detectors, outperforming previous pre-training approaches. The code and pre-trained models will be publicly released.
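
The abstract names image-text contrastive learning (ITC) as a pretext task without giving its loss. A common formulation is the symmetric InfoNCE loss over a batch of matched image and text embeddings, and the sketch below assumes that form. The function names, the temperature of 0.07 and the use of NumPy are illustrative assumptions, not details from the paper.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def itc_loss(image_emb: np.ndarray, text_emb: np.ndarray,
             temperature: float = 0.07) -> float:
    """Symmetric contrastive loss; row i of each input is a matched pair."""
    # Cosine similarity: L2-normalize, then take all pairwise dot products.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: the same, column-wise.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_i2t + loss_t2i))
```

The loss is small when each image embedding is closest to its own text embedding and grows when the pairs are mismatched.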

preprint 2020 arXiv

A New Perspective for Flexible Feature Gathering in Scene Text Recognition Via Character Anchor Pooling

Irregular scene text recognition has attracted much attention from the research community, mainly due to the complexity of text shapes in natural scenes. However, recent methods either rely on shape-sensitive modules such as bounding box regression, or discard sequence learning. To tackle these issues, we propose a pair of coupled modules, termed Character Anchoring Module (CAM) and Anchor Pooling Module (APM), to extract high-level semantics from two-dimensional space to form feature sequences. The proposed CAM localizes the text in a shape-insensitive way by anchoring characters individually. APM then interpolates and gathers features flexibly along the character anchors, which enables sequence learning. The complementary modules realize a harmonious unification of spatial information and sequence learning. With the proposed modules, our recognition system surpasses previous state-of-the-art scores on irregular and perspective text datasets, including ICDAR 2015, CUTE, and Total-Text, while matching state-of-the-art performance on regular text datasets.

preprint 2020 arXiv

Differentiable Feature Aggregation Search for Knowledge Distillation

Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network with the supervision of the output distribution and feature maps from a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods comes at the cost of expensive computation. To address both the efficiency and the effectiveness of knowledge distillation, we introduce feature aggregation to imitate multi-teacher distillation in the single-teacher distillation framework by extracting informative supervision from multiple teacher feature maps. Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search, to efficiently find the aggregations. In the first stage, DFA formulates the search problem as a bi-level optimization and leverages a novel bridge loss, which consists of a student-to-teacher path and a teacher-to-student path, to find appropriate feature aggregations. The two paths act as two players against each other, trying to optimize the unified architecture parameters in opposite directions while guaranteeing both expressivity and learnability of the feature aggregation simultaneously. In the second stage, DFA performs knowledge distillation with the derived feature aggregation. Experimental results show that DFA outperforms existing methods on the CIFAR-100 and CINIC-10 datasets under various teacher-student settings, verifying the effectiveness and robustness of the design.

preprint 2020 arXiv

Higher regularity and uniqueness for inner variational equations

We study local minima of the $p$-conformal energy functionals, \[ \mathsf{E}_{\mathcal{A}}^\ast(h):=\int_{\mathbb{D}} \mathcal{A}(\mathbb{K}(w,h)) \;J(w,h) \; dw,\quad h|_{\mathbb{S}}=h_0|_{\mathbb{S}}, \] defined for self mappings $h:\mathbb{D}\to\mathbb{D}$ of the unit disk with finite distortion and prescribed boundary values $h_0$. Here $\mathbb{K}(w,h) = \frac{\|Dh(w)\|^2}{J(w,h)}$ is the pointwise distortion functional, and $\mathcal{A}:[1,\infty)\to [1,\infty)$ is convex and increasing with $\mathcal{A}(t)\approx t^p$ for some $p\geq 1$, with additional minor technical conditions. Note $\mathcal{A}(t)=t$ gives the Dirichlet energy functional. Critical points of $\mathsf{E}_{\mathcal{A}}^\ast$ satisfy the Ahlfors-Hopf inner-variational equation \[ \mathcal{A}'(\mathbb{K}(w,h))\, h_w\, \overline{h_{\overline{w}}} = \Phi \] where $\Phi$ is a holomorphic function. Iwaniec, Kovalev and Onninen established the Lipschitz regularity of critical points. Here we give a sufficient condition to ensure that a local minimum is a diffeomorphic solution to this equation, and that it is unique. This condition is necessarily satisfied by any locally quasiconformal critical point, and is basically the assumption $\mathbb{K}(w,h)\in L^1(\mathbb{D})\cap L^r_{loc}(\mathbb{D})$ for some $r>1$.

preprint 2020 arXiv

On Vocabulary Reliance in Scene Text Recognition

The pursuit of high performance on public benchmarks has been the driving force for research in scene text recognition, and notable progress has been achieved. However, a close investigation reveals a startling fact: state-of-the-art methods perform well on images with words within the vocabulary but generalize poorly to images with words outside the vocabulary. We call this phenomenon "vocabulary reliance". In this paper, we establish an analytical framework to conduct an in-depth study on the problem of vocabulary reliance in scene text recognition. Key findings include: (1) Vocabulary reliance is ubiquitous, i.e., all existing algorithms exhibit this characteristic to some degree; (2) Attention-based decoders prove weak in generalizing to words outside the vocabulary, and segmentation-based decoders perform well in utilizing visual features; (3) Context modeling is highly coupled with the prediction layers. These findings provide new insights and can benefit future research in scene text recognition. Furthermore, we propose a simple yet effective mutual learning strategy to allow models of two families (attention-based and segmentation-based) to learn collaboratively. This remedy alleviates the problem of vocabulary reliance and improves the overall scene text recognition performance.

preprint 2020 arXiv

Scene Text Detection and Recognition: The Deep Learning Era

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has inevitably been influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, approach and performance. This survey aims to summarize and analyze the major changes and significant progress of scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we emphasize the dramatic differences brought by deep learning and the grand challenges that remain. We expect that this review paper will serve as a reference for researchers in this field. Related resources are also collected and compiled in our GitHub repository: https://github.com/Jyouhou/SceneTextPapers.

preprint 2020 arXiv

TextScanner: Reading Characters in Order for Robust Scene Text Recognition

Driven by deep learning and the large volume of data, scene text recognition has evolved rapidly in recent years. Formerly, RNN-attention based methods dominated this field, but they suffer from the problem of attention drift in certain situations. Lately, semantic segmentation based algorithms have proven effective at recognizing text of different forms (horizontal, oriented and curved). However, these methods may produce spurious characters or miss genuine characters, as they rely heavily on a thresholding procedure operated on segmentation maps. To tackle these challenges, we propose in this paper an alternative approach, called TextScanner, for scene text recognition. TextScanner bears three characteristics: (1) Basically, it belongs to the semantic segmentation family, as it generates pixel-wise, multi-channel segmentation maps for character class, position and order; (2) Meanwhile, akin to RNN-attention based methods, it also adopts an RNN for context modeling; (3) Moreover, it performs parallel prediction for character position and class, and ensures that characters are transcribed in the correct order. The experiments on standard benchmark datasets demonstrate that TextScanner outperforms the state-of-the-art methods. Moreover, TextScanner shows its superiority in recognizing more difficult text such as Chinese transcripts and in aligning with target characters.

preprint 2020 arXiv

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Driven by deep neural networks and large scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with potentially variable radius and orientation. These geometric attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.
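
The disk-based representation described above can be illustrated with a small sketch that turns an ordered list of disks into a binary text-region mask by taking their union. This is an assumption-laden illustration rather than the paper's reconstruction procedure: the function name and the (x, y) center convention are hypothetical, and the per-disk orientation is ignored because it does not change the union of disks.

```python
import numpy as np


def disks_to_mask(centers, radii, height, width):
    """Rasterize the union of disks into a boolean mask of shape (height, width).

    centers: sequence of (x, y) points along the text center line (assumed order).
    radii:   one radius per center, in pixels.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=bool)
    for (cx, cy), r in zip(centers, radii):
        # A pixel belongs to the text region if it lies inside any disk.
        mask |= (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
    return mask


# Three overlapping disks following a gently curved center line.
example_mask = disks_to_mask([(5, 5), (8, 6), (11, 8)], [3, 3, 2], 16, 20)
```

Because the radius may vary from disk to disk, the same structure covers straight, oriented and curved text without changing the data format.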

preprint 2020 arXiv

The $L^p$ Teichmüller theory: Existence and regularity of critical points

We study minimisers of the $p$-conformal energy functionals, \[ \mathsf{E}_p(f):=\int_{\mathbb{D}} \mathbb{K}^p(z,f)\,dz,\quad f|_{\mathbb{S}}=f_0|_{\mathbb{S}}, \] defined for self mappings $f:\mathbb{D}\to\mathbb{D}$ with finite distortion and prescribed boundary values $f_0$. Here \[ \mathbb{K}(z,f) = \frac{\|Df(z)\|^2}{J(z,f)} = \frac{1+|\mu_f(z)|^2}{1-|\mu_f(z)|^2}\] is the pointwise distortion functional and $\mu_f(z)$ is the Beltrami coefficient of $f$. We show that for quasisymmetric boundary data the limiting regime $p\to\infty$ recovers the classical Teichmüller theory of extremal quasiconformal mappings (in part a result of Ahlfors), and the limiting regime $p\to1$ recovers the harmonic mapping theory. Critical points of $\mathsf{E}_p$ always satisfy the inner-variational distributional equation \[ 2p\int_{\mathbb{D}} \mathbb{K}^p\;\frac{\overline{\mu_f}}{1+|\mu_f|^2}\varphi_{\overline{z}} \; dz=\int_{\mathbb{D}} \mathbb{K}^p \; \varphi_z\; dz,\quad\forall\varphi\in C_0^\infty(\mathbb{D}). \] We establish the existence of minimisers in the a priori regularity class $W^{1,\frac{2p}{p+1}}(\mathbb{D})$ and show these minimisers have a pseudo-inverse $h$, a continuous $W^{1,2}(\mathbb{D})$ surjection of $\mathbb{D}$ with $(h\circ f)(z)=z$ almost everywhere. We then give a sufficient condition to ensure $C^{\infty}(\mathbb{D})$ smoothness of solutions to the distributional equation. For instance, $\mathbb{K}(z,f)\in L^r_{loc}(\mathbb{D})$ for any $r>p+1$ is enough to imply the solutions to the distributional equation are local diffeomorphisms. Further, $\mathbb{K}(w,h)\in L^1(\mathbb{D})$ will imply $h$ is a homeomorphism, and together these results yield a diffeomorphic minimiser. We show such higher regularity assumptions to be necessary for critical points of the inner variational equation.
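
The two expressions for the pointwise distortion (written $\mathbb{K}$ here) given in this abstract agree by a short computation. The sketch below assumes the normalization in which $\|Df\|^2$ is half the squared Hilbert-Schmidt norm of the differential, so that it equals $|f_z|^2+|f_{\overline{z}}|^2$; that normalization is an assumption made for this illustration and is not stated in the abstract.

```latex
\[
\|Df\|^2 = |f_z|^2 + |f_{\overline{z}}|^2, \qquad
J(z,f) = |f_z|^2 - |f_{\overline{z}}|^2, \qquad
\mu_f = \frac{f_{\overline{z}}}{f_z},
\]
\[
\mathbb{K}(z,f) = \frac{\|Df\|^2}{J(z,f)}
= \frac{|f_z|^2\,\bigl(1 + |\mu_f|^2\bigr)}{|f_z|^2\,\bigl(1 - |\mu_f|^2\bigr)}
= \frac{1+|\mu_f|^2}{1-|\mu_f|^2}.
\]
```

For example, $|\mu_f| = 1/2$ gives $\mathbb{K} = (1+1/4)/(1-1/4) = 5/3$, while $\mu_f = 0$ (a conformal point) gives the minimum value $\mathbb{K} = 1$.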

preprint 2020 arXiv

UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World

Synthetic data has been a critical tool for training scene text detection and recognition models. On the one hand, synthetic word images have proven to be a successful substitute for real images in training scene text recognizers. On the other hand, scene text detectors still rely heavily on a large amount of manually annotated real-world images, which are expensive. In this paper, we introduce UnrealText, an efficient image synthesis method that renders realistic images via a 3D graphics engine. The 3D synthesis engine provides realistic appearance by rendering scene and text as a whole, and allows for better text region proposals with access to precise scene information, e.g. surface normals and even object meshes. Comprehensive experiments verify its effectiveness on both scene text detection and recognition. We also generate a multilingual version for future research into multilingual scene text detection and recognition. Additionally, we re-annotate scene text recognition datasets in a case-sensitive way and include punctuation marks for more comprehensive evaluations. The code and the generated datasets are released at: https://github.com/Jyouhou/UnrealText/.

preprint 2016 arXiv

Effective Quantization Methods for Recurrent Neural Networks

Reducing the bit-widths of weights, activations, and gradients of a neural network can shrink its storage size and memory usage, and also allow for faster training and inference by exploiting bitwise operations. However, previous attempts at quantizing RNNs show considerable performance degradation when using low bit-width weights and activations. In this paper, we propose methods to quantize the structure of gates and interlinks in LSTM and GRU cells. In addition, we propose balanced quantization methods for weights to further reduce performance degradation. Experiments on the PTB and IMDB datasets confirm the effectiveness of our methods, as our models match or surpass the previous state of the art among quantized RNNs.
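
To make the idea of low bit-width weights concrete, here is a minimal sketch of a generic uniform k-bit quantizer. It is not the balanced quantization method proposed in the paper, whose details are not given in this abstract. The function names and the symmetric scaling by the largest absolute weight are assumptions chosen for illustration.

```python
import numpy as np


def quantize_uniform(x: np.ndarray, bits: int) -> np.ndarray:
    """Round values in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return np.round(np.clip(x, 0.0, 1.0) * levels) / levels


def quantize_weights(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantize weights symmetrically around zero (assumed scheme).

    Weights are mapped from [-s, s] to [0, 1] with s = max|w|, quantized,
    and mapped back, so the result takes at most 2**bits distinct values.
    """
    scale = np.abs(w).max()
    if scale == 0:
        return np.zeros_like(w)
    unit = w / (2 * scale) + 0.5
    return scale * (2 * quantize_uniform(unit, bits) - 1)
```

With 2 bits, for instance, every weight is replaced by one of four evenly spaced values between the negative and positive extremes, and the rounding error per weight is at most half the spacing between levels.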

preprint 2016 arXiv

Incidental Scene Text Understanding: Recent Progresses on ICDAR 2015 Robust Reading Competition Challenge 4

Unlike focused texts in natural images, which are captured with the user's intention and intervention, incidental texts usually exhibit much more diversity, variability and complexity, thus posing significant difficulties and challenges for scene text detection and recognition algorithms. The ICDAR 2015 Robust Reading Competition Challenge 4 was launched to assess the performance of existing scene text detection and recognition methods on incidental texts as well as to stimulate novel ideas and solutions. This report briefly introduces our strategies for this challenging problem and compares them with prior art in this field.

preprint 2016 arXiv

Multi-Oriented Text Detection with Fully Convolutional Networks

In this paper, we propose a novel approach for text detection in natural images. Both local and global cues are taken into account for localizing text lines in a coarse-to-fine procedure. First, a Fully Convolutional Network (FCN) model is trained to predict the salient map of text regions in a holistic manner. Then, text line hypotheses are estimated by combining the salient map and character components. Finally, another FCN classifier is used to predict the centroid of each character, in order to remove the false hypotheses. The framework is general for handling text in multiple orientations, languages and fonts. The proposed method consistently achieves the state-of-the-art performance on three text detection benchmarks: MSRA-TD500, ICDAR2015 and ICDAR2013.

preprint 2016 arXiv

Robust Scene Text Recognition with Automatic Rectification

Recognizing text in natural images is a challenging task with many unsolved problems. Unlike words in documents, words in natural images often have irregular shapes, which are caused by perspective distortion, curved character placement, etc. We propose RARE (Robust text recognizer with Automatic REctification), a recognition model that is robust to irregular text. RARE is a specially designed deep neural network, which consists of a Spatial Transformer Network (STN) and a Sequence Recognition Network (SRN). In testing, an image is first rectified via a predicted Thin-Plate-Spline (TPS) transformation into a more "readable" image for the following SRN, which recognizes text through a sequence recognition approach. We show that the model is able to recognize several types of irregular text, including perspective text and curved text. RARE is end-to-end trainable, requiring only images and associated text labels, making it convenient to train and deploy the model in practical systems. State-of-the-art or highly competitive performance on several benchmarks demonstrates the effectiveness of the proposed model.

preprint 2016 arXiv

Scene Text Detection via Holistic, Multi-Channel Prediction

Recently, scene text detection has become an active research topic in computer vision and document analysis, because of its great importance and significant challenge. However, the vast majority of existing methods detect text within local regions, typically by extracting character, word or line level candidates followed by candidate aggregation and false positive elimination, which potentially excludes the effect of wide-scope and long-range contextual cues in the scene. To take full advantage of the rich information available in the whole natural image, we propose to localize text in a holistic manner, by casting scene text detection as a semantic segmentation problem. The proposed algorithm runs directly on full images and produces global, pixel-wise prediction maps, from which detections are subsequently formed. To make better use of the properties of text, three types of information regarding text region, individual characters and their relationship are estimated with a single Fully Convolutional Network (FCN) model. With such predictions of text properties, the proposed algorithm can simultaneously handle horizontal, multi-oriented and curved text in real-world natural images. The experiments on standard benchmarks, including ICDAR 2013, ICDAR 2015 and MSRA-TD500, demonstrate that the proposed algorithm substantially outperforms previous state-of-the-art approaches. Moreover, we report the first baseline result on the recently released, large-scale dataset COCO-Text.

preprint 2016 arXiv

Training Bit Fully Convolutional Network for Fast Semantic Segmentation

Fully convolutional neural networks give accurate, per-pixel prediction for input images and have applications like semantic segmentation. However, a typical FCN usually requires a large amount of floating point computation and run-time memory, which effectively limits its usability. We propose a method to train Bit Fully Convolution Network (BFCN), a fully convolutional neural network that has low bit-width weights and activations. Because most of its computation-intensive convolutions are accomplished between low bit-width numbers, a BFCN can be accelerated by an efficient bit-convolution implementation. On CPU, the dot product operation between two bit vectors can be reduced to bitwise operations and popcounts, which can offer much higher throughput than 32-bit multiplications and additions. To validate the effectiveness of BFCN, we conduct experiments on the PASCAL VOC 2012 semantic segmentation task and Cityscapes. Our BFCN with 1-bit weights and 2-bit activations, which runs 7.8x faster on CPU or requires less than 1% of the resources on FPGA, can achieve performance comparable to that of its 32-bit counterpart.
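
The reduction of a bit-vector dot product to bitwise operations and popcounts can be sketched as follows. This is an illustration of the general identity, not the paper's bit-convolution kernel. The function names are hypothetical, and Python integers stand in for machine words.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer (bit i holds bits[i])."""
    word = 0
    for i, b in enumerate(bits):
        word |= (b & 1) << i
    return word


def popcount(word: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(word).count("1")


def binary_dot(a_word: int, b_word: int) -> int:
    """Dot product of two {0, 1} vectors: bitwise AND, then popcount."""
    return popcount(a_word & b_word)


def signed_binary_dot(a_word: int, b_word: int, n: int) -> int:
    """Dot product of two {-1, +1} vectors of length n (bit 1 means +1).

    Matching positions contribute +1 and mismatches -1, so the result is
    n - 2 * (number of mismatches), and mismatches are counted with XOR.
    """
    mismatches = popcount((a_word ^ b_word) & ((1 << n) - 1))
    return n - 2 * mismatches
```

One AND or XOR plus one popcount processes a whole word of elements at once, which is the source of the throughput advantage over element-wise multiplications and additions. Multi-bit values can be handled by splitting them into bit planes and weighting each plane's popcount by the corresponding power of two.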

preprint 2015 arXiv

An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary length, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over prior art. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which verifies its generality.

preprint 2015 arXiv

Automatic Script Identification in the Wild

With the rapid increase of transnational communication and cooperation, people frequently encounter multilingual scenarios in various situations. In this paper, we are concerned with a relatively new problem: script identification at word or line level in natural scenes. A large-scale dataset with a large number of natural images and 10 types of widely used languages is constructed and released. To address the challenges of script identification in real-world scenarios, a deep learning based algorithm is proposed. The experiments on the proposed dataset demonstrate that our algorithm achieves superior performance compared with conventional image classification methods, such as the original CNN architecture and LLC.

preprint 2015 arXiv

ICDAR 2015 Text Reading in the Wild Competition

Recently, text detection and recognition in natural scenes have become increasingly popular in the computer vision community as well as the document analysis community. However, the majority of existing ideas, algorithms and systems are specifically designed for English. This technical report presents the final results of the ICDAR 2015 Text Reading in the Wild (TRW 2015) competition, which aims at establishing a benchmark for assessing detection and recognition algorithms devised for both Chinese and English scripts and providing a playground for researchers from the community. In this article, we describe in detail the dataset, tasks, evaluation protocols and participants of this competition, and report the performance of the participating methods. Moreover, promising directions for future research are discussed.

preprint 2015 arXiv

Relaxed Multiple-Instance SVM with Application to Object Discovery

Multiple-instance learning (MIL) has served as an important tool for a wide range of vision applications, for instance, image classification, object detection, and visual tracking. In this paper, we propose a novel method to solve the classical MIL problem, named relaxed multiple-instance SVM (RMI-SVM). We treat the positiveness of each instance as a continuous variable, use a Noisy-OR model to enforce the MIL constraints, and jointly optimize the bag label and instance label in a unified framework. The optimization problem can be efficiently solved using stochastic gradient descent. Extensive experiments demonstrate that RMI-SVM consistently achieves superior performance on various benchmarks for MIL. Moreover, we applied RMI-SVM directly to a challenging vision task, common object discovery. The state-of-the-art results of object discovery on Pascal VOC datasets further confirm the advantages of the proposed method.
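
The Noisy-OR model mentioned above has a standard closed form: a bag is positive unless every one of its instances is negative. The sketch below shows only that aggregation step, assuming each instance already has a positiveness value in [0, 1]. The function name is hypothetical, and the RMI-SVM objective built on top of this model is not reproduced here.

```python
import numpy as np


def noisy_or(instance_probs) -> float:
    """Bag-level positive probability: 1 - prod_i (1 - p_i)."""
    p = np.clip(np.asarray(instance_probs, dtype=float), 0.0, 1.0)
    return float(1.0 - np.prod(1.0 - p))
```

This captures the MIL constraint in a soft form: a single confident positive instance drives the bag probability toward 1, while a bag of all-negative instances stays at 0.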

preprint 2014 arXiv

Deep Learning Representation using Autoencoder for 3D Shape Retrieval

We study the problem of how to build a deep learning representation for 3D shapes. Deep learning has been shown to be very effective in a variety of visual applications, such as image classification and object detection. However, it has not been successfully applied to 3D shape recognition. This is because a 3D shape has a complex structure in 3D space and there is a limited number of 3D shapes for feature learning. To address these problems, we project 3D shapes into 2D space and use an autoencoder for feature learning on the 2D images. High-accuracy 3D shape retrieval performance is obtained by aggregating the features learned on 2D images. In addition, we show the proposed deep learning feature is complementary to conventional local image descriptors. By combining the global deep learning representation and the local descriptor representation, our method obtains state-of-the-art performance on 3D shape retrieval benchmarks.
