Researcher profile

Cong Yao

Cong Yao contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
15 works
0 followers
7 topics
4 close collaborators

Published work

15 published items

preprint, 2026, arXiv

Diffeomorphic solutions of Ahlfors-Hopf equations

Here we advance the study of the boundary value problem for extremal functions of mean distortion and the associated Teichmüller spaces, interpolating between the classical examples of extremal quasiconformal mappings and the more recent approach through harmonic mappings (of extreme Dirichlet energy). In this paper we focus on the Ahlfors-Hopf differential \[ \Phi=\mathcal{A}(\mathbb{K}(w,h))h_w\,\overline{h_{\overline{w}}}\, \eta(h), \] where $h=f^{-1}$ is the pseudo-inverse of an extremal mapping $f$ for the problem \[ \inf_{f:\mathbb{D}\to\mathbb{D}}\int_\mathbb{D} \mathcal{A}(\mathbb{K}(z,f)) \; dz, \quad\quad \mathbb{K}(z,f) = \frac{|f_z|^2+|f_{\overline{z}}|^2}{|f_z|^2-|f_{\overline{z}}|^2}, \] where the infimum is taken over those homeomorphisms of finite distortion $f:\overline{\mathbb{D}}\to\overline{\mathbb{D}}$ with $f|\mathbb{S}=f_0$, typically a quasisymmetric barrier function. The inner-variational equations, an analogue of the Euler-Lagrange equations, show that $\Phi$ is holomorphic at an extremal. Exploiting this Ahlfors-Hopf differential, we prove that an extreme point $f$ is a local diffeomorphism in $\mathbb{D}$, resolving some conjectures in [16].
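As a concrete illustration of the distortion function $\mathbb{K}(z,f)$ defined in this abstract, the minimal Python sketch below evaluates it for an affine map $f(z)=az+b\overline{z}$, whose complex derivatives $f_z=a$ and $f_{\overline{z}}=b$ are constant. The function name `mean_distortion` and the example maps are illustrative assumptions, not taken from the paper.

```python
def mean_distortion(f_z: complex, f_zbar: complex) -> float:
    """Pointwise distortion K = (|f_z|^2 + |f_zbar|^2) / (|f_z|^2 - |f_zbar|^2).

    Assumes an orientation-preserving map, i.e. |f_z| > |f_zbar|.
    """
    num = abs(f_z) ** 2 + abs(f_zbar) ** 2
    den = abs(f_z) ** 2 - abs(f_zbar) ** 2
    if den <= 0:
        raise ValueError("Jacobian must be positive (|f_z| > |f_zbar|)")
    return num / den


# Hypothetical example: f(z) = 2z + conj(z), so f_z = 2 and f_zbar = 1.
k_affine = mean_distortion(2, 1)      # (4 + 1) / (4 - 1)
# A conformal map has f_zbar = 0, so its distortion is identically 1.
k_conformal = mean_distortion(1j, 0)
```

The distortion equals 1 exactly when $f_{\overline{z}}=0$ and grows without bound as $|f_{\overline{z}}|$ approaches $|f_z|$, which is why integrating a convex function of $\mathbb{K}$ penalises large local stretching.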

preprint, 2026, arXiv

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents

We present GLM-5V-Turbo, a step toward native foundation models for multimodal agents. As foundation models are increasingly deployed in real environments, agentic capability depends not only on language reasoning, but also on the ability to perceive, interpret, and act over heterogeneous contexts such as images, videos, webpages, documents, and GUIs. GLM-5V-Turbo is built around this objective: multimodal perception is integrated as a core component of reasoning, planning, tool use, and execution, rather than as an auxiliary interface to a language model. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks, while preserving competitive text-only coding capability. More importantly, our development process offers practical insights for building multimodal agents, highlighting the central role of multimodal perception, hierarchical optimization, and reliable end-to-end verification.

preprint, 2024, arXiv

LORE++: Logical Location Regression Network for Table Structure Recognition with Pre-training

Table structure recognition (TSR) aims at extracting tables in images into machine-understandable formats. Recent methods solve this problem by predicting the adjacency relations of detected cell boxes or learning to directly generate the corresponding markup sequences from the table images. However, existing approaches either count on additional heuristic rules to recover the table structures, or face challenges in capturing long-range dependencies within tables, resulting in increased complexity. In this paper, we propose an alternative paradigm. We model TSR as a logical location regression problem and propose a new TSR framework called LORE, standing for LOgical location REgression network, which for the first time regresses logical location as well as spatial location of table cells in a unified network. Our proposed LORE is conceptually simpler, easier to train, and more accurate than other paradigms of TSR. Moreover, inspired by the persuasive success of pre-trained models on a number of computer vision and natural language processing tasks, we propose two pre-training tasks to enrich the spatial and logical representations at the feature level of LORE, resulting in an upgraded version called LORE++. The incorporation of pre-training in LORE++ has proven to enjoy significant advantages, leading to a substantial enhancement in terms of accuracy, generalization, and few-shot capability compared to its predecessor. Experiments on standard benchmarks against methods of previous paradigms demonstrate the superiority of LORE++, which highlights the potential and promising prospect of the logical location regression paradigm for TSR.
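To make the notion of a cell's logical location more tangible, here is a minimal Python sketch that rebuilds a table grid from per-cell logical coordinates. It assumes, purely for illustration, that a logical location is encoded as inclusive, 0-based start and end indices for rows and columns; the abstract does not specify the encoding, and `cells_to_grid` is a hypothetical helper rather than part of LORE or LORE++.

```python
def cells_to_grid(cells):
    """Place cell texts into a 2D grid from their logical locations.

    Each cell is (text, row_start, row_end, col_start, col_end), with
    inclusive, 0-based indices (an assumed encoding for this sketch).
    """
    n_rows = max(c[2] for c in cells) + 1
    n_cols = max(c[4] for c in cells) + 1
    grid = [[None] * n_cols for _ in range(n_rows)]
    for text, r0, r1, c0, c1 in cells:
        # A spanning cell fills every grid slot it covers.
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                if grid[r][c] is not None:
                    raise ValueError(f"cells overlap at ({r}, {c})")
                grid[r][c] = text
    return grid


# Hypothetical table: one header spanning two columns, then two body cells.
example_grid = cells_to_grid([
    ("Header", 0, 0, 0, 1),
    ("a", 1, 1, 0, 0),
    ("b", 1, 1, 1, 1),
])
```

Once every cell carries such coordinates, the table structure can be read off directly, without adjacency heuristics or markup decoding.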

preprint, 2022, arXiv

On the uniqueness of extremal mappings of finite distortion

For an arbitrary convex function $\Psi:[1,\infty) \to [1,\infty)$, we consider uniqueness in the following two related extremal problems. Problem A (boundary value problem): Establish the existence of, and describe the mapping $f$, achieving \[ \inf_f \Big\{ \int_{\mathbb{D}} \Psi(\mathbb{K}(z,f))\; dz : f:\overline{\mathbb{D}} \to \overline{\mathbb{D}} \; \mbox{a homeomorphism in $W^{1,1}_{0}(\mathbb{D})+f_0$} \Big\}. \] Here the data $f_0:\overline{\mathbb{D}} \to \overline{\mathbb{D}}$ is a homeomorphism of finite distortion with $\int_{\mathbb{D}} \Psi(\mathbb{K}(z,f_0))\; dz<\infty$ (a barrier). Next, consider two homeomorphic Riemann surfaces $R$ and $S$ and data $f_0:R \to S$, a diffeomorphism. Problem B (extremal in homotopy class): Establish the existence of, and describe the mapping $f$, achieving \[ \inf_f \Big\{ \int_R \Psi(\mathbb{K}(z,f))\; d\sigma(z) : \mbox{$f$ a homeomorphism homotopic to $f_0$} \Big\}. \] There are two basic obstructions to existence and regularity. These are, first, the existence of an Ahlfors-Hopf differential and, second, that the minimiser is a homeomorphism. When these restrictions are met (as they often can be) we show uniqueness is assured. These results are established through a generalisation of the classical Reich-Strebel inequalities to this variational setting.

preprint, 2022, arXiv

Revisiting Document Image Dewarping by Grid Regularization

This paper addresses the problem of document image dewarping, which aims at eliminating the geometric distortion in document images for document digitization. Instead of designing a better neural network to approximate the optical flow fields between the inputs and outputs, we pursue the best readability by taking the text lines and the document boundaries into account from a constrained optimization perspective. Specifically, our proposed method first learns the boundary points and the pixels in the text lines. It then introduces a novel grid regularization scheme based on a simple observation: the boundaries and text lines in both the horizontal and vertical directions should be preserved after dewarping. To obtain the final forward mapping for dewarping, we solve an optimization problem with our proposed grid regularization. The experiments comprehensively demonstrate that our proposed approach outperforms prior art by large margins in terms of readability (measured by Character Error Rate and Edit Distance) while maintaining the best image quality on the publicly available DocUNet benchmark.
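The two readability metrics named in this abstract can be sketched in a few lines of Python. The code below computes the Levenshtein edit distance with the standard dynamic-programming recurrence and derives a character error rate from it. The function names are illustrative, and the benchmark's actual protocol (OCR engine, text normalisation) is not described here, so this is only an assumed, simplified version of the metrics.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between a reference and a hypothesis string."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # delete r from ref
                            curr[j - 1] + 1,       # insert h into ref
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalised by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical OCR output with one wrong character out of eight.
cer_example = character_error_rate("document", "docunent")
```

Lower values of both metrics mean that text recognised from the dewarped image is closer to the ground-truth transcription.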

preprint, 2022, arXiv

Vision-Language Pre-Training for Boosting Scene Text Detectors

Recently, vision-language joint representation learning has proven to be highly effective in various scenarios. In this paper, we specifically adapt vision-language joint learning for scene text detection, a task that intrinsically involves cross-modal interaction between the two modalities: vision and language, since text is the written form of language. Concretely, we propose to learn contextualized, joint representations through vision-language pre-training, for the sake of enhancing the performance of scene text detectors. Towards this end, we devise a pre-training architecture with an image encoder, a text encoder and a cross-modal encoder, as well as three pretext tasks: image-text contrastive learning (ITC), masked language modeling (MLM) and word-in-image prediction (WIP). The pre-trained model is able to produce more informative representations with richer semantics, which could readily benefit existing scene text detectors (such as EAST and PSENet) in the down-stream text detection task. Extensive experiments on standard benchmarks demonstrate that the proposed paradigm can significantly improve the performance of various representative text detectors, outperforming previous pre-training approaches. The code and pre-trained models will be publicly released.
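Of the three pretext tasks listed, image-text contrastive learning (ITC) is the easiest to illustrate in isolation. The sketch below implements a symmetric contrastive loss over a batch of paired embeddings in plain Python. It is a generic formulation, not the paper's implementation: the temperature value of 0.07 and the helper names are assumptions, and real systems would compute this on encoder outputs with a tensor library.

```python
import math


def _normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    # Cosine similarity matrix, sharpened by the (assumed) temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
           for i in img]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))


# Hypothetical 2-D embeddings: matched pairs versus swapped pairs.
aligned = itc_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = itc_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is near zero when each image embedding is closest to its own text embedding and large when the pairs are mismatched, which is the signal that pulls the two modalities into a joint representation space.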

preprint, 2020, arXiv

A New Perspective for Flexible Feature Gathering in Scene Text Recognition Via Character Anchor Pooling

Irregular scene text recognition has attracted much attention from the research community, mainly due to the complexity of the shapes of text in natural scenes. However, recent methods either rely on shape-sensitive modules such as bounding box regression, or discard sequence learning. To tackle these issues, we propose a pair of coupled modules, termed Character Anchoring Module (CAM) and Anchor Pooling Module (APM), to extract high-level semantics from two-dimensional space to form feature sequences. The proposed CAM localizes the text in a shape-insensitive way by anchoring characters individually. APM then interpolates and gathers features flexibly along the character anchors, which enables sequence learning. The complementary modules realize a harmonic unification of spatial information and sequence learning. With the proposed modules, our recognition system surpasses previous state-of-the-art scores on irregular and perspective text datasets, including ICDAR 2015, CUTE, and Total-Text, while paralleling state-of-the-art performance on regular text datasets.
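The idea of gathering features along character anchors can be sketched as sampling a 2D feature map at an ordered list of continuous anchor coordinates. The Python code below uses bilinear interpolation for the sampling step, which is an assumption on our part since the abstract only says that APM "interpolates and gathers" features. `bilinear_sample` and `pool_along_anchors` are hypothetical names.

```python
import math


def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a feature vector at continuous (x, y).

    fmap is indexed as fmap[row][col] -> list of channel values.
    Coordinates outside the map are clamped to its border.
    """
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    out = []
    for c in range(len(fmap[0][0])):
        top = (1 - dx) * fmap[y0][x0][c] + dx * fmap[y0][x1][c]
        bottom = (1 - dx) * fmap[y1][x0][c] + dx * fmap[y1][x1][c]
        out.append((1 - dy) * top + dy * bottom)
    return out


def pool_along_anchors(fmap, anchors):
    """Turn a 2D feature map into a feature sequence ordered by anchor."""
    return [bilinear_sample(fmap, x, y) for x, y in anchors]


# Hypothetical 2x2 single-channel feature map and three anchors.
fmap = [[[0.0], [1.0]],
        [[2.0], [3.0]]]
sequence = pool_along_anchors(fmap, [(0, 0), (0.5, 0.5), (1, 1)])
```

Because the anchors can follow any curve, the resulting sequence has the same form for straight, oriented, or curved text, so a sequence model can consume it without a rectification step.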

preprint, 2020, arXiv

Differentiable Feature Aggregation Search for Knowledge Distillation

Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network with the supervision of the output distribution and feature maps from a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods comes at the cost of substantial computational resources. To address both the efficiency and the effectiveness of knowledge distillation, we introduce feature aggregation to imitate multi-teacher distillation in the single-teacher distillation framework by extracting informative supervision from multiple teacher feature maps. Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search, to efficiently find the aggregations. In the first stage, DFA formulates the search problem as a bi-level optimization and leverages a novel bridge loss, which consists of a student-to-teacher path and a teacher-to-student path, to find appropriate feature aggregations. The two paths act as two players against each other, trying to optimize the unified architecture parameters in opposite directions while guaranteeing both expressivity and learnability of the feature aggregation simultaneously. In the second stage, DFA performs knowledge distillation with the derived feature aggregation. Experimental results show that DFA outperforms existing methods on the CIFAR-100 and CINIC-10 datasets under various teacher-student settings, verifying the effectiveness and robustness of the design.
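For readers unfamiliar with the baseline that this work builds on, the sketch below shows the classic output-distribution distillation term: a KL divergence between temperature-softened teacher and student distributions. It does not implement DFA's feature aggregation search or its bridge loss, which the abstract does not specify in enough detail. The temperature of 4.0 and the function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


# Hypothetical logits: a student that matches the teacher and one that does not.
loss_matched = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
loss_mismatched = distillation_loss([0.1, 1.0, 2.0], [2.0, 1.0, 0.1])
```

Feature-map distillation, as used in the paper, adds supervision on intermediate layers on top of a term of this kind.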

preprint, 2020, arXiv

Higher regularity and uniqueness for inner variational equations

We study local minima of the $p$-conformal energy functionals, \[ \mathsf{E}_{\mathcal{A}}^\ast(h):=\int_{\mathbb{D}} \mathcal{A}(\mathbb{K}(w,h)) \;J(w,h) \; dw,\quad h|_{\mathbb{S}}=h_0|_{\mathbb{S}}, \] defined for self mappings $h:\mathbb{D}\to\mathbb{D}$ of the unit disk with finite distortion and prescribed boundary values $h_0$. Here $\mathbb{K}(w,h) = \frac{\|Dh(w)\|^2}{J(w,h)}$ is the pointwise distortion functional, and $\mathcal{A}:[1,\infty)\to [1,\infty)$ is convex and increasing with $\mathcal{A}(t)\approx t^p$ for some $p\geq 1$, with additional minor technical conditions. Note $\mathcal{A}(t)=t$ is the Dirichlet energy functional. Critical points of $\mathsf{E}_{\mathcal{A}}^\ast$ satisfy the Ahlfors-Hopf inner-variational equation \[ \mathcal{A}'(\mathbb{K}(w,h))\, h_w\, \overline{h_{\overline{w}}} = \Phi \] where $\Phi$ is a holomorphic function. Iwaniec, Kovalev and Onninen established the Lipschitz regularity of critical points. Here we give a sufficient condition to ensure that a local minimum is a diffeomorphic solution to this equation, and that it is unique. This condition is necessarily satisfied by any locally quasiconformal critical point, and is basically the assumption $\mathbb{K}(w,h)\in L^1(\mathbb{D})\cap L^r_{loc}(\mathbb{D})$ for some $r>1$.

preprint, 2020, arXiv

On Vocabulary Reliance in Scene Text Recognition

The pursuit of high performance on public benchmarks has been the driving force for research in scene text recognition, and notable progress has been achieved. However, a close investigation reveals a startling fact that the state-of-the-art methods perform well on images with words within vocabulary but generalize poorly to images with words outside vocabulary. We call this phenomenon "vocabulary reliance". In this paper, we establish an analytical framework to conduct an in-depth study on the problem of vocabulary reliance in scene text recognition. Key findings include: (1) Vocabulary reliance is ubiquitous, i.e., all existing algorithms more or less exhibit such characteristic; (2) Attention-based decoders prove weak in generalizing to words outside vocabulary and segmentation-based decoders perform well in utilizing visual features; (3) Context modeling is highly coupled with the prediction layers. These findings provide new insights and can benefit future research in scene text recognition. Furthermore, we propose a simple yet effective mutual learning strategy to allow models of two families (attention-based and segmentation-based) to learn collaboratively. This remedy alleviates the problem of vocabulary reliance and improves the overall scene text recognition performance.
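The gap that this abstract calls vocabulary reliance can be measured by reporting word accuracy separately for test words inside and outside the training vocabulary. The Python sketch below shows that split. It is a simplified stand-in for the paper's analytical framework, which the abstract does not detail; the case-insensitive matching and the name `split_accuracy` are assumptions.

```python
def split_accuracy(predictions, ground_truths, vocabulary):
    """Word accuracy for in-vocabulary and out-of-vocabulary test words.

    Returns None for a split that contains no samples.
    """
    vocab = {w.lower() for w in vocabulary}
    hits = {"in_vocab": 0, "out_vocab": 0}
    counts = {"in_vocab": 0, "out_vocab": 0}
    for pred, gt in zip(predictions, ground_truths):
        key = "in_vocab" if gt.lower() in vocab else "out_vocab"
        counts[key] += 1
        hits[key] += int(pred.lower() == gt.lower())
    return {k: (hits[k] / counts[k] if counts[k] else None) for k in counts}


# Hypothetical recogniser output: perfect on seen words, weaker on unseen ones.
scores = split_accuracy(
    predictions=["hello", "world", "exit", "brmpf"],
    ground_truths=["hello", "world", "xqzt", "brmpf"],
    vocabulary=["hello", "world"],
)
```

A large difference between the two accuracies indicates that a recogniser leans on memorised words rather than on the visual evidence in the image.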

preprint, 2020, arXiv

Scene Text Detection and Recognition: The Deep Learning Era

With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area in computer vision, scene text detection and recognition has been inescapably influenced by this wave of revolution, consequently entering the era of deep learning. In recent years, the community has witnessed substantial advancements in mindset, approach and performance. This survey is aimed at summarizing and analyzing the major changes and significant progress of scene text detection and recognition in the deep learning era. Through this article, we aim to: (1) introduce new insights and ideas; (2) highlight recent techniques and benchmarks; (3) look ahead into future trends. Specifically, we will emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We expect that this review paper will serve as a reference for researchers in this field. Related resources are also collected and compiled in our GitHub repository: https://github.com/Jyouhou/SceneTextPapers.

preprint, 2020, arXiv

TextScanner: Reading Characters in Order for Robust Scene Text Recognition

Driven by deep learning and the large volume of data, scene text recognition has evolved rapidly in recent years. Formerly, RNN-attention based methods dominated this field, but they suffer from the problem of attention drift in certain situations. Lately, semantic segmentation based algorithms have proven effective at recognizing text of different forms (horizontal, oriented and curved). However, these methods may produce spurious characters or miss genuine characters, as they rely heavily on a thresholding procedure operated on segmentation maps. To tackle these challenges, we propose in this paper an alternative approach, called TextScanner, for scene text recognition. TextScanner bears three characteristics: (1) Basically, it belongs to the semantic segmentation family, as it generates pixel-wise, multi-channel segmentation maps for character class, position and order; (2) Meanwhile, akin to RNN-attention based methods, it also adopts RNN for context modeling; (3) Moreover, it performs parallel prediction for character position and class, and ensures that characters are transcribed in the correct order. The experiments on standard benchmark datasets demonstrate that TextScanner outperforms the state-of-the-art methods. Moreover, TextScanner shows its superiority in recognizing more difficult text such as Chinese transcripts and aligning with target characters.

preprint, 2020, arXiv

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Driven by deep neural networks and large scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with potentially variable radius and orientation. Such geometric attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.
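The disk-based representation described in this abstract can be illustrated by turning an ordered list of disks back into a text region. The Python sketch below rasterises the union of disks into a binary mask. It is a simplified illustration only: it ignores the per-disk orientation attribute, it is not the paper's reconstruction procedure, and the name `disks_to_mask` and the example coordinates are hypothetical.

```python
def disks_to_mask(disks, height, width):
    """Rasterise the union of disks (cx, cy, r) into a binary mask.

    The mask is indexed as mask[y][x]; pixels inside any disk are set to 1.
    """
    mask = [[0] * width for _ in range(height)]
    for cx, cy, r in disks:
        # Only scan the bounding box of each disk, clipped to the image.
        for y in range(max(0, int(cy - r)), min(height, int(cy + r) + 1)):
            for x in range(max(0, int(cx - r)), min(width, int(cx + r) + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= r ** 2:
                    mask[y][x] = 1
    return mask


# Hypothetical text instance: two overlapping disks along a horizontal axis.
text_mask = disks_to_mask([(2, 2, 1), (4, 2, 1)], height=5, width=7)
covered_pixels = sum(sum(row) for row in text_mask)
```

Because the disk centres can follow any curve and the radii can vary, the same representation covers horizontal, oriented, and curved text without changing the data structure.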

preprint, 2020, arXiv

The $L^p$ Teichmüller theory: Existence and regularity of critical points

We study minimisers of the $p$-conformal energy functionals, \[ \mathsf{E}_p(f):=\int_{\mathbb{D}} \mathbb{K}^p(z,f)\,dz,\quad f|_{\mathbb{S}}=f_0|_{\mathbb{S}}, \] defined for self mappings $f:\mathbb{D}\to\mathbb{D}$ with finite distortion and prescribed boundary values $f_0$. Here \[ \mathbb{K}(z,f) = \frac{\|Df(z)\|^2}{J(z,f)} = \frac{1+|\mu_f(z)|^2}{1-|\mu_f(z)|^2}\] is the pointwise distortion functional and $\mu_f(z)$ is the Beltrami coefficient of $f$. We show that for quasisymmetric boundary data the limiting regime $p\to\infty$ recovers the classical Teichmüller theory of extremal quasiconformal mappings (in part a result of Ahlfors), and the limiting regime $p\to 1$ recovers the harmonic mapping theory. Critical points of $\mathsf{E}_p$ always satisfy the inner-variational distributional equation \[ 2p\int_{\mathbb{D}} \mathbb{K}^p\;\frac{\overline{\mu_f}}{1+|\mu_f|^2}\,\varphi_{\overline{z}} \; dz=\int_{\mathbb{D}} \mathbb{K}^p \; \varphi_z\; dz,\quad\forall\varphi\in C_0^\infty(\mathbb{D}). \] We establish the existence of minimisers in the a priori regularity class $W^{1,\frac{2p}{p+1}}(\mathbb{D})$ and show these minimisers have a pseudo-inverse $h$: a continuous $W^{1,2}(\mathbb{D})$ surjection of $\mathbb{D}$ with $(h\circ f)(z)=z$ almost everywhere. We then give a sufficient condition to ensure $C^{\infty}(\mathbb{D})$ smoothness of solutions to the distributional equation. For instance, $\mathbb{K}(z,f)\in L^r_{loc}(\mathbb{D})$ for any $r>p+1$ is enough to imply the solutions to the distributional equation are local diffeomorphisms. Further, $\mathbb{K}(w,h)\in L^1(\mathbb{D})$ will imply $h$ is a homeomorphism, and together these results yield a diffeomorphic minimiser. We show such higher regularity assumptions to be necessary for critical points of the inner variational equation.
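The second equality in the definition of the distortion functional is a one-line computation. The block below spells it out, assuming the normalisation $\|Df\|^2=|f_z|^2+|f_{\overline{z}}|^2$ and $J(z,f)=|f_z|^2-|f_{\overline{z}}|^2$ (consistent with the formula for $\mathbb{K}(z,f)$ in the first entry of this list) and $\mu_f=f_{\overline{z}}/f_z$. It also records, as a side remark not made in the abstract, how $\mathbb{K}$ relates to the classical linear distortion $K=(1+|\mu_f|)/(1-|\mu_f|)$.

```latex
\[
\mathbb{K}(z,f)
  = \frac{|f_z|^2+|f_{\overline{z}}|^2}{|f_z|^2-|f_{\overline{z}}|^2}
  = \frac{1+|\mu_f|^2}{1-|\mu_f|^2},
  \qquad \mu_f = \frac{f_{\overline{z}}}{f_z},
\]
\[
K = \frac{1+|\mu_f|}{1-|\mu_f|}
\;\Longrightarrow\;
\frac12\Big(K+\frac1K\Big)
  = \frac{(1+|\mu_f|)^2+(1-|\mu_f|)^2}{2\,(1-|\mu_f|^2)}
  = \frac{1+|\mu_f|^2}{1-|\mu_f|^2}
  = \mathbb{K}(z,f).
\]
```

The first line follows by dividing numerator and denominator by $|f_z|^2$. The second shows that $\mathbb{K}$ is an increasing function of the linear distortion $K\geq 1$, so controlling $\mathbb{K}$ in $L^p$ controls an averaged form of the quasiconformal distortion.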

preprint, 2020, arXiv

UnrealText: Synthesizing Realistic Scene Text Images from the Unreal World

Synthetic data has been a critical tool for training scene text detection and recognition models. On the one hand, synthetic word images have proven to be a successful substitute for real images in training scene text recognizers. On the other hand, however, scene text detectors still heavily rely on a large amount of manually annotated real-world images, which are expensive. In this paper, we introduce UnrealText, an efficient image synthesis method that renders realistic images via a 3D graphics engine. The 3D synthetic engine provides realistic appearance by rendering scene and text as a whole, and allows for better text region proposals with access to precise scene information, e.g. surface normals and even object meshes. The comprehensive experiments verify its effectiveness on both scene text detection and recognition. We also generate a multilingual version for future research into multilingual scene text detection and recognition. Additionally, we re-annotate scene text recognition datasets in a case-sensitive way and include punctuation marks for more comprehensive evaluations. The code and the generated datasets are released at: https://github.com/Jyouhou/UnrealText/ .