Source author record

Xiaoyi Zhang

Xiaoyi Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

math.AP, Human-Computer Interaction, Computer Vision, Computation and Language, Information Theory, math.IT, Artificial Intelligence, math-ph, math.MP, math.OC, Multimedia, Performance, Systems and Control

Catalog footprint

What is connected

22 works

13 topics

4 close collaborators


preprint, 2026, arXiv

An Efficient Streaming Video Understanding Framework with Agentic Control

Streaming video requires handling dynamic information density under strict latency budgets. Yet, existing methods typically employ static strategies, such as fixed memory compression or reliance on a single model, forcing a trade-off: fast models fail on complex queries, while always-on heavy models violate real-time constraints and overcomplicate simple queries. Rather than fixing these decisions upfront, we propose R3-Streaming (Remember, Respond, Reason), which formulates streaming video understanding as a cascaded control problem: for each query, the system compresses memory, judges response readiness, and routes computation sequentially, so that each downstream decision builds on progressively refined information states. To optimize this pipeline, we introduce an age-aware forgetting policy for memory compression, as aggressively compressing historical frames can yield substantial performance gains. For compute routing, we propose TB-GRPO, a target-balanced reinforcement learning objective that routes hard queries to a stronger model while preventing mode collapse. Extensive evaluations demonstrate that R3-Streaming achieves state-of-the-art results among streaming MLLMs, reaching 57.92 on OVO-Bench and 76.36 on StreamingBench, while reducing visual token usage by 95 to 96 percent.
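The cascade described in this abstract (compress memory, judge response readiness, then route computation) can be illustrated with a minimal Python sketch. Everything below is a hypothetical stand-in rather than the paper's method: the `Frame` type, the `recent`, `stride`, and `old_tokens` parameters, the readiness rule, and the fixed difficulty threshold are all assumptions chosen for illustration. In particular, the paper trains its router with TB-GRPO, which this sketch replaces with a plain threshold.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Frame:
    index: int   # position in the stream (larger means more recent)
    tokens: int  # visual tokens this frame currently costs


def compress_memory(frames: List[Frame], recent: int = 4, stride: int = 4,
                    old_tokens: int = 1) -> List[Frame]:
    """Hypothetical age-aware forgetting.

    The newest `recent` frames are kept untouched. Among older frames, only
    those whose age is a multiple of `stride` survive, at a reduced budget.
    """
    if not frames:
        return []
    newest = frames[-1].index
    kept = []
    for f in frames:
        age = newest - f.index
        if age < recent:
            kept.append(f)
        elif age % stride == 0:
            kept.append(Frame(f.index, min(f.tokens, old_tokens)))
    return kept


def ready_to_respond(memory: List[Frame], needed_index: int) -> bool:
    """Assumed readiness rule: the stream has reached the frame the query needs."""
    return bool(memory) and memory[-1].index >= needed_index


def route(difficulty: float, threshold: float = 0.5) -> str:
    """Stand-in for a learned router: hard queries go to the stronger model."""
    return "strong" if difficulty >= threshold else "fast"


def handle_query(frames: List[Frame], needed_index: int,
                 difficulty: float) -> Dict[str, object]:
    """Run the three stages in order; each stage sees the previous stage's output."""
    memory = compress_memory(frames)                 # 1. Remember
    tokens = sum(f.tokens for f in memory)
    if not ready_to_respond(memory, needed_index):   # 2. Respond?
        return {"action": "wait", "tokens": tokens}
    return {"action": route(difficulty), "tokens": tokens}  # 3. Reason


if __name__ == "__main__":
    stream = [Frame(i, 16) for i in range(12)]
    print(handle_query(stream, needed_index=11, difficulty=0.9))
```

With twelve 16-token frames, this toy policy keeps the four newest frames in full and two older frames at one token each, so memory shrinks from 192 to 66 tokens before the readiness and routing decisions are made. The point of the ordering is that the later, more expensive decisions operate on the already compressed state.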

preprint, 2026, arXiv

InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages faces challenges. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seeds with reference design images to ensure diversity. Our system also generates verifiable task evaluators, enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of the proposed system.

preprint, 2023, arXiv

Screen Correspondence: Mapping Interchangeable Elements between UIs

Understanding user interface (UI) functionality is a useful yet challenging task for both machines and people. In this paper, we investigate a machine learning approach for screen correspondence, which allows reasoning about UIs by mapping their elements onto previously encountered examples with known functionality and properties. We describe and implement a model that incorporates element semantics, appearance, and text to support correspondence computation without requiring any labeled examples. Through a comprehensive performance evaluation, we show that our approach improves upon baselines by incorporating multi-modal properties of UIs. Finally, we show three example applications where screen correspondence facilitates better UI understanding for humans and machines: (i) instructional overlay generation, (ii) semantic UI element search, and (iii) automated interface testing.

preprint, 2022, arXiv

Extracting Replayable Interactions from Videos of Mobile App Usage

Screen recordings of mobile apps are a popular and readily available way for users to share how they interact with apps, such as in online tutorial videos, user reviews, or as attachments in bug reports. Unfortunately, both people and systems can find it difficult to reproduce touch-driven interactions from video pixel data alone. In this paper, we introduce an approach to extract and replay user interactions in videos of mobile apps, using only pixel information in video frames. To identify interactions, we apply heuristic-based image processing and convolutional deep learning to segment screen recordings, classify the interaction in each segment, and locate the interaction point. To replay interactions on another device, we match elements on app screens using UI element detection. We evaluate the feasibility of our pixel-based approach using two datasets: the Rico mobile app dataset and a new dataset of 64 apps with both iOS and Android versions. We find that our end-to-end approach can successfully replay a majority of interactions (iOS--84.1%, Android--78.4%) on different devices, which is a step towards supporting a variety of scenarios, including automatically annotating interactions in existing videos, automated UI testing, and creating interactive app tutorials.

preprint, 2022, arXiv

Reflow: Automatically Improving Touch Interactions in Mobile Applications through Pixel-based Refinements

Touch is the primary way that users interact with smartphones. However, building mobile user interfaces where touch interactions work well for all users is a difficult problem, because users have different abilities and preferences. We propose a system, Reflow, which automatically applies small, personalized UI adaptations, called refinements, to mobile app screens to improve touch efficiency. Reflow uses a pixel-based strategy to work with existing applications, and improves touch efficiency while minimally disrupting the design intent of the original application. Our system optimizes a UI by (i) extracting its layout from its screenshot, (ii) refining its layout, and (iii) re-rendering the UI to reflect these modifications. We conducted a user study with 10 participants and a heuristic evaluation with 6 experts and found that applications optimized by Reflow led to, on average, 9% faster selection time with minimal layout disruption. The results demonstrate that Reflow's refinements are useful UI adaptations for improving touch interactions.

preprint, 2022, arXiv

Understanding Mobile GUI: from Pixel-Words to Screen-Sentences

The ubiquity of mobile phones makes mobile GUI understanding an important task. Most previous works in this domain require human-created metadata of screens (e.g. View Hierarchy) during inference, which unfortunately is often not available or reliable enough for GUI understanding. Inspired by the impressive success of Transformers in NLP tasks, and targeting purely vision-based GUI understanding, we extend the concepts of Words/Sentence to Pixel-Words/Screen-Sentence, and propose a mobile GUI understanding architecture: Pixel-Words to Screen-Sentence (PW2SS). In analogy to the individual Words, we define the Pixel-Words as atomic visual components (text and graphic components), which are visually consistent and semantically clear across screenshots of a large variety of design styles. The Pixel-Words extracted from a screenshot are aggregated into a Screen-Sentence by a proposed Screen Transformer that models their relations. Since the Pixel-Words are defined as atomic visual components, the ambiguity between their visual appearance and semantics is dramatically reduced. We are able to make use of metadata available in training data to auto-generate high-quality annotations for Pixel-Words. A dataset, RICO-PW, of screenshots with Pixel-Words annotations is built based on the public RICO dataset, and will be released to help address the lack of high-quality training data in this area. We train a detector to extract Pixel-Words from screenshots on this dataset and achieve metadata-free GUI understanding during inference. We conduct experiments and show that Pixel-Words can be extracted well on RICO-PW and generalize well to a new dataset, P2S-UI, collected by ourselves. The effectiveness of PW2SS is further verified on GUI understanding tasks including relation prediction, clickability prediction, screen retrieval, and app type classification.

preprint, 2021, arXiv

Optimal defined contribution pension management with jump diffusions and common shock dependence

This work deals with an optimal asset allocation problem for a defined contribution (DC) pension plan during its accumulation phase. The contribution rate is proportional to the individual's salary, the dynamics of which follows a Heston stochastic volatility model with jumps, and there are common shocks between the salary and the volatility. Since the time horizon of pension management might be long, the influence of inflation is considered in the context. The inflation index is subjected to a Poisson jump and a Brownian uncertainty. The pension plan aims to reduce fluctuations of terminal wealth by investing the fund in a financial market consisting of a riskless asset and a risky asset. The dynamics of the risky asset is given by a jump diffusion process. The closed form of investment decision is derived by the dynamic programming approach.

preprint, 2021, arXiv

Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels

Many accessibility features available on mobile platforms require applications (apps) to provide complete and accurate metadata describing user interface (UI) components. Unfortunately, many apps do not provide sufficient metadata for accessibility features to work as expected. In this paper, we explore inferring accessibility metadata for mobile apps from their pixels, as the visual interfaces often best reflect an app's full functionality. We trained a robust, fast, memory-efficient, on-device model to detect UI elements using a dataset of 77,637 screens (from 4,068 iPhone apps) that we collected and annotated. To further improve UI detections and add semantic information, we introduced heuristics (e.g., UI grouping and ordering) and additional models (e.g., recognize UI content, state, interactivity). We built Screen Recognition to generate accessibility metadata to augment iOS VoiceOver. In a study with 9 screen reader users, we validated that our approach improves the accessibility of existing mobile apps, enabling even previously inaccessible apps to be used.

preprint, 2020, arXiv

A Game-Theoretic Approach to Decision Making for Multiple Vehicles at Roundabout

In this paper, we study the decision making of multiple autonomous vehicles at a roundabout. The behaviours of the vehicles depend on their aggressiveness, which indicates how much they value speed over safety. We propose a distributed decision-making process that balances safety and speed of the vehicles. In the proposed process, each vehicle estimates other vehicles' aggressiveness and formulates the interactions among the vehicles as a finite sequential game. Based on the Nash equilibrium of this game, the vehicle predicts other vehicles' behaviours and makes decisions. We perform numerical simulations to illustrate the effectiveness of the proposed process, both for safety (absence of collisions), and speed (time spent within the roundabout).

preprint, 2020, arXiv

Intermediate-Task Transfer Learning with Pretrained Models for Natural Language Understanding: When and Why Does It Work?

While pretrained models such as BERT have shown large gains across natural language understanding tasks, their performance can be improved by further training the model on a data-rich intermediate task, before fine-tuning it on a target task. However, it is still poorly understood when and why intermediate-task training is beneficial for a given target task. To investigate this, we perform a large-scale study on the pretrained RoBERTa model with 110 intermediate-target task combinations. We further evaluate all trained models with 25 probing tasks meant to reveal the specific skills that drive transfer. We observe that intermediate tasks requiring high-level inference and reasoning abilities tend to work best. We also observe that target task performance is strongly correlated with higher-level abilities such as coreference resolution. However, we fail to observe more granular correlations between probing and target task performance, highlighting the need for further work on broad-coverage probing benchmarks. We also observe evidence that the forgetting of knowledge learned during pretraining may limit our analysis, highlighting the need for further work on transfer learning methods in these settings.

preprint, 2016, arXiv

Finite-dimensional approximation and non-squeezing for the cubic nonlinear Schrödinger equation on $\R^2$

We prove symplectic non-squeezing (in the sense of Gromov) for the cubic nonlinear Schrödinger equation on $\R^2$. This is the first symplectic non-squeezing result for a Hamiltonian PDE in infinite volume. As the underlying symplectic Hilbert space is $L^2(\R^2)$, this requires working with initial data in this space. This space also happens to be scaling-critical for this equation. Thus, we also obtain the first unconditional symplectic non-squeezing result in such a critical setting. More generally, we show that solutions of this PDE can be approximated by a finite-dimensional Hamiltonian system, despite the wealth of non-compact symmetries: scaling, translation, and Galilei boosts. This approximation result holds uniformly on bounded sets of initial data. Complementing this approximation result, we show that all solutions of the finite-dimensional Hamiltonian system can be approximated by the full PDE. A key ingredient in these proofs is the development of a general methodology for obtaining uniform global space-time bounds for suitable Fourier truncations of dispersive PDE models.

preprint, 2016, arXiv

Symplectic non-squeezing for the cubic NLS on the line

We prove symplectic non-squeezing for the cubic nonlinear Schrödinger equation on the line via finite-dimensional approximation.

preprint, 2015, arXiv

The focusing cubic NLS on exterior domains in three dimensions

We consider the focusing cubic NLS in the exterior $\Omega$ of a smooth, compact, strictly convex obstacle in three dimensions. We prove that the threshold for global existence and scattering is the same as for the problem posed on Euclidean space. Specifically, we prove that if $E(u_0)M(u_0)<E(Q)M(Q)$ and $\|\nabla u_0\|_2\|u_0\|_2<\|\nabla Q\|_2\|Q\|_2$, the corresponding solution to the initial-value problem with Dirichlet boundary conditions exists globally and scatters to linear evolutions asymptotically in the future and in the past. Here, $Q(x)$ denotes the ground state for the focusing cubic NLS in $\mathbb{R}^3$.
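For readers unfamiliar with the notation, the quantities in this threshold condition can be written out as follows. This is a sketch under one common normalization of the equation, which is an assumption here; the paper itself may scale the equation, the mass $M$, and the energy $E$ differently.

```latex
% Focusing cubic NLS in the exterior domain with Dirichlet boundary conditions
i\partial_t u + \Delta u = -|u|^2 u, \qquad u|_{\partial\Omega} = 0, \qquad u(0) = u_0.

% Conserved mass and energy (assumed normalization)
M(u) = \int_\Omega |u(x)|^2 \, dx, \qquad
E(u) = \int_\Omega \Big( \tfrac12 |\nabla u(x)|^2 - \tfrac14 |u(x)|^4 \Big) \, dx.

% Ground state on \mathbb{R}^3: the positive, radial, decaying solution of
\Delta Q - Q + Q^3 = 0.

% Threshold condition stated in the abstract
E(u_0)\,M(u_0) < E(Q)\,M(Q), \qquad
\|\nabla u_0\|_2 \, \|u_0\|_2 < \|\nabla Q\|_2 \, \|Q\|_2.
```

Under this normalization, $u(t,x) = e^{it}Q(x)$ solves the equation on $\mathbb{R}^3$ without scattering, which is why $Q$ marks the boundary of the scattering regime.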

preprint, 2014, arXiv

Riesz transforms outside a convex obstacle

The goal of this paper is to develop some basic harmonic analysis tools for the Dirichlet Laplacian in the exterior domain associated to a smooth convex obstacle in dimensions $d\geq 3$. Specifically, we will discuss analogues of the Mikhlin Multiplier Theorem, Littlewood-Paley Theory, and Hardy inequalities, culminating in a proof that homogeneous Sobolev norms defined with respect to the Dirichlet and whole-space Laplacians are equivalent for the sharp ranges of integrability exponent $p$ and regularity $s$. Counterexamples are included to show that these results are indeed sharp. In particular, we precisely settle the question of boundedness of Riesz transforms on $L^p$, including the endpoint. The utility of such results in the study of nonlinear PDE is that they allow us to deduce important results, such as the fractional product and chain rules for the Dirichlet Laplacian, directly from the classical Euclidean setting. As an application, we discuss the local well-posedness and stability problems for energy-critical NLS. All the results of this paper play an essential role in the authors' proof of large-data global well-posedness and scattering for the energy-critical NLS in three-dimensional exterior domains; see arXiv:1208.4904.

preprint, 2012, arXiv

A New Ensemble of Rate-Compatible LDPC Codes

In this paper, we present three approaches to improve the design of Kite codes (newly proposed rateless codes), resulting in an ensemble of rate-compatible LDPC codes with code rates varying "continuously" from 0.1 to 0.9 for additive white Gaussian noise (AWGN) channels. The new ensemble of rate-compatible LDPC codes can be constructed conveniently with an empirical formula. Simulation results show that, when applied to an incremental redundancy hybrid automatic repeat request (IR-HARQ) system, the constructed codes (with higher-order modulation) perform well over a wide range of signal-to-noise ratios (SNRs).

preprint, 2012, arXiv

Quintic NLS in the exterior of a strictly convex obstacle

We consider the defocusing energy-critical nonlinear Schrödinger equation in the exterior of a smooth compact strictly convex obstacle in three dimensions. For the initial-value problem with Dirichlet boundary condition we prove global well-posedness and scattering for all initial data in the energy space.

preprint, 2012, arXiv

Remarks of Global Wellposedness of Liquid Crystal Flows and Heat Flows of Harmonic Maps in Two Dimensions

We consider the Cauchy problem for the two-dimensional incompressible liquid crystal equation and the heat flow of harmonic maps equation. Under a natural geometric angle condition, we give a new proof of the global well-posedness of smooth solutions for a class of large initial data in the energy space. This result was originally obtained by Ding-Lin and by Lin-Lin-Wang. Our main technical tool is a rigidity theorem which gives the coercivity of the harmonic energy under a certain angle condition. Our proof is based on a frequency localization argument combined with the concentration-compactness approach, which may be of independent interest.

preprint, 2011, arXiv

Global well-posedness and scattering for defocusing energy-critical NLS in the exterior of balls with radial data

We consider the defocusing energy-critical NLS in the exterior of the unit ball in three dimensions. For the initial value problem with Dirichlet boundary condition we prove global well-posedness and scattering with large radial initial data in the Sobolev space $\dot H_0^1$. We also point out that the same strategy can be used to treat the energy-supercritical NLS in the exterior of balls with Dirichlet boundary condition and radial $\dot H_0^1$ initial data.

preprint, 2011, arXiv

Serial Concatenation of RS Codes with Kite Codes: Performance Analysis, Iterative Decoding and Design

In this paper, we propose a new ensemble of rateless forward error correction (FEC) codes. The proposed codes are serially concatenated codes with Reed-Solomon (RS) codes as outer codes and Kite codes as inner codes. The inner Kite codes are a special class of prefix rateless low-density parity-check (PRLDPC) codes, which can generate potentially infinite (or as many as required) random-like parity-check bits. The use of RS codes as outer codes not only lowers error floors but also ensures (with high probability) the correctness of successfully decoded codewords. In addition to the conventional two-stage decoding, iterative decoding between the inner code and the outer code is also implemented to improve the performance further. The performance of the Kite codes under maximum likelihood (ML) decoding is analyzed by applying a refined Divsalar bound to the ensemble weight enumerating functions (WEF). We propose a simulation-based optimization method as well as density evolution (DE) using Gaussian approximations (GA) to design the Kite codes. Numerical results along with semi-analytic bounds show that the proposed codes can approach Shannon limits with extremely low error floors. It is also shown by simulation that the proposed codes perform well over a wide range of signal-to-noise ratios (SNRs).

preprint, 2011, arXiv

Smooth global solutions for the two dimensional Euler Poisson system

The Euler-Poisson system is a fundamental two-fluid model describing the dynamics of a plasma consisting of compressible electrons and a uniform ion background. By using the dispersive Klein-Gordon effect, Guo first constructed a global smooth irrotational solution in the three-dimensional case. It has been conjectured that the same result should hold in the two-dimensional case. The main difficulty in 2D comes from the slow dispersion of the linear flow and certain nonlocal resonant obstructions in the nonlinearity. In this paper we develop a new method to overcome these difficulties and construct smooth global solutions for the 2D Euler-Poisson system.

preprint, 2010, arXiv

Energy-critical NLS with quadratic potentials

We consider the defocusing $\dot H^1$-critical nonlinear Schrödinger equation in all dimensions ($n\geq 3$) with a quadratic potential $V(x)=\pm \tfrac12 |x|^2$. We show global well-posedness for radial initial data obeying $\nabla u_0(x), xu_0(x) \in L^2$. In view of the potential $V$, this is the natural energy space. In the repulsive case, we also prove scattering. We follow the approach pioneered by Bourgain and Tao in the case of no potential; indeed, we include a proof of their results that incorporates a couple of simplifications discovered while treating the problem with quadratic potential.

preprint, 2010, arXiv

Stability and Unconditional Uniqueness of Solutions for Energy Critical Wave Equations in High Dimensions

In this paper we establish a complete local theory for the energy-critical nonlinear wave equation (NLW) in high dimensions ${\mathbb R} \times {\mathbb R}^d$ with $d \geq 6$. We prove the stability of solutions under the weak condition that the perturbation of the linear flow is small in certain space-time norms. As a by-product of our stability analysis, we also prove local well-posedness of solutions for which we only assume the smallness of the linear evolution. These results provide essential technical tools that can be applied towards obtaining the extension to high dimensions of the analysis of Kenig and Merle of the dynamics of the focusing NLW below the energy threshold. By employing refined paraproduct estimates we also prove unconditional uniqueness of solutions for $d\ge 5$ in the natural energy class. This extends an earlier result by Furioli, Planchon and Terraneo in dimension $d=4$.
