Source author record

Zhenyang Li

Zhenyang Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Computer Vision, math.DS, Computation and Language, Artificial Intelligence, math-ph, math.DG, math.MP, Robotics

Catalog footprint

What is connected

15 works

8 topics

4 close collaborators

preprint 2026 arXiv

ConsistNav: Closing the Action Consistency Gap in Zero-Shot Object Navigation with Semantic Executive Control

Zero-shot object navigation has advanced rapidly with open-vocabulary detectors, image-text models, and language-guided exploration. However, even after current methods detect a plausible target hypothesis, the agent may still oscillate between exploration and pursuit, or abandon the object near success. We identify this failure mode as an action consistency gap: semantic evidence is repeatedly reinterpreted at each step without persistent commitment across the episode. We introduce ConsistNav, a training-free zero-shot ObjectNav framework built around a semantic executive composed of three coordinated modules: Finite-State Executive Controller stages target pursuit through guarded semantic phases; Persistent Candidate Memory accumulates cross-frame target evidence into stable object hypotheses; and Stability-Aware Action Control suppresses rotational stagnation, ineffective pursuit, and unverified stopping. This design changes neither the detector nor the low-level planner; instead, it controls when semantic evidence should influence navigation and when it should be suppressed or revisited. We conduct extensive experiments on HM3D and MP3D, where ConsistNav achieves state-of-the-art results among compared zero-shot ObjectNav methods and improves SR by 11.4% and SPL by 7.9% over the controlled baseline on MP3D. Ablation studies and real-world deployment experiments further demonstrate the effectiveness and robustness of the proposed executive mechanism.

preprint 2026 arXiv

DiffER: Diffusion Entity-Relation Modeling for Reversal Curse in Diffusion Large Language Models

The "reversal curse" refers to the phenomenon where large language models (LLMs) exhibit predominantly unidirectional behavior when processing logically bidirectional relationships. Prior work attributed this to autoregressive training: predicting the next token inherently favors left-to-right information flow over genuine bidirectional knowledge associations. However, we observe that Diffusion LLMs (DLLMs), despite being trained bidirectionally, also suffer from the reversal curse. To investigate the root causes, we conduct systematic experiments on DLLMs and identify three key reasons: 1) entity fragmentation during training, 2) data asymmetry, and 3) missing entity relations. Motivated by the analysis of these reasons, we propose Diffusion Entity-Relation Modeling (DiffER), which addresses the reversal curse through entity-aware training and balanced data construction. Specifically, DiffER introduces whole-entity masking, which mitigates entity fragmentation by predicting complete entities in a single step. DiffER further employs distribution-symmetric and relation-enhanced data construction strategies to alleviate data asymmetry and missing relations. Extensive experiments demonstrate that DiffER effectively alleviates the reversal curse in Diffusion LLMs, offering new perspectives for future research.
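The whole-entity masking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the DiffER implementation: the function name `whole_entity_mask`, the `[MASK]` string, the span format, and the masking probability are all assumptions made for illustration. The only property it demonstrates is that every token of an entity is masked together or left intact together, so an entity is never split between visible and hidden tokens.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, chosen for illustration


def whole_entity_mask(tokens, entity_spans, mask_prob, rng):
    """Mask tokens so that each entity span is masked entirely or not at all.

    entity_spans is a list of (start, end) index pairs with end exclusive.
    Tokens outside any span are masked independently with the same probability.
    """
    masked = list(tokens)
    in_entity = set()
    for start, end in entity_spans:
        in_entity.update(range(start, end))
        # One random draw per entity: the whole span shares the outcome.
        if rng.random() < mask_prob:
            for i in range(start, end):
                masked[i] = MASK
    for i in range(len(tokens)):
        # Non-entity tokens get an independent draw each.
        if i not in in_entity and rng.random() < mask_prob:
            masked[i] = MASK
    return masked


tokens = ["Tom", "Cruise", "is", "the", "son", "of", "Mary", "Lee", "Pfeiffer"]
spans = [(0, 2), (6, 9)]  # the two person names
print(whole_entity_mask(tokens, spans, mask_prob=0.5, rng=random.Random(0)))
```

Drawing once per entity rather than once per token is the design choice that matters here; how entity spans are detected and how the masking rate is scheduled are left open.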

preprint 2026 arXiv

ES-Mem: Event Segmentation-Based Memory for Long-Term Dialogue Agents

Memory is critical for dialogue agents to maintain coherence and enable continuous adaptation in long-term interactions. While existing memory mechanisms offer basic storage and retrieval capabilities, they are hindered by two primary limitations: (1) rigid memory granularity often disrupts semantic integrity, resulting in fragmented and incoherent memory units; (2) prevalent flat retrieval paradigms rely solely on surface-level semantic similarity, neglecting the structural cues of discourse required to navigate and locate specific episodic contexts. To mitigate these limitations, drawing inspiration from Event Segmentation Theory, we propose ES-Mem, a framework incorporating two core components: (1) a dynamic event segmentation module that partitions long-term interactions into semantically coherent events with distinct boundaries; (2) a hierarchical memory architecture that constructs multi-layered memories and leverages boundary semantics to anchor specific episodic memory for precise context localization. Evaluations on two memory benchmarks demonstrate that ES-Mem yields consistent performance gains over baseline methods. Furthermore, the proposed event segmentation module exhibits robust applicability on dialogue segmentation datasets.

preprint 2022 arXiv

Alignment-guided Temporal Attention for Video Action Recognition

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more efficient in computation, the latter often obtains better performance. In this paper, we attribute this to a dilemma between the sufficiency and the efficiency of interactions among various positions in different frames. These interactions affect the extraction of task-relevant information shared among frames. To resolve this issue, we prove that frame-by-frame alignments have the potential to increase the mutual information between frame representations, thereby including more task-relevant information to boost effectiveness. Then we propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames. It can act as a general plug-in for image backbones to conduct the action recognition task without any model-specific design. Extensive experiments on multiple benchmarks demonstrate the superiority and generality of our module.

preprint 2022 arXiv

Enhancing Multi-view Stereo with Contrastive Matching and Weighted Focal Loss

Learning-based multi-view stereo (MVS) methods have made impressive progress and surpassed traditional methods in recent years. However, their accuracy and completeness still leave room for improvement. In this paper, we propose a new method, inspired by contrastive learning and feature matching, to enhance the performance of existing networks. First, we propose a Contrast Matching Loss (CML), which treats the correct matching points in the depth dimension as positive samples and other points as negative samples, and computes the contrastive loss based on the similarity of features. We further propose a Weighted Focal Loss (WFL) for better classification capability, which weakens the contribution of low-confidence pixels in unimportant areas to the loss according to predicted confidence. Extensive experiments performed on the DTU, Tanks and Temples, and BlendedMVS datasets show that our method achieves state-of-the-art performance and significant improvement over the baseline network.

preprint 2022 arXiv

Factorized and Controllable Neural Re-Rendering of Outdoor Scene for Photo Extrapolation

Expanding an existing tourist photo from a partially captured scene to a full scene is one of the desired experiences for photography applications. Although photo extrapolation has been well studied, it is much more challenging to extrapolate a photo (i.e., a selfie) from a narrow field of view to a wider one while maintaining a similar visual style. In this paper, we propose a factorized neural re-rendering model to produce photorealistic novel views from cluttered outdoor Internet photo collections, which enables applications including controllable scene re-rendering, photo extrapolation and even extrapolated 3D photo generation. Specifically, we first develop a novel factorized re-rendering pipeline to handle the ambiguity in the decomposition of geometry, appearance and illumination. We also propose a composited training strategy to tackle the unexpected occlusion in Internet images. Moreover, to enhance photo-realism when extrapolating tourist photographs, we propose a novel realism augmentation process to complement appearance details, which automatically propagates the texture details from a narrow captured photo to the extrapolated neural rendered image. The experiments and photo editing examples on outdoor scenes demonstrate the superior performance of our proposed method in both photo-realism and downstream applications.

preprint 2020 arXiv

A Benchmark dataset for both underwater image enhancement and underwater object detection

Underwater image enhancement is an important vision task because of its significance in marine engineering and aquatic robotics. It usually works as a pre-processing step to improve the performance of high-level vision tasks such as underwater object detection. Even though many previous works show that underwater image enhancement algorithms can boost the detection accuracy of detectors, no work specifically focuses on investigating the relationship between these two tasks. This is mainly because existing underwater datasets lack either bounding box annotations or high-quality reference images, on which detection accuracy or image quality assessment metrics are calculated. To investigate how underwater image enhancement methods influence the subsequent underwater object detection task, in this paper we provide a large-scale underwater object detection dataset with both bounding box annotations and high-quality reference images, namely the OUC dataset. The OUC dataset provides a platform for researchers to comprehensively study the influence of underwater image enhancement algorithms on the underwater object detection task.

preprint 2016 arXiv

Online Action Detection

In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 hours of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.

preprint 2016 arXiv

Singular SRB measures for a non 1--1 map of the unit square

We consider a map of the unit square which is not 1--1, such as the memory map studied in \cite{MwM1}. Memory maps are defined as follows: $x_{n+1}=M_α(x_{n-1},x_{n})=τ(α\cdot x_{n}+(1-α)\cdot x_{n-1}),$ where $τ$ is a one-dimensional map on $I=[0,1]$ and $0<α<1$ determines how much memory is being used. In this paper we let $τ$ be the symmetric tent map. To study the dynamics of $M_α$, we consider the two-dimensional map $$ G_{α}:[x_{n-1},x_{n}]\mapsto [x_{n},τ(α\cdot x_{n}+(1-α)\cdot x_{n-1})]\, .$$ The map $G_α$ for $α\in(0,3/4]$ was studied in \cite{MwM1}. In this paper we prove that for $α\in(3/4,1)$ the map $G_α$ admits a singular Sinai-Ruelle-Bowen measure. We do this by applying Rychlik's results for the Lozi map. However, unlike the Lozi map, the maps $G_α$ are not invertible, which creates complications that we are able to overcome.

preprint 2016 arXiv

Statistical and Deterministic Dynamics of Maps with Memory

We consider a dynamical system to have memory if it remembers the current state as well as the state before that. The dynamics is defined as follows: $x_{n+1}=T_α(x_{n-1},x_{n})=τ(α\cdot x_{n}+(1-α)\cdot x_{n-1}),$ where $τ$ is a one-dimensional map on $I=[0,1]$ and $0<α<1$ determines how much memory is being used. $T_α$ does not define a dynamical system since it maps $U=I\times I$ into $I$. In this note we let $τ$ be the symmetric tent map. We shall prove that for $0<α<0.46,$ the orbits of $\{x_{n}\}$ are described statistically by an absolutely continuous invariant measure (acim) in two dimensions. As $α$ approaches $0.5$ from below, that is, as we approach a balance between the memory state and the present state, the supports of the acims become thinner until at $α=0.5$, all points have period 3 or eventually possess period 3. For $0.5<α<0.75$, we have a global attractor: for all starting points in $U$ except $(0,0)$, the orbits are attracted to the fixed point $(2/3,2/3).$ At $α=0.75,$ we have slightly more complicated periodic behavior.
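The recurrence in this abstract is easy to iterate numerically. The sketch below is an illustration only, not code from the paper: it assumes the usual symmetric tent map $τ(x)=2x$ for $x\le 1/2$ and $τ(x)=2(1-x)$ otherwise, and the function names, the starting point $(0.1, 0.3)$, the value $α=0.6$ and the step count are arbitrary choices. With $α$ in the range $0.5<α<0.75$, the abstract's claim is that the orbit is attracted to $2/3$.

```python
def tent(x):
    """Symmetric tent map on [0, 1] (assumed form: slope 2, then slope -2)."""
    return 2.0 * x if x <= 0.5 else 2.0 * (1.0 - x)


def memory_orbit(alpha, x_prev, x_curr, steps):
    """Iterate x_{n+1} = tent(alpha * x_n + (1 - alpha) * x_{n-1})."""
    orbit = [x_prev, x_curr]
    for _ in range(steps):
        x_next = tent(alpha * x_curr + (1.0 - alpha) * x_prev)
        # Shift the two-step window: the current state becomes the memory state.
        x_prev, x_curr = x_curr, x_next
        orbit.append(x_next)
    return orbit


# alpha = 0.6 lies in the regime 0.5 < alpha < 0.75 described in the abstract.
orbit = memory_orbit(alpha=0.6, x_prev=0.1, x_curr=0.3, steps=2000)
print(orbit[-1], abs(orbit[-1] - 2.0 / 3.0))
```

Changing `alpha` to a value below 0.46 or above 0.75 is a quick way to explore the other regimes the abstracts describe, though a single floating-point orbit is only suggestive and proves nothing about invariant measures.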

preprint 2016 arXiv

VideoLSTM Convolves, Attends and Flows for Action Recognition

We present a new architecture for end-to-end sequence learning of actions in video, which we call VideoLSTM. Rather than adapting the video to the peculiarities of established recurrent or convolutional architectures, we adapt the architecture to fit the requirements of the video medium. Starting from the soft-Attention LSTM, VideoLSTM makes three novel contributions. First, video has a spatial layout. To exploit the spatial correlation we hardwire convolutions in the soft-Attention LSTM architecture. Second, motion not only informs us about the action content, but also better guides the attention towards the relevant spatio-temporal locations. We introduce motion-based attention. And finally, we demonstrate how the attention from VideoLSTM can be used for action localization by relying on just the action class label. Experiments and comparisons on challenging datasets for action classification and localization support our claims.

preprint 2014 arXiv

Toward A Mathematical Holographic Principle

In work started in [17] and continued in this paper, our objective is to study selectors of multivalued functions which have interesting dynamical properties, such as possessing absolutely continuous invariant measures. We specify the graph of a multivalued function by means of lower and upper boundary maps $τ_{1}$ and $τ_{2}.$ On these boundary maps we define a position dependent random map $R_{p}=\{τ_{1},τ_{2};p,1-p\},$ which, at each time step, moves the point $x$ to $τ_{1}(x)$ with probability $p(x)$ and to $τ_{2}(x)$ with probability $1-p(x)$. Under general conditions, for each choice of $p$, $R_{p}$ possesses an absolutely continuous invariant measure with invariant density $f_{p}.$ Let $\boldsymbol{τ}$ be a selector which has invariant density function $f.$ One of our objectives is to study conditions under which $p(x)$ exists such that $R_{p}$ has $f$ as its invariant density function. When this is the case, the long term statistical dynamical behavior of a selector can be represented by the long term statistical behavior of a random map on the boundaries of $G.$ We refer to such a result as a mathematical holographic principle. We present examples and study the relationship between the invariant densities attainable by classes of selectors and the random maps based on the boundaries and show that, under certain conditions, the extreme points of the invariant densities for selectors are achieved by bang-bang random maps, that is, random maps for which $p(x)\in \{0,1\}.$
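The condition "$R_{p}$ has $f$ as its invariant density" can be written out using the transfer operator usually associated with position dependent random maps. The formulation below is a sketch under that standard convention and is not quoted from the paper: $P_{τ_i}$ denotes the Frobenius-Perron operator of the boundary map $τ_i$, and $P_{R_p}$ is notation introduced here for illustration.

```latex
% Transfer operator of the position dependent random map R_p (standard convention, assumed here)
P_{R_p} g \;=\; P_{\tau_1}\bigl(p\, g\bigr) \;+\; P_{\tau_2}\bigl((1-p)\, g\bigr)

% Invariant density f_p of R_p
P_{R_p} f_p \;=\; f_p

% Holographic question: given the invariant density f of a selector,
% find a probability function p : I \to [0,1] such that
P_{\tau_1}\bigl(p\, f\bigr) \;+\; P_{\tau_2}\bigl((1-p)\, f\bigr) \;=\; f
```

In this form the bang-bang case $p(x)\in\{0,1\}$ corresponds to applying exactly one boundary map at each point $x$.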

preprint 2011 arXiv

$W$-like maps with various instabilities of acim's

This paper generalizes the results of [13] and then provides an interesting example. We construct a family of $W$-like maps $\{W_a\}$ with a turning fixed point having slope $s_1$ on one side and $-s_2$ on the other. Each $W_a$ has an absolutely continuous invariant measure $μ_a$. Depending on whether $\frac{1}{s_1}+\frac{1}{s_2}$ is larger than, equal to, or smaller than 1, we show that the limit of $μ_a$ is a singular measure, a combination of a singular and an absolutely continuous measure, or an absolutely continuous measure, respectively. It is known that the invariant density of a single piecewise expanding map has a positive lower bound on its support. In Section 4 we give an example showing that in general, for a family of piecewise expanding maps with slopes larger than 2 in modulus and converging to a piecewise expanding map, their invariant densities do not necessarily have a positive lower bound on the support.

preprint 2011 arXiv

Instability of Isolated Spectrum for W-shaped Maps

In this note we consider the $W$-shaped map $W_0=W_{s_1,s_2}$ with $\frac {1}{s_1}+\frac {1}{s_2}=1$ and show that eigenvalue 1 is not stable. We do this in a constructive way. For each perturbing map $W_a$ we show the existence of the "second" eigenvalue $λ_a$, such that $λ_a\to 1$, as $a\to 0$, which proves instability of the isolated spectrum of $W_0$. At the same time, the existence of second eigenvalues close to 1 causes the maps $W_a$ to behave in a metastable way. They have two almost invariant sets and the system spends long periods of consecutive iterations in each of them with infrequent jumps from one to the other.

preprint 2007 arXiv

Asymptotically Hyperbolic Metrics on Unit Ball Admitting Multiple Horizons

In this paper, we construct an asymptotically hyperbolic metric with scalar curvature -6 on the unit ball $\mathbf{D}^3$, which contains multiple horizons.
