Source author record

Zhuang Li

Zhuang Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Artificial Intelligence, Computer Vision, hep-ph, Computation and Language, hep-ex, Machine Learning, Multimedia

Catalog footprint

What is connected

10 works

7 topics

4 close collaborators

Preprint, 2026, arXiv

EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

Earth Observation (EO) analysis is inherently interactive: resolving uncertainty often requires expanding the region of interest, retrieving historical observations, and switching across sensors such as optical and Synthetic Aperture Radar. However, most EO benchmarks collapse this process into fixed-input, single-turn tasks. To address this gap, we present EO-Gym, a controlled executable framework for multimodal, tool-using EO agents that formulates EO analysis as a Gymnasium-style local geospatial workspace backed by more than 660k multimodal files indexed by location, time, and sensor type, with 35 EO-specialized tools spanning six task families. Built on this environment, we construct EO-Gym-Data, a benchmark of 9,078 trajectories and 34,604 reasoning steps, grounded in eight public EO datasets together with Landsat and Sentinel-2 imagery. Evaluating 10 open and closed VLMs shows that strong general-purpose models still struggle with interactive EO reasoning, especially on temporal and cross-modal workflows. As a reference baseline, EO-Gym-4B, obtained by fine-tuning Qwen3-VL-4B-Instruct on EO-Gym-Data, improves overall Pass@3 from 0.49 to 0.74 under the main evaluation setting. EO-Gym provides a reproducible environment for interactive EO agents, operationalizing EO as an evidence-gathering problem that requires planning across geospatial, temporal, and sensing-modality dimensions.
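The abstract reports Pass@3 but does not define it. A minimal sketch, assuming the common reading that a task counts as passed when at least one of its first k attempts succeeds; the function name `pass_at_k` and the task outcomes below are hypothetical and are not EO-Gym data or the authors' evaluation code.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("no tasks to score")
    # A task passes if any of its first k recorded outcomes is True.
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical outcomes for four tasks, three attempts each (illustration only).
demo = {
    "task_a": [False, True, False],
    "task_b": [False, False, False],
    "task_c": [True, True, True],
    "task_d": [False, False, True],
}

print(pass_at_k(demo, k=1))  # 0.25
print(pass_at_k(demo, k=3))  # 0.75
```

Under this reading, raising k can only keep the score the same or increase it, so Pass@3 figures are not directly comparable with single-attempt accuracy.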

Preprint, 2022, arXiv

Contrastive Learning of Semantic and Visual Representations for Text Tracking

Semantic representation is of great benefit to the video text tracking (VTT) task, which requires simultaneously classifying, detecting, and tracking text in a video. Most existing approaches tackle this task through appearance similarity in consecutive frames while ignoring the abundant semantic features. In this paper, we explore robust video text tracking with contrastive learning of semantic and visual representations. Correspondingly, we present an end-to-end video text tracker with Semantic and Visual Representations (SVRep), which detects and tracks texts by exploiting the visual and semantic relationships between different texts in a video sequence. In addition, with a lightweight architecture, SVRep achieves state-of-the-art performance while maintaining competitive inference speed. Specifically, with a ResNet-18 backbone, SVRep achieves an $\mathrm{ID_{F1}}$ of 65.9%, running at 16.7 FPS, on the ICDAR2015 (video) dataset, an 8.6% improvement over the previous state-of-the-art methods.
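The abstract does not state which contrastive objective SVRep uses. As a generic illustration only, the sketch below computes an InfoNCE-style loss for one anchor embedding: the loss is low when the anchor is closer to its positive (for example, the same text instance in another frame) than to the negatives. The function names, the temperature value, and the toy vectors are all assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def info_nce(
    anchor: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE loss for one anchor: -log(pos / (pos + sum of negatives))."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings (hypothetical): the positive is near the anchor.
anchor = [1.0, 0.0]
same_text = [0.9, 0.1]
other_text = [0.0, 1.0]

aligned = info_nce(anchor, same_text, [other_text])
swapped = info_nce(anchor, other_text, [same_text])
print(aligned < swapped)  # True
```

Minimizing this quantity over many anchors pulls embeddings of the same instance together and pushes different instances apart, which is the general mechanism a contrastive tracker relies on for association.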

Preprint, 2022, arXiv

Explanation of electron and muon $g-2$ anomalies in AMSB

We propose to jointly explain the electron and muon $g-2$ anomalies in the framework of the anomaly-mediated SUSY breaking (AMSB) scenario. Two Yukawa-deflected AMSB models are proposed and discussed in depth: one with lepton-specific interactions and the other with messenger-matter interactions. Both models are found to be able to jointly explain the anomalies at the $2\sigma$ level by naturally realizing the preferred parameter space with $\mu M_1, \mu M_2 < 0$ and a very heavy left-handed smuon.

Preprint, 2022, arXiv

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

Video text spotting (VTS) is the task of simultaneously detecting, tracking, and recognizing text in a video. Existing video text spotting methods typically rely on sophisticated pipelines and multiple models, which is unfriendly to real-time applications. Here we propose a real-time end-to-end video text spotter with Contrastive Representation learning (CoText). Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (i.e., text detection, tracking, and recognition) in a real-time, end-to-end trainable framework. 2) With contrastive learning, CoText models long-range dependencies and learns temporal information across multiple frames. 3) A simple, lightweight architecture is designed for effective and accurate performance, including GPU-parallel detection post-processing and a CTC-based recognition head with Masked RoI. Extensive experiments show the superiority of our method. In particular, CoText achieves a video text spotting IDF1 of 72.0% at 41.0 FPS on ICDAR2015video, improvements of 10.5% and 32.0 FPS over the previous best method. The code can be found at github.com/weijiawu/CoText.
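The IDF1 score quoted above is, in the multi-object tracking literature, the F1 score over identity-matched detections: IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). A minimal sketch of that formula follows; whether the ICDAR2015 video protocol applies exactly this definition is an assumption, and the counts are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: harmonic mean of identity precision and identity recall.

    idtp: detections matched to the correct identity
    idfp: predicted detections with no correct identity match
    idfn: ground-truth detections with no correct identity match
    """
    denominator = 2 * idtp + idfp + idfn
    if denominator == 0:
        raise ValueError("no detections to score")
    return 2 * idtp / denominator


# Hypothetical counts (not taken from the paper).
print(idf1(idtp=720, idfp=280, idfn=280))  # 0.72
```

Because identity switches turn correct detections into both false positives and false negatives for the affected track, IDF1 penalizes association errors more heavily than a detection-only F1 would.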

Preprint, 2022, arXiv

The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation

The referring video object segmentation (RVOS) task aims to segment, in all frames of a given video, the object instances referred to by a language expression. Because it requires understanding cross-modal semantics within individual instances, this task is more challenging than traditional semi-supervised video object segmentation, where the ground-truth object masks in the first frame are given. With the great success of Transformers in object detection and object segmentation, RVOS has made remarkable progress, with ReferFormer achieving state-of-the-art performance. In this work, building on the strong ReferFormer baseline, we propose several tricks to boost performance further, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference. The improved ReferFormer ranks 2nd in the CVPR2022 Referring Youtube-VOS Challenge.
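The abstract names cyclical learning rates without giving the schedule. As an illustration only, the sketch below implements the widely used triangular policy, in which the rate climbs linearly from a base value to a maximum over `step_size` iterations and then falls back. The function name, the bounds, and the step size are assumptions, not the settings of the submission.

```python
import math


def triangular_lr(
    iteration: int, base_lr: float, max_lr: float, step_size: int
) -> float:
    """Triangular cyclical learning rate at a given training iteration.

    One full cycle lasts 2 * step_size iterations: up for step_size,
    then down for step_size.
    """
    cycle = math.floor(1 + iteration / (2 * step_size))
    # x runs 1 -> 0 -> 1 within each cycle; the rate peaks where x == 0.
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Hypothetical bounds and step size.
for it in (0, 50, 100, 150, 200):
    print(it, triangular_lr(it, base_lr=1e-5, max_lr=1e-4, step_size=100))
```

With these values the rate starts at the base value at iteration 0, reaches the maximum at iteration 100, and returns to the base value at iteration 200, after which the cycle repeats.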

Preprint, 2022, arXiv

The Third Place Solution for CVPR2022 AVA Accessibility Vision and Autonomy Challenge

The goal of the AVA challenge is to provide vision-based benchmarks and methods relevant to accessibility. In this paper, we introduce the technical details of our submission to the CVPR2022 AVA Challenge. First, we conducted experiments to select a suitable model and data augmentation strategy for this task. Second, an effective training strategy was applied to improve performance. Third, we integrated the results from two different segmentation frameworks to improve performance further. Experimental results demonstrate that our approach achieves a competitive result on the AVA test set. Finally, our approach achieves 63.008% AP@0.50:0.95 on the test set of the CVPR2022 AVA Challenge.

Preprint, 2021, arXiv

Few-Shot Semantic Parsing for New Predicates

In this work, we investigate the problem of semantic parsing in a few-shot learning setting. In this setting, we are provided with k utterance-logical form pairs per new predicate. State-of-the-art neural semantic parsers achieve less than 25% accuracy on benchmark datasets when k = 1. To tackle this problem, we propose to i) apply a designated meta-learning method to train the model; ii) regularize attention scores with alignment statistics; iii) apply a smoothing technique in pre-training. As a result, our method consistently outperforms all the baselines in both one-shot and two-shot settings.

Preprint, 2021, arXiv

Gluino-SUGRA scenarios in light of FNAL muon g-2 anomaly

Gluino-SUGRA ($\tilde{g}$SUGRA), an economical extension of the predictive mSUGRA, adopts a much heavier gluino mass parameter than the other gaugino mass parameters and the universal scalar mass parameter at the unification scale. It can elegantly reconcile the experimental results on the Higgs boson mass, the muon $g-2$, the null results in searches for supersymmetry at the LHC, and the results from B-physics. In this work, we propose several new ways to generate a large gaugino hierarchy (i.e. $M_3\gg M_1,M_2$) for $\tilde{g}$SUGRA model building and then discuss in detail the implications of the new muon $g-2$ results with the updated LHC constraints on such $\tilde{g}$SUGRA scenarios. We obtain the following observations: (i) For the most interesting $M_1=M_2$ case at the GUT scale with a viable bino-like dark matter, the $\tilde{g}$SUGRA model can explain the muon $g-2$ anomaly at the $1\sigma$ level and be consistent with the updated LHC constraints for $6\leq M_3/M_1 \leq 9$ at the GUT scale; (ii) For $M_1:M_2=5:1$ at the GUT scale with wino-like dark matter, the $\tilde{g}$SUGRA model can explain the muon $g-2$ anomaly at the $2\sigma$ level and be consistent with the updated LHC constraints for $3\leq M_3/M_1 \leq 3.2$ at the GUT scale; (iii) For $M_1:M_2=3:2$ at the GUT scale with mixed bino-wino dark matter, the $\tilde{g}$SUGRA model can explain the muon $g-2$ anomaly at the $1\sigma$ level and be consistent with the updated LHC constraints for $6.9\leq M_3/M_1 \leq 7.5$ at the GUT scale. Although the choice of a heavy gluino will always increase the fine-tuning (FT) involved, some of the surviving $1\sigma/2\sigma$ points of $\Delta a_\mu^{\rm combine}$ can still allow low electroweak fine-tuning (EWFT) of order several hundred and be fairly natural. Constraints from (dimension-five operator induced) proton decay are also discussed.

Preprint, 2021, arXiv

On Robustness of Neural Semantic Parsers

Semantic parsing maps natural language (NL) utterances into logical forms (LFs), which underpins many advanced NLP problems. Semantic parsers gain performance boosts from deep neural networks but inherit their vulnerability to adversarial examples. In this paper, we provide an empirical study of the robustness of semantic parsers in the presence of adversarial attacks. Formally, adversarial examples for semantic parsing are taken to be perturbed utterance-LF pairs whose utterances have exactly the same meanings as the original ones. A scalable methodology is proposed to construct robustness test sets based on existing benchmark corpora. Our results answer five research questions on measuring the performance of state-of-the-art parsers on robustness test sets and on evaluating the effect of data augmentation.

Preprint, 2020, arXiv

Explaining The XENON1T Excess With Light Goldstini Dark Matter

In scenarios with a multiplicity of sectors that independently break supersymmetry, a multiplicity of goldstini is predicted. We propose a new interpretation of the electron recoil excess at 2-7 keV observed in the XENON1T experiment in terms of very long-lived goldstini DM elastically scattering off electrons. The goldstini DM can be boosted by the late decay of the other nearly degenerate (long-lived) goldstini DM, with their tiny mass difference being converted into kinetic energy of the lighter goldstini DM and neutrinos. We show that a viable parameter space can be found that explains the excess of electron recoil events around 2-3 keV recently reported by the XENON1T experiment.
