Researcher profile

Xiao Zhou

Xiao Zhou contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
13 works
0 followers
10 topics
4 close collaborators


Published work

13 published item(s)

preprint | 2026 | arXiv

AcademiClaw: When Students Set Challenges for AI Agents

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows (homework, research projects, competitions, and personal projects) that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.

preprint | 2026 | arXiv

MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions

Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet they can still produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs' capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs' honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs. Our data and code can be found at https://github.com/yanxuzhu/MoHoBench.

preprint | 2026 | arXiv

OpenComputer: Verifiable Software Worlds for Computer-Use Agents

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. Experiments show that OpenComputer's hard-coded verifiers align more closely with human adjudication than LLM-as-judge evaluation, especially when success depends on fine-grained application state. Frontier agents struggle with end-to-end completion despite partial progress, and open-source models exhibit sharp drops from their OSWorld-Verified scores, exposing a persistent gap in robust computer automation.
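
The idea of a hard-coded state verifier that yields an auditable partial-credit reward can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Check` type, the `score` function, the weights, and the example state keys are hypothetical and are not taken from OpenComputer's actual code or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical structured application state, as an inspection endpoint might return it.
State = Dict[str, object]


@dataclass(frozen=True)
class Check:
    """One machine-checkable condition on the application state (illustrative)."""
    name: str
    weight: float
    predicate: Callable[[State], bool]


def score(state: State, checks: List[Check]) -> Tuple[float, List[Tuple[str, bool]]]:
    """Return a reward in [0, 1] plus a per-check audit trail."""
    total = sum(c.weight for c in checks)
    if total <= 0:
        raise ValueError("checks must carry positive total weight")
    # Record the outcome of every check so the reward can be audited later.
    audit = [(c.name, bool(c.predicate(state))) for c in checks]
    earned = sum(c.weight for c, (_, passed) in zip(checks, audit) if passed)
    return earned / total, audit


# Hypothetical final state of a presentation task.
state: State = {"file_exists": True, "title": "Q3 Report", "slide_count": 4}
checks = [
    Check("file saved", 1.0, lambda s: s.get("file_exists") is True),
    Check("title set", 1.0, lambda s: s.get("title") == "Q3 Report"),
    Check("five slides", 2.0, lambda s: s.get("slide_count") == 5),
]

reward, audit = score(state, checks)
print(reward)
for name, passed in audit:
    print(f"{name}: {'pass' if passed else 'fail'}")
```

In this sketch the agent earns credit for the two checks it satisfies even though the task as a whole is incomplete, and the audit list shows exactly which conditions produced the reward.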

preprint | 2026 | arXiv

Why not Collaborative Filtering in Dual View? Bridging Sparse and Dense Models

Collaborative Filtering (CF) remains the cornerstone of modern recommender systems, with dense embedding-based methods dominating current practice. However, these approaches suffer from a critical limitation: our theoretical analysis reveals a fundamental signal-to-noise ratio (SNR) ceiling when modeling unpopular items, where parameter-based dense models experience diminishing SNR under severe data sparsity. To overcome this bottleneck, we propose SaD (Sparse and Dense), a unified framework that integrates the semantic expressiveness of dense embeddings with the structural reliability of sparse interaction patterns. We theoretically show that aligning these dual views yields a strictly superior global SNR. Concretely, SaD introduces a lightweight bidirectional alignment mechanism: the dense view enriches the sparse view by injecting semantic correlations, while the sparse view regularizes the dense model through explicit structural signals. Extensive experiments demonstrate that, under this dual-view alignment, even a simple matrix factorization-style dense model can achieve state-of-the-art performance. Moreover, SaD is plug-and-play and can be seamlessly applied to a wide range of existing recommender models, highlighting the enduring power of collaborative filtering when leveraged from dual perspectives. Further evaluations on real-world benchmarks show that SaD consistently outperforms strong baselines, ranking first on the BarsMatch leaderboard. The code is publicly available at https://github.com/harris26-G/SaD.

preprint | 2022 | arXiv

Effect of compositional fluctuation on the survival of bet-hedging species

Understanding the coexistence of diverse species in a changing environment is an important problem in community ecology. Bet-hedging is a strategy that helps species survive in such changing environments. However, studies of bet-hedging have often focused on the expected long-term growth rate of the species by itself, neglecting competition with other coexisting species. Here we study the extinction risk of a bet-hedging species in competition with others. We show that there are three contributions to the extinction risk. The first is the usual demographic fluctuation due to stochastic reproduction and selection processes in finite populations. The second, due to the fluctuation of population growth rate caused by environmental changes, may counterintuitively reduce the extinction risk for small populations. Besides those two, we reveal a third contribution, which is unique to bet-hedging species that diversify into multiple phenotypes: The phenotype composition of the population will fluctuate over time, resulting in increased extinction risk. We compare such compositional fluctuation to the demographic and environmental contributions, showing how they have different effects on the extinction risk depending on the population size, generation overlap, and environmental correlation.

preprint | 2022 | arXiv

Evaluation of non-pharmaceutical interventions and optimal strategies for containing the COVID-19 pandemic

Given that multiple new COVID-19 variants are continuously emerging, non-pharmaceutical interventions are still the primary control strategies to curb the further spread of coronavirus. However, implementing strict interventions over extended periods of time inevitably hurts the economy. With the aim of solving this multi-objective decision-making problem, we investigate the underlying associations between policies, mobility patterns, and virus transmission. We further evaluate the relative performance of existing COVID-19 control measures and explore potential optimal strategies that can strike the right balance between public health and socio-economic recovery for individual states in the US. The results highlight the power of state of emergency declarations and wearing face masks, and emphasize the necessity of pursuing tailor-made strategies for different states and phases of epidemiological transmission. Our framework enables policymakers to create more refined designs of COVID-19 strategies and can be extended to inform policymakers of any country about best practices in pandemic response.

preprint | 2022 | arXiv

Neural Global Shutter: Learn to Restore Video from a Rolling Shutter Camera with Global Reset Feature

Most computer vision systems assume distortion-free images as inputs. The widely used rolling-shutter (RS) image sensors, however, suffer from geometric distortion when the camera and object undergo motion during capture. Extensive research has been conducted on correcting RS distortions. However, most of the existing work relies heavily on prior assumptions about scenes or motions. Besides, the motion estimation steps are either oversimplified or computationally inefficient due to the heavy flow warping, limiting their applicability. In this paper, we investigate using rolling shutter with a global reset feature (RSGR) to restore clean global shutter (GS) videos. This feature enables us to turn the rectification problem into a deblur-like one, getting rid of inaccurate and costly explicit motion estimation. First, we build an optical system that captures paired RSGR/GS videos. Second, we develop a novel algorithm incorporating spatial and temporal designs to correct the spatially varying RSGR distortion. Third, we demonstrate that existing image-to-image translation algorithms can recover clean GS videos from distorted RSGR inputs, yet our algorithm achieves the best performance with the specific designs. Our rendered results are not only visually appealing but also beneficial to downstream tasks. Compared to the state-of-the-art RS solution, our RSGR solution is superior in both effectiveness and efficiency. Considering that it is easy to realize without changing the hardware, we believe our RSGR solution can potentially replace the RS solution in taking distortion-free videos with low noise and a low budget.

preprint | 2022 | arXiv

NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video: Dataset, Methods and Results

This paper reviews the NTIRE 2022 Challenge on Super-Resolution and Quality Enhancement of Compressed Video. In this challenge, we proposed the LDV 2.0 dataset, which includes the LDV dataset (240 videos) and 95 additional videos. The challenge includes three tracks. Track 1 aims at enhancing videos compressed by HEVC at a fixed QP. Track 2 and Track 3 target both the super-resolution and quality enhancement of HEVC-compressed video. They require x2 and x4 super-resolution, respectively. In total, the three tracks attracted more than 600 registrations. In the test phase, 8 teams, 8 teams and 12 teams submitted final results to Tracks 1, 2 and 3, respectively. The proposed methods and solutions gauge the state of the art in super-resolution and quality enhancement of compressed video. The proposed LDV 2.0 dataset is available at https://github.com/RenYang-home/LDV_dataset. The homepage of this challenge (including open-sourced codes) is at https://github.com/RenYang-home/NTIRE22_VEnh_SR.

preprint | 2022 | arXiv

UNet#: A UNet-like Redesigning Skip Connections for Medical Image Segmentation

As an essential prerequisite for developing a medical intelligent assistant system, medical image segmentation has received extensive attention from the neural network community. A series of UNet-like networks with encoder-decoder architecture has achieved extraordinary success, in which UNet2+ and UNet3+ redesign skip connections, proposing dense skip connections and full-scale skip connections respectively, and improve dramatically on UNet in medical image segmentation. However, UNet2+ lacks sufficient information explored from the full scale, which affects the learning of organs' location and boundary. Although UNet3+ can obtain the full-scale aggregation feature map, owing to the small number of neurons in the structure, it does not segment tiny objects satisfactorily when the number of samples is small. This paper proposes a novel network structure combining dense skip connections and full-scale skip connections, named UNet-sharp (UNet#) because its shape resembles the symbol #. The proposed UNet# can aggregate feature maps of different scales in the decoder sub-network and capture fine-grained details and coarse-grained semantics from the full scale, which benefits learning the exact location and accurately segmenting the boundary of organs or lesions. We perform deep supervision for model pruning to speed up testing and make it possible for the model to run on mobile devices; furthermore, we design two classification-guided modules to reduce false positives and achieve more accurate segmentation results. Various experiments on semantic segmentation and instance segmentation across datasets of different modalities (EM, CT, MRI) and dimensions (2D, 3D), including nuclei, brain tumor, liver, and lung, demonstrate that the proposed method outperforms state-of-the-art models.

preprint | 2020 | arXiv

Application of Seq2Seq Models on Code Correction

We apply various seq2seq models to programming language correction tasks on the Juliet Test Suite for C/C++ and Java of the Software Assurance Reference Datasets (SARD), and achieve 75% (for C/C++) and 56% (for Java) repair rates on these tasks. We introduce a Pyramid Encoder in these seq2seq models, which largely increases computational efficiency and memory efficiency while retaining a repair rate similar to that of their non-pyramid counterparts. We successfully carry out an error type classification task on ITC benchmark examples (with only 685 code instances) using transfer learning with models pre-trained on the Juliet Test Suite, pointing out a novel way of processing small programming language datasets.

preprint | 2020 | arXiv

ASAS J174406+2446.8 is identified as a marginal-contact binary with a possible cool third body

ASAS J174406+2446.8 was originally identified by the ASAS survey as a $δ$ Scuti-type pulsating star with a period of P = 0.189068 days. However, the LAMOST stellar parameters reveal that it lies far beyond the red edge of the pulsational instability strip on the $\log g-T$ diagram of $δ$ Scuti pulsating stars. To understand the physical properties of the variable star, we observed it with the 1.0-m Cassegrain reflecting telescope at Yunnan Observatories. Multi-color light curves in the B, V, R$_{c}$ and I$_{c}$ bands were obtained and analyzed using the W-D program. It is found that this variable star is a shallow-contact binary with an EB-type light curve and an orbital period of 0.3781 days rather than a $δ$ Scuti star. It is a W-subtype contact binary with a mass ratio of $1.135(\pm0.019)$ and a fill-out factor of $10.4(\pm5.6)\,\%$. The situation of ASAS J174406+2446.8 resembles those of other EB-type marginal-contact binaries such as UU Lyn, II Per and GW Tau. All of them are at a key evolutionary phase from a semi-detached configuration to a contact system, as predicted by the thermal relaxation oscillation theory. The linear ephemeris was corrected using 303 newly determined times of light minimum. The O - C curve shows a sinusoidal variation that could be explained by the light-travel-time effect due to the presence of a cool red dwarf. The present investigation reveals that some of the $δ$ Scuti-type stars beyond the red edge of the pulsational instability strip on the $\log g-T$ diagram are misclassified eclipsing binaries. To understand their structures and evolutionary states, more studies are required in the future.
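
The two periods quoted in this abstract appear to be related by a factor of two, which is the usual signature of an eclipsing binary whose two minima per orbit were initially read as a single pulsation cycle. The abstract does not state this relation explicitly, so the short check below is only an illustrative reading of the reported numbers; the variable names are hypothetical.

```python
# Values quoted in the abstract.
asas_period_days = 0.189068     # period from the original delta Scuti classification
orbital_period_days = 0.3781    # orbital period from the binary solution

# An eclipsing binary shows two minima per orbit, so a period search on the
# light curve can lock onto half of the true orbital period.
doubled_period_days = 2 * asas_period_days

relative_difference = (
    abs(doubled_period_days - orbital_period_days) / orbital_period_days
)

print(f"2 x ASAS period  = {doubled_period_days:.6f} d")
print(f"orbital period   = {orbital_period_days:.4f} d")
print(f"relative offset  = {relative_difference:.1e}")
```

Doubling the ASAS period gives 0.378136 days, which agrees with the quoted orbital period to the four decimal places reported.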

preprint | 2020 | arXiv

Diversifying Dialogue Generation with Non-Conversational Text

Neural network-based sequence-to-sequence (seq2seq) models suffer strongly from the low-diversity problem when it comes to open-domain dialogue generation. As bland and generic utterances usually dominate the frequency distribution in our daily chitchat, avoiding them to generate more interesting responses requires complex data filtering, sampling techniques or modifying the training objective. In this paper, we propose a new perspective to diversify dialogue generation by leveraging non-conversational text. Compared with bilateral conversations, non-conversational text is easier to obtain, more diverse, and covers a much broader range of topics. We collect a large-scale non-conversational corpus from multiple sources including forum comments, idioms and book snippets. We further present a training paradigm to effectively incorporate this text via iterative back translation. The resulting model is tested on two conversational datasets and is shown to produce significantly more diverse responses without sacrificing relevance to the context.

preprint | 2019 | arXiv

Photometric investigation on the W-subtype contact binary V1197 Her

Multi-color light curves of V1197 Her were obtained with the 2.4-meter optical telescope at the Thai National Observatory, and the Wilson-Devinney (W-D) program was used to model the observed light curves. The photometric solutions reveal that V1197 Her is a W-subtype shallow contact binary system with a mass ratio of $q = 2.61$ and a fill-out factor of $f = 15.7\,\%$. The temperature difference between the primary star and secondary star is only $140\,\mathrm{K}$ in spite of the low degree of contact, which means that V1197 Her is not only in a geometrical contact configuration but also already in thermal contact. The orbital inclination of V1197 Her is as high as $i = 82.7^{\circ}$, and the primary star is completely eclipsed at the primary minimum. The totally eclipsing characteristic implies that the determined physical parameters are highly reliable. The masses, radii and luminosities of the primary star (star 1) and secondary star (star 2) are estimated to be $M_{1} = 0.30(1)M_\odot$, $M_{2} = 0.77(2)M_\odot$, $R_{1} = 0.54(1)R_\odot$, $R_{2} = 0.83(1)R_\odot$, $L_{1} = 0.18(1)L_\odot$ and $L_{2} = 0.38(1)L_\odot$. The evolutionary status of the two component stars is plotted in the H - R diagram, which shows that the less massive but hotter primary star is more evolved than the secondary star. The period of V1197 Her is decreasing continuously at a rate of $dP/dt=-2.58\times{10^{-7}}\,\mathrm{day\,yr^{-1}}$, which can be explained by mass transfer from the more massive star to the less massive one at a rate of $\frac{dM_{2}}{dt}=-1.61\times{10^{-7}}\,M_\odot\,\mathrm{yr^{-1}}$. The light curves of V1197 Her are reported to show the O'Connell effect. Thus, a cool spot is added to the massive star to model the asymmetry in the light curves.
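
The link between the quoted period decrease and the mass-transfer rate can be checked with a short calculation. This is a sketch under a stated assumption: it uses the standard conservative mass-transfer relation (total mass and orbital angular momentum conserved), dP/dt / P = 3 (dM2/dt) (1/M1 - 1/M2), which the abstract does not name as the authors' method. The orbital period is not given in the abstract, so the code only derives the period that these numbers imply; the variable names are hypothetical.

```python
# Values quoted in the abstract (solar masses, days, years).
m1 = 0.30             # primary (less massive) star
m2 = 0.77             # secondary (more massive) star
dm2_dt = -1.61e-7     # mass lost by the more massive star, M_sun per year
dp_dt = -2.58e-7      # period change, days per year

# Ratio of the rounded masses, for comparison with the photometric q = 2.61.
mass_ratio = m2 / m1

# Assumed conservative mass transfer: (dP/dt) / P = 3 * (dM2/dt) * (1/M1 - 1/M2).
fractional_rate_per_year = 3.0 * dm2_dt * (1.0 / m1 - 1.0 / m2)

# Orbital period implied by the two quoted rates (derived here, not quoted).
implied_period_days = dp_dt / fractional_rate_per_year

print(f"M2 / M1           = {mass_ratio:.2f}")
print(f"(dP/dt) / P       = {fractional_rate_per_year:.3e} per year")
print(f"implied period    = {implied_period_days:.3f} days")
```

Under this assumption the fractional rate is negative, matching the reported sign of the period change, and the implied period comes out at roughly a quarter of a day, a typical value for a contact binary. The ratio of the rounded masses (about 2.57) is close to the photometric mass ratio once the quoted uncertainties are taken into account.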