Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
8 works
0 followers
13 topics
4 close collaborators

Published work

8 published items

preprint, 2026, arXiv

An Executable Benchmarking Suite for Tool-Using Agents

Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay bindings, and provenance under one auditable contract. The gate is decision-relevant rather than merely clerical: in a separate WebArena Verified controller study, clean-baseline and medium live-stressed evaluation select different fixed controller variants under the same workload and admission contract. The release is scoped as a benchmarking suite and admitted evidence, not a new agent policy, model leaderboard, backend comparison, or autonomous SWE-bench solver.

preprint, 2026, arXiv

CD4LM: Consistency Distillation and aDaptive Decoding for Diffusion Language Models

Autoregressive large language models achieve strong results on many benchmarks, but decoding remains fundamentally latency-limited by sequential dependence on previously generated tokens. Diffusion language models (DLMs) promise parallel generation but suffer from a fundamental static-to-dynamic misalignment: training optimizes local transitions under fixed schedules, whereas efficient inference requires adaptive "long-jump" refinements through unseen states. Our goal is to enable highly parallel decoding for DLMs with a low number of function evaluations while preserving generation quality. To achieve this, we propose CD4LM, a framework that decouples training from inference via Discrete-Space Consistency Distillation (DSCD) and Confidence-Adaptive Decoding (CAD). Unlike standard objectives, DSCD trains a student to be trajectory-invariant, mapping diverse noisy states directly to the clean distribution. This intrinsic robustness enables CAD to dynamically allocate compute resources based on token confidence, aggressively skipping steps without the quality collapse typical of heuristic acceleration. On GSM8K, CD4LM matches the LLaDA baseline with a 5.18x wall-clock speedup; across code and math benchmarks, it strictly dominates the accuracy-efficiency Pareto frontier, achieving a 3.62x mean speedup while improving average accuracy. Code is available at https://github.com/yihao-liang/CDLM
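
As a rough illustration of confidence-based step allocation, the sketch below picks which masked positions to commit in one decoding step. It is not the CAD procedure from the paper: the function name `positions_to_commit`, the `threshold` parameter, and the fallback rule are all assumptions made for this example.

```python
def positions_to_commit(confidences, masked, threshold=0.9):
    """Pick masked positions to finalize in one decoding step.

    confidences: per-position confidence scores in [0, 1] (hypothetical input).
    masked: indices that are still undecided.
    threshold: assumed cutoff; positions at or above it are committed together.
    """
    candidates = [i for i in masked if confidences[i] >= threshold]
    # Fallback (an assumption of this sketch): if nothing clears the
    # threshold, commit the single most confident position so that
    # decoding always makes progress.
    if not candidates and masked:
        candidates = [max(masked, key=lambda i: confidences[i])]
    return sorted(candidates)
```

With many confident positions, a step commits several tokens at once; with none, it degrades to one token per step.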

preprint, 2026, arXiv

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Static-graph LLM decoders provide predictable launches, fixed tensor shapes, and low submission overhead, but online decoding exposes highly irregular KV-cache behavior: request lengths differ, EOS events arrive asynchronously, and logical histories fragment over time. Dynamic runtimes recover flexibility through paged KV management and step-level scheduling, while static-graph executors often over-reserve memory and suffer burst-time latency outliers. This paper studies whether much of this variability can be absorbed below a fixed decode interface. We present KV-RM, a runtime design that regularizes KV-cache movement beneath a static-graph LLM decoder. KV-RM decouples logical KV histories from physical storage, tracks active KV state through a block pager, and materializes each decode step through a single committed descriptor. A merge-staged transport path coalesces non-contiguous KV mappings into a small number of large transfer groups before a fixed-shape attention kernel consumes them. Optional bounded far-history summaries can be enabled under the same interface, but the core design does not depend on them. On a 2-GPU NVIDIA A100 node, KV-RM improves mixed-length decoding throughput and tail latency relative to a static-graph baseline, reduces reserved KV memory across workload families, and removes severe burst-time latency spikes under production-trace replay. These results suggest that KV-cache movement, rather than kernel shape, can be an effective boundary for recovering runtime flexibility in static-graph LLM serving.
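
To make the idea of coalescing non-contiguous KV mappings concrete, here is a minimal sketch that merges physical block ids into runs of consecutive ids. It is an illustration only, not KV-RM's transport path: the function name `coalesce_blocks` and the `(start, length)` group format are assumptions, and the sketch ignores the logical ordering that a real runtime would have to track.

```python
def coalesce_blocks(block_ids):
    """Merge physical block ids into (start, length) runs of consecutive ids.

    block_ids: iterable of integer block ids (hypothetical input format).
    Returns a list of (start, length) tuples sorted by start.
    """
    groups = []
    for b in sorted(set(block_ids)):
        if groups and groups[-1][0] + groups[-1][1] == b:
            # b directly follows the last run, so extend that run.
            start, length = groups[-1]
            groups[-1] = (start, length + 1)
        else:
            # Gap before b: start a new run.
            groups.append((b, 1))
    return groups
```

Fewer, larger groups mean fewer transfers per decode step, which is the effect the abstract attributes to merge staging.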

preprint, 2022, arXiv

InsPose: Instance-Aware Networks for Single-Stage Multi-Person Pose Estimation

Multi-person pose estimation is an attractive and challenging task. Existing methods are mostly based on two-stage frameworks, which include top-down and bottom-up methods. Two-stage methods either suffer from high computational redundancy from additional person detectors or need to group keypoints heuristically after predicting all the instance-agnostic keypoints. The single-stage paradigm aims to simplify the multi-person pose estimation pipeline and has received a lot of attention. However, recent single-stage methods have the limitation of low performance due to the difficulty of regressing various full-body poses from a single feature vector. Different from previous solutions that involve complex heuristic designs, we present a simple yet effective solution by employing instance-aware dynamic networks. Specifically, we propose an instance-aware module to adaptively adjust (part of) the network parameters for each instance. Our solution can significantly increase the capacity and adaptability of the network for recognizing various poses, while maintaining a compact end-to-end trainable pipeline. Extensive experiments on the MS-COCO dataset demonstrate that our method achieves significant improvement over existing single-stage methods, and strikes a better balance between accuracy and efficiency than the state-of-the-art two-stage approaches. The code and models are available at https://github.com/hikvision-research/opera.

preprint, 2022, arXiv

Optimizing Huffman Decoding for Error-Bounded Lossy Compression on GPUs

More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, and cuSZ is a version of SZ designed to take advantage of the GPU's power. At present, cuSZ's compression performance has been optimized significantly while its decompression still suffers considerably lower performance because of its sophisticated lossless compression step -- a customized Huffman decoding. In this work, we aim to significantly improve the Huffman decoding performance for cuSZ, thus improving the overall decompression performance in turn. To this end, we first investigate two state-of-the-art GPU Huffman decoders in depth. Then, we propose a deep architectural optimization for both algorithms. Specifically, we take full advantage of CUDA GPU architectures by using shared memory on decoding/writing phases, online tuning the amount of shared memory to use, improving memory access patterns, and reducing warp divergence. Finally, we evaluate our optimized decoders on an Nvidia V100 GPU using eight representative scientific datasets. Our new decoding solution obtains an average speedup of 3.64X over cuSZ's Huffman decoder and improves its overall decompression performance by 2.43X on average.

preprint, 2022, arXiv

SZx: an Ultra-fast Error-bounded Lossy Compressor for Scientific Datasets

Today's scientific high performance computing (HPC) applications or advanced instruments are producing vast volumes of data across a wide range of domains, which introduces a serious burden on data transfer and storage. Error-bounded lossy compression has been developed and widely used in the scientific community, because not only can it significantly reduce the data volumes but it can also strictly control the data distortion based on the user-specified error bound. Existing lossy compressors, however, cannot offer ultra-fast compression speed, which is highly demanded by quite a few applications or use-cases (such as in-memory compression and online instrument data compression). In this paper, we propose a novel ultra-fast error-bounded lossy compressor, which can obtain fairly high compression performance on both CPU and GPU, also with reasonably high compression ratios. The key contributions are three-fold: (1) We propose a novel, generic ultra-fast error-bounded lossy compression framework, called UFZ, by confining our design to be composed of only super-lightweight operations such as bitwise and addition/subtraction operations, while still keeping a reasonably high compression ratio. (2) We implement UFZ on both CPU and GPU and optimize the performance according to their architectures carefully. (3) We perform a comprehensive evaluation with 6 real-world production-level scientific datasets on both CPU and GPU. Experiments show that UFZ is 2~16X as fast as the second-fastest existing error-bounded lossy compressor (either SZ or ZFP) on CPU and GPU, with respect to both compression and decompression.
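
The notion of an error bound can be illustrated with a toy block-wise scheme that uses only comparisons and simple arithmetic. This is a sketch under stated assumptions, not the UFZ design: the names `compress_blocks` and `decompress_blocks`, the `block_size` default, and the "const"/"raw" block encoding are all hypothetical.

```python
def compress_blocks(data, error_bound, block_size=4):
    """Toy error-bounded compressor: collapse near-constant blocks.

    A block whose values all lie within error_bound of its midpoint is
    stored as one number; other blocks are kept verbatim (lossless).
    """
    out = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        lo, hi = min(block), max(block)
        if (hi - lo) / 2 <= error_bound:
            # Every value is within error_bound of the midpoint.
            out.append(("const", (lo + hi) / 2, len(block)))
        else:
            out.append(("raw", list(block)))
    return out


def decompress_blocks(blocks):
    """Invert compress_blocks, expanding constant blocks to their midpoint."""
    data = []
    for blk in blocks:
        if blk[0] == "const":
            data.extend([blk[1]] * blk[2])
        else:
            data.extend(blk[1])
    return data
```

Each reconstructed value then differs from the original by at most the error bound (up to floating-point rounding), which is the guarantee that distinguishes error-bounded compression from ordinary lossy compression.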

preprint, 2021, arXiv

Scalable and accurate multi-GPU based image reconstruction of large-scale ptychography data

While the advances in synchrotron light sources, together with the development of focusing optics and detectors, allow nanoscale ptychographic imaging of materials and biological specimens, the corresponding experiments can yield terabyte-scale volumes of data that can impose a heavy burden on the computing platform. While Graphical Processing Units (GPUs) provide high performance for such large-scale ptychography datasets, a single GPU is typically insufficient for analysis and reconstruction. Several existing works have considered leveraging multiple GPUs to accelerate the ptychographic reconstruction. However, they utilize only Message Passing Interface (MPI) to handle the communications between GPUs. This is inefficient for configurations that have multiple GPUs in a single node, especially while processing a single large projection, since it provides no optimizations to handle the heterogeneous GPU interconnections containing both low-speed links, e.g., PCIe, and high-speed links, e.g., NVLink. In this paper, we provide a multi-GPU implementation that can effectively solve large-scale ptychographic reconstruction problems with optimized performance on intra-node multi-GPU. We focus on the conventional maximum-likelihood reconstruction problem using conjugate-gradient (CG) for the solution and propose a novel hybrid parallelization model to address the performance bottlenecks in the CG solver. Accordingly, we develop a tool called PtyGer (Ptychographic GPU(multiple)-based reconstruction), implementing our hybrid parallelization model design. The comprehensive evaluation verifies that PtyGer can fully preserve the original algorithm's accuracy while achieving outstanding intra-node GPU scalability.

preprint, 2020, arXiv

Combinatorial Laser Molecular Beam Epitaxy System Integrated with Specialized Low-temperature Scanning Tunneling Microscopy

We present a newly developed facility, comprising a combinatorial laser molecular beam epitaxy system and an in-situ scanning tunneling microscope (STM). This facility aims at accelerating the materials research in a highly efficient way, by advanced high-throughput film synthesis techniques and subsequent fast characterization of surface morphology and electronic states. Compared with uniform films deposited by conventional methods, the so-called combinatorial thin films will be beneficial to determining the accurate phase diagrams of different materials due to the improved control of parameters such as chemical substitution and sample thickness resulting from a rotary-mask method. A specially designed STM working under low-temperature and ultra-high vacuum conditions is optimized for the characterization of combinatorial thin films, in an XY coarse motion range of 15 mm × 15 mm and with sub-micrometer location precision. The overall configuration as well as some key aspects like sample holder design, scanner head, and sample/tip/target transfer mechanism are described in detail. The performance of the device is demonstrated by synthesizing high-quality superconducting FeSe thin films with gradient thickness, imaging surfaces of highly oriented pyrolytic graphite, Au(111), Bi2Sr2CaCu2O8+δ (BSCCO) and FeSe. In addition, we have also obtained clean noise spectra of tunneling junctions and the superconducting energy gap of BSCCO. The successful manufacturing of such a facility opens a new window for the next generation of equipment designed for experimental materials research.