Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
19works
0followers
14topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

19 published item(s)

preprint2026arXiv

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve memory-bound attention workloads. To translate this bandwidth into proportional performance gains, we introduce (i) a logic-die microarchitecture that fully exploits per-cube internal bandwidth for decode attention under a minimal power and area budget, (ii) a two-level hybrid parallelism scheme, and (iii) a reordered collective flow that reduces intra-chip die-to-die communication overhead. We further conduct a design-space exploration over per-cube compute power and intra-chip D2D link bandwidth, providing actionable guidance for hardware designers. Evaluations show that AMMA achieves 15.5X lower attention latency and 6.9X lower energy consumption compared with the NVIDIA H100.

preprint2026arXiv

Bridging Superconducting and Neutral-Atom Platforms for Efficient Fault-Tolerant Quantum Architectures

The transition to the fault-tolerant era exposes the limitations of homogeneous quantum systems, where no single qubit modality simultaneously offers optimal operation speed, connectivity, and scalability. In this work, we propose a strategic approach to Heterogeneous Quantum Architectures (HQA) that synthesizes the distinct advantages of the superconducting (SC) and neutral atom (NA) platforms. We explore two architectural role assignment strategies based on hardware characteristics: (1) We offload the latency-critical Magic State Factory (MSF) to fast SC devices while performing computation on scalable NA arrays, a design we term MagicAcc, which effectively mitigates the resource-preparation bottleneck. (2) We explore a Memory-Compute Separation (MCSep) paradigm that utilizes NA arrays for high-density qLDPC memory storage and SC devices for fast surface-code processing. Our evaluation, based on a comprehensive end-to-end cost model, demonstrates that principled heterogeneity yields significant performance gains. Specifically, our designs achieve $752\times$ speedup over NA-only baselines on average and reduce the physical qubit footprint by over $10\times$ compared to SC-only systems. These results chart a clear pathway for leveraging cross-modality interconnects to optimize the space-time efficiency of future fault-tolerant quantum computers.

preprint2026arXiv

ChipMATE: Multi-Agent Training via Reinforcement Learning for Enhanced RTL Generation

Existing API-based agentic systems for RTL code generation are fundamentally misaligned with industrial practice: they assume a golden testbench is available at generation time, rely on closed-source APIs incompatible with chip vendors' air-gapped security requirements, and cannot be trained on vendors' proprietary RTL codebases, leaving valuable internal data unused. Recent self-trained models address the deployment constraint but remain single-turn generators that overlook the critical role of verification in real industrial flows. To bridge these gaps, we present ChipMATE, the first self-trained multi-agent framework for RTL generation. Inspired by industrial practice where correctness emerges from cross-comparison between independently written RTL modules and reference models, ChipMATE pairs a Verilog agent with a Python reference-model agent that mutually verify each other's outputs without any golden oracle. We design a backtrack-based inference workflow to prevent error propagation across turns, and a two-stage training pipeline that first trains each agent individually to saturate its code-generation capability, then trains the team jointly to collaborate effectively. To support the training, we further build a hybrid data-generation framework that produces 64.4K high-quality reference model training samples. ChipMATE achieves 75.0\% and 80.1\% pass@1 on VerilogEval V2 with 4B and 9B base models, outperforming all existing self-trained models and even DeepSeek V4 with 1600B parameters. Our code and model weights are publicly available in https://github.com/zhongkaiyu/ChipMATE.

preprint2026arXiv

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We identify that this cost comes from synchronized stage execution and imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into useful evolution signal. FlashEvolve further improves throughput and token efficiency with speculative stage completion and adaptive workflow control. On GEPA workloads, FlashEvolve improves proposal throughput by $3.5\times$ on local vLLM and $4.9\times$ on API serving over synchronous GEPA. The same design also applies to ACE and Meta-Harness.

preprint2025arXiv

Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding

Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.

preprint2022arXiv

CollComm: Enabling Efficient Collective Quantum Communication Based on EPR buffering

The noisy and lengthy nature of quantum communication hinders the development of distributed quantum computing. The inefficient design of existing compilers for distributed quantum computing worsens the situation. Previous compilation frameworks couple communication hardware with the implementation of expensive remote gates. However, we discover that the efficiency of quantum communication, especially collective communication, can be significantly boosted by decoupling communication resources from remote operations, that is, the communication hardware would be used only for preparing remote entanglement, and the computational hardware, the component used to store program information, would be used for conducting remote gates. Based on the observation, we develop a compiler framework to optimize the collective communication happening in distributed quantum programs. In this framework, we decouple the communication preparation process in communication hardware from the remote gates conducted in computational hardware by buffering EPR pairs generated by communication hardware in qubits of the computational hardware. Experimental results show that the proposed framework can halve the communication cost of various distributed quantum programs, compared to state-of-the-art compilers for distributed quantum computing.

preprint2022arXiv

Dynamic N:M Fine-grained Structured Sparse Attention Mechanism

Transformers are becoming the mainstream solutions for various tasks like NLP and Computer vision. Despite their success, the high complexity of the attention mechanism hinders them from being applied to latency-sensitive tasks. Tremendous efforts have been made to alleviate this problem, and many of them successfully reduce the asymptotic complexity to linear. Nevertheless, most of them fail to achieve practical speedup over the original full attention under moderate sequence lengths and are unfriendly to finetuning. In this paper, we present DFSS, an attention mechanism that dynamically prunes the full attention weight matrix to N:M fine-grained structured sparse pattern. We provide both theoretical and empirical evidence that demonstrates DFSS is a good approximation of the full attention mechanism. We propose a dedicated CUDA kernel design that completely eliminates the dynamic pruning overhead and achieves speedups under arbitrary sequence length. We evaluate the 1:2 and 2:4 sparsity under different configurations and achieve 1.27~ 1.89x speedups over the full-attention mechanism. It only takes a couple of finetuning epochs from the pretrained model to achieve on par accuracy with full attention mechanism on tasks from various domains under different sequence lengths from 384 to 4096.

preprint2022arXiv

GMI-DRL: Empowering Multi-GPU Deep Reinforcement Learning with GPU Spatial Multiplexing

With the increasing popularity of robotics in industrial control and autonomous driving, deep reinforcement learning (DRL) raises the attention of various fields. However, DRL computation on the modern powerful GPU platform is still inefficient due to its heterogeneous workloads and interleaved execution paradigm. To this end, we propose GMI-DRL, a systematic design to accelerate multi-GPU DRL via GPU spatial multiplexing. We introduce a novel design of resource-adjustable GPU multiplexing instances (GMIs) to match the actual needs of DRL tasks, an adaptive GMI management strategy to simultaneously achieve high GPU utilization and computation throughput, and a highly efficient inter-GMI communication support to meet the demands of various DRL communication patterns. Comprehensive experiments reveal that GMI-DRL outperforms state-of-the-art NVIDIA Isaac Gym with NCCL (up to 2.81X) and Horovod (up to 2.34X) support in training throughput on the latest DGX-A100 platform. Our work provides an initial user experience with GPU spatial multiplexing in processing heterogeneous workloads with a mixture of computation and communication.

preprint2022arXiv

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Sparse Matrix-Matrix Multiplication (SpMM) has served as fundamental components in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85\% performance loss with a single algorithm). In this paper, we consider the challenge of input dynamics from a novel auto-tuning perspective, while following issues remain to be solved: (1) Orthogonal design principles considering sparsity. Orthogonal design principles for such a sparse problem should be extracted to form different algorithms, and further used for performance tuning. (2) Nontrivial implementations in the algorithm space. Combining orthogonal design principles to create new algorithms needs to tackle with new challenges like thread race handling. (3) Heuristic adaptability to input dynamics. The heuristic adaptability is required to dynamically optimize code for input dynamics. To tackle these challenges, we first propose a novel three-loop model to extract orthogonal design principles for SpMM on GPUs. The model not only covers previous SpMM designs, but also comes up with new designs absent from previous studies. We propose techniques like conditional reduction to implement algorithms missing in previous studies. We further propose DA-SpMM, a Data-Aware heuristic GPU kernel for SpMM. DA-SpMM adaptively optimizes code considering input dynamics. Extensive experimental results show that, DA-SpMM achieves 1.26x~1.37x speedup compared with the best NVIDIA cuSPARSE algorithm on average, and brings up to 5.59x end-to-end speedup to applications like Graph Neural Networks.

preprint2022arXiv

LightSeq2: Accelerated Training for Transformer-based Models on GPUs

Transformer-based neural models are used in many AI applications. Training these models is expensive, as it takes huge GPU resources and long duration. It is challenging because typical data like sentences have variable lengths, and Transformer's computation patterns are more complex than convolutional neural networks. Existing systems either only focus on model inference or optimization for only BERT-like encoder models. In this paper, we present LightSeq2, a system to accelerate training for a general family of Transformer models on GPUs. We propose a series of GPU optimization techniques tailored to the specific computation flow and memory access patterns of Transformer models. LightSeq2 supports many model architectures, including BERT (encoder-only), GPT (decoder-only), Transformer (encoder-decoder), and vision Transformer. Our experiments for a variety of models and benchmarks show that LightSeq2 is consistently faster (1.4-3.5x) than previous systems on different GPUs. In particular, it gains 308% training speedup compared with existing systems on a large public machine translation benchmark (WMT14 English-German).

preprint2022arXiv

Shfl-BW: Accelerating Deep Neural Network Inference with Tensor-Core Aware Weight Pruning

Weight pruning in deep neural networks (DNNs) can reduce storage and computation cost, but struggles to bring practical speedup to the model inference time. Tensor-cores can significantly boost the throughput of GPUs on dense computation, but exploiting tensor-cores for sparse DNNs is very challenging. Compared to existing CUDA-cores, tensor-cores require higher data reuse and matrix-shaped instruction granularity, both difficult to yield from sparse DNN kernels. Existing pruning approaches fail to balance the demands of accuracy and efficiency: random sparsity preserves the model quality well but prohibits tensor-core acceleration, while highly-structured block-wise sparsity can exploit tensor-cores but suffers from severe accuracy loss. In this work, we propose a novel sparse pattern, Shuffled Block-wise sparsity (Shfl-BW), designed to efficiently utilize tensor-cores while minimizing the constraints on the weight structure. Our insight is that row- and column-wise permutation provides abundant flexibility for the weight structure, while introduces negligible overheads using our GPU kernel designs. We optimize the GPU kernels for Shfl-BW in linear and convolution layers. Evaluations show that our techniques can achieve the state-of-the-art speed-accuracy trade-offs on GPUs. For example, with small accuracy loss, we can accelerate the computation-intensive layers of Transformer by 1.81, 4.18 and 1.90 times on NVIDIA V100, T4 and A100 GPUs respectively at 75% sparsity.

preprint2021arXiv

DSXplore: Optimizing Convolutional Neural Networks via Sliding-Channel Convolutions

As the key advancement of the convolutional neural networks (CNNs), depthwise separable convolutions (DSCs) are becoming one of the most popular techniques to reduce the computations and parameters size of CNNs meanwhile maintaining the model accuracy. It also brings profound impact to improve the applicability of the compute- and memory-intensive CNNs to a broad range of applications, such as mobile devices, which are generally short of computation power and memory. However, previous research in DSCs are largely focusing on compositing the limited existing DSC designs, thus, missing the opportunities to explore more potential designs that can achieve better accuracy and higher computation/parameter reduction. Besides, the off-the-shelf convolution implementations offer limited computing schemes, therefore, lacking support for DSCs with different convolution patterns. To this end, we introduce, DSXplore, the first optimized design for exploring DSCs on CNNs. Specifically, at the algorithm level, DSXplore incorporates a novel factorized kernel -- sliding-channel convolution (SCC), featured with input-channel overlapping to balance the accuracy performance and the reduction of computation and memory cost. SCC also offers enormous space for design exploration by introducing adjustable kernel parameters. Further, at the implementation level, we carry out an optimized GPU-implementation tailored for SCC by leveraging several key techniques, such as the input-centric backward design and the channel-cyclic optimization. Intensive experiments on different datasets across mainstream CNNs show the advantages of DSXplore in balancing accuracy and computation/parameter reduction over the standard convolution and the existing DSCs.

preprint2021arXiv

QGTC: Accelerating Quantized Graph Neural Networks via GPU Tensor Core

Over the most recent years, quantized graph neural network (QGNN) attracts lots of research and industry attention due to its high robustness and low computation and memory overhead. Unfortunately, the performance gains of QGNN have never been realized on modern GPU platforms. To this end, we propose the first Tensor Core (TC) based computing framework, QGTC, to support any-bitwidth computation for QGNNs on GPUs. We introduce a novel quantized low-bit arithmetic design based on the low-bit data representation and bit-decomposed computation. We craft a novel TC-tailored CUDA kernel design by incorporating 3D-stacked bit compression, zero-tile jumping, and non-zero tile reuse technique to improve the performance systematically. We incorporate an effective bandwidth-optimized subgraph packing strategy to maximize the transferring efficiency between CPU host and GPU device. We integrate QGTC with Pytorch for better programmability and extensibility. Extensive experiments demonstrate that QGTC achieves on average 2.7x speedup compared with the state-of-the-art Deep Graph Library framework across diverse settings.

preprint2020arXiv

Comprehensive SNN Compression Using ADMM Optimization and Activity Regularization

As well known, the huge memory and compute costs of both artificial neural networks (ANNs) and spiking neural networks (SNNs) greatly hinder their deployment on edge devices with high efficiency. Model compression has been proposed as a promising technique to improve the running efficiency via parameter and operation reduction. Whereas, this technique is mainly practiced in ANNs rather than SNNs. It is interesting to answer how much an SNN model can be compressed without compromising its functionality, where two challenges should be addressed: i) the accuracy of SNNs is usually sensitive to model compression, which requires an accurate compression methodology; ii) the computation of SNNs is event-driven rather than static, which produces an extra compression dimension on dynamic spikes. To this end, we realize a comprehensive SNN compression through three steps. First, we formulate the connection pruning and weight quantization as a constrained optimization problem. Second, we combine spatio-temporal backpropagation (STBP) and alternating direction method of multipliers (ADMM) to solve the problem with minimum accuracy loss. Third, we further propose activity regularization to reduce the spike events for fewer active operations. These methods can be applied in either a single way for moderate compression or a joint way for aggressive compression. We define several quantitative metrics to evaluation the compression performance for SNNs. Our methodology is validated in pattern recognition tasks over MNIST, N-MNIST, CIFAR10, and CIFAR100 datasets, where extensive comparisons, analyses, and insights are provided. To our best knowledge, this is the first work that studies SNN compression in a comprehensive manner by exploiting all compressible components and achieves better results.

preprint2020arXiv

Proq: Projection-based Runtime Assertions for Debugging on a Quantum Computer

In this paper, we propose Proq, a runtime assertion scheme for testing and debugging quantum programs on a quantum computer. The predicates in Proq are represented by projections (or equivalently, closed subspaces of the state space), following Birkhoff-von Neumann quantum logic. The satisfaction of a projection by a quantum state can be directly checked upon a small number of projective measurements rather than a large number of repeated executions. On the theory side, we rigorously prove that checking projection-based assertions can help locate bugs or statistically assure that the semantic function of the tested program is close to what we expect, for both exact and approximate quantum programs. On the practice side, we consider hardware constraints and introduce several techniques to transform the assertions, making them directly executable on the measurement-restricted quantum computers. We also propose to achieve simplified assertion implementation using local projection technique with soundness guaranteed. We compare Proq with existing quantum program assertions and demonstrate the effectiveness and efficiency of Proq by its applications to assert two ingenious quantum algorithms, the Harrow-Hassidim-Lloyd algorithm and Shor's algorithm.

preprint2020arXiv

Scalable Adversarial Attack on Graph Neural Networks with Alternating Direction Method of Multipliers

Graph neural networks (GNNs) have achieved high performance in analyzing graph-structured data and have been widely deployed in safety-critical areas, such as finance and autonomous driving. However, only a few works have explored GNNs' robustness to adversarial attacks, and their designs are usually limited by the scale of input datasets (i.e., focusing on small graphs with only thousands of nodes). In this work, we propose, SAG, the first scalable adversarial attack method with Alternating Direction Method of Multipliers (ADMM). We first decouple the large-scale graph into several smaller graph partitions and cast the original problem into several subproblems. Then, we propose to solve these subproblems using projected gradient descent on both the graph topology and the node features that lead to considerably lower memory consumption compared to the conventional attack methods. Rigorous experiments further demonstrate that SAG can significantly reduce the computation and memory overhead compared with the state-of-the-art approach, making SAG applicable towards graphs with large size of nodes and edges.

preprint2020arXiv

SGQuant: Squeezing the Last Bit on Graph Neural Networks with Specialized Quantization

With the increasing popularity of graph-based learning, Graph Neural Networks (GNNs) win lots of attention from the research and industry field because of their high accuracy. However, existing GNNs suffer from high memory footprints (e.g., node embedding features). This high memory footprint hurdles the potential applications towards memory-constrained devices, such as the widely-deployed IoT devices. To this end, we propose a specialized GNN quantization scheme, SGQuant, to systematically reduce the GNN memory consumption. Specifically, we first propose a GNN-tailored quantization algorithm design and a GNN quantization fine-tuning scheme to reduce memory consumption while maintaining accuracy. Then, we investigate the multi-granularity quantization strategy that operates at different levels (components, graph topology, and layers) of GNN computation. Moreover, we offer an automatic bit-selecting (ABS) to pinpoint the most appropriate quantization bits for the above multi-granularity quantizations. Intensive experiments show that SGQuant can effectively reduce the memory footprint from 4.25x to 31.9x compared with the original full-precision GNNs while limiting the accuracy drop to 0.4% on average.

preprint2020arXiv

Uncertainty-aware Attention Graph Neural Network for Defending Adversarial Attacks

With the increasing popularity of graph-based learning, graph neural networks (GNNs) emerge as the essential tool for gaining insights from graphs. However, unlike the conventional CNNs that have been extensively explored and exhaustively tested, people are still worrying about the GNNs' robustness under the critical settings, such as financial services. The main reason is that existing GNNs usually serve as a black-box in predicting and do not provide the uncertainty on the predictions. On the other side, the recent advancement of Bayesian deep learning on CNNs has demonstrated its success of quantifying and explaining such uncertainties to fortify CNN models. Motivated by these observations, we propose UAG, the first systematic solution to defend adversarial attacks on GNNs through identifying and exploiting hierarchical uncertainties in GNNs. UAG develops a Bayesian Uncertainty Technique (BUT) to explicitly capture uncertainties in GNNs and further employs an Uncertainty-aware Attention Technique (UAT) to defend adversarial attacks on GNNs. Intensive experiments show that our proposed defense approach outperforms the state-of-the-art solutions by a significant margin.

preprint2019arXiv

Neural Network Model Extraction Attacks in Edge Devices by Hearing Architectural Hints

As neural networks continue their reach into nearly every aspect of software operations, the details of those networks become an increasingly sensitive subject. Even those that deploy neural networks embedded in physical devices may wish to keep the inner working of their designs hidden -- either to protect their intellectual property or as a form of protection from adversarial inputs. The specific problem we address is how, through heavy system stack, given noisy and imperfect memory traces, one might reconstruct the neural network architecture including the set of layers employed, their connectivity, and their respective dimension sizes. Considering both the intra-layer architecture features and the inter-layer temporal association information introduced by the DNN design empirical experience, we draw upon ideas from speech recognition to solve this problem. We show that off-chip memory address traces and PCIe events provide ample information to reconstruct such neural network architectures accurately. We are the first to propose such accurate model extraction techniques and demonstrate an end-to-end attack experimentally in the context of an off-the-shelf Nvidia GPU platform with full system stack. Results show that the proposed techniques achieve a high reverse engineering accuracy and improve the one's ability to conduct targeted adversarial attack with success rate from 14.6\%$\sim$25.5\% (without network architecture knowledge) to 75.9\% (with extracted network architecture).