Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
11 works
0 followers
13 topics
4 close collaborators

Published work

11 published items

preprint · 2026 · arXiv

A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems

Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert parameters across multiple NDP units simultaneously, targeting low-batch edge scenarios. Second, a load-balancing-aware scheduling algorithm distributes expert computations across NDP units and GPU to maximize resource utilization. Third, a dataset-free pre-fetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve an average speedup of 2.41x, and up to 2.56x, in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.
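
As a rough illustration of what load-balancing-aware scheduling means here, the sketch below assigns experts to processing units with the classic longest-processing-time-first greedy heuristic. This is not the paper's algorithm, which the abstract does not specify. The function name `greedy_balance`, the expert names, and the per-expert token counts are all hypothetical.

```python
import heapq

def greedy_balance(expert_loads, num_units):
    """Assign experts to units with the longest-processing-time-first heuristic."""
    # Min-heap of (accumulated load, unit index); the least-loaded unit is on top.
    heap = [(0, unit) for unit in range(num_units)]
    heapq.heapify(heap)
    assignment = {unit: [] for unit in range(num_units)}
    # Place the heaviest experts first, each on the currently least-loaded unit.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, unit = heapq.heappop(heap)
        assignment[unit].append(expert)
        heapq.heappush(heap, (total + load, unit))
    unit_loads = {
        unit: sum(expert_loads[e] for e in experts)
        for unit, experts in assignment.items()
    }
    return assignment, unit_loads

# Hypothetical token counts routed to each expert (non-uniform expert selection).
expert_loads = {"e0": 9, "e1": 7, "e2": 4, "e3": 3, "e4": 2, "e5": 1}
assignment, unit_loads = greedy_balance(expert_loads, num_units=3)
print(assignment)
print(unit_loads)
```

With these made-up loads the three units end up within one token of each other, whereas a naive round-robin placement would not account for the skew in expert popularity.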

preprint · 2026 · arXiv

CD-PIM: A High-Bandwidth and Compute-Efficient LPDDR5-Based PIM for Low-Batch LLM Acceleration on Edge-Device

Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplication (GEMV) operations. While digital processing-in-memory (PIM) architectures promise to accelerate GEMV operations, existing PIM-equipped edge devices still suffer from three key limitations: limited bandwidth improvement, component under-utilization in mixed workloads, and low compute capacity of computing units (CUs). In this paper, we propose CD-PIM to address these challenges through three key innovations. First, we introduce a high-bandwidth compute-efficient mode (HBCEM) that enhances bandwidth by dividing each bank into four pseudo-banks through segmented global bitlines. Second, we propose a low-batch interleaving mode (LBIM) to improve component utilization by overlapping GEMV operations with GEMM operations. Third, we design a compute-efficient CU that performs enhanced GEMV operations in a pipelined manner by serially feeding weight data into the computing core. Fourth, we adopt a column-wise mapping for the key-cache matrix and row-wise mapping for the value-cache matrix, which fully utilizes CU resources. Our evaluation shows that compared to a GPU-only baseline and state-of-the-art PIM designs, our CD-PIM achieves 11.42x and 4.25x speedup on average within a single batch in HBCEM mode, respectively. Moreover, for low-batch sizes, the CD-PIM achieves an average speedup of 1.12x in LBIM compared to HBCEM.
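
To see why low-batch GEMV is bound by memory bandwidth rather than compute, a back-of-the-envelope arithmetic-intensity estimate helps. The sketch below is not taken from the paper. It assumes a hypothetical 4096 x 4096 weight matrix, 2-byte (FP16) elements, and an idealized model in which every operand is read once and the output is written once.

```python
def arithmetic_intensity(m, n, batch, bytes_per_element=2):
    """FLOPs per byte moved for an (m x n) weight matrix times an (n x batch) input."""
    # One multiply and one add per weight element per batch column.
    flops = 2 * m * n * batch
    # Weights + input activations + output activations, each touched once (idealized).
    elements_moved = m * n + n * batch + m * batch
    return flops / (elements_moved * bytes_per_element)

gemv = arithmetic_intensity(4096, 4096, batch=1)   # matrix-vector (single batch)
gemm = arithmetic_intensity(4096, 4096, batch=64)  # matrix-matrix (batch of 64)
print(f"GEMV: {gemv:.2f} FLOPs/byte, GEMM: {gemm:.2f} FLOPs/byte")
```

Under these assumptions GEMV performs roughly one FLOP per byte fetched, while the batched GEMM reuses each weight across the batch and reaches about 62 FLOPs per byte, which is why the two kinds of operation stress different parts of the hardware.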

preprint · 2023 · arXiv

Designing Filter Functions of Frequency-Modulated Pulses for High-Fidelity Two-Qubit Gates in Ion Chains

High-fidelity two-qubit gates in quantum computers are often hampered by fluctuating experimental parameters. The effects of time-varying parameter fluctuations lead to coherent noise on the qubits, which can be suppressed by designing control signals with appropriate filter functions. Here, we develop filter functions for Mølmer-Sørensen gates of trapped-ion quantum computers that accurately predict the change in gate error due to small parameter fluctuations at any frequency. We then design the filter functions of frequency-modulated laser pulses, and compare this method with pulses that are robust to static offsets of the motional-mode frequencies. Experimentally, we measure the noise spectrum of the motional modes and use it for designing the filter functions, which improves the gate fidelity from 99.23(7)% to 99.55(7)% in a five-ion chain.

preprint · 2023 · arXiv

Realization of Scalable Cirac-Zoller Multi-Qubit Gates

The universality theorem in quantum computing states that any quantum computational task can be decomposed into a finite set of logic gates operating on one and two qubits. However, the process of such decomposition is generally inefficient, often leading to exponentially many gates to realize an arbitrary computational task. Practical processor designs benefit greatly from the availability of multi-qubit gates that operate on more than two qubits to implement the desired circuit. In 1995, Cirac and Zoller proposed a method to realize native multi-qubit controlled-$Z$ gates in trapped ion systems, which has a stringent requirement on ground-state cooling of the motional modes utilized by the gate. An alternative approach, the Mølmer-Sørensen gate, is robust against residual motional excitation and has been a foundation for many high-fidelity gate demonstrations. This gate does not scale well beyond two qubits, incurring additional overhead when used to construct many target algorithms. Here, we take advantage of novel performance benefits of long ion chains to realize fully programmable and scalable high-fidelity Cirac-Zoller gates.

preprint · 2022 · arXiv

Determination of Multi-mode Motional Quantum States in a Trapped Ion System

Trapped atomic ions are a versatile platform for studying interactions between spins and bosons by coupling the internal states of the ions to their motion. Measurement of complex motional states with multiple modes is challenging, because all motional state populations can only be measured indirectly through the spin state of ions. Here we present a general method to determine the Fock state distributions and to reconstruct the density matrix of an arbitrary multi-mode motional state. We experimentally verify the method using different entangled states of multiple radial modes in a 5-ion chain. This method can be extended to any system with Jaynes-Cummings type interactions.

preprint · 2022 · arXiv

Hidden Inverses: Coherent Error Cancellation at the Circuit Level

Coherent gate errors are a concern in many proposed quantum computing architectures. These errors can be effectively handled through composite pulse sequences for single-qubit gates; however, such techniques are less feasible for entangling operations. In this work, we benchmark our coherent errors by comparing the actual performance of composite single-qubit gates to the predicted performance based on characterization of individual single-qubit rotations. We then propose a compilation technique, which we refer to as hidden inverses, that creates circuits robust to these coherent errors. We present experimental data showing that these circuits suppress both overrotation and phase misalignment errors in our trapped ion system.

preprint · 2020 · arXiv

Drying of porous media by concurrent drainage and evaporation: A pore network modeling study

Drainage and evaporation can occur simultaneously during the drying of porous media, but the interactions between these processes and their effects on drying are rarely studied. In this work, we develop a pore network model that considers drainage, evaporation, and rarefied multi-component gas transport in porous media with nanoscale pores. Using this model, we investigate the drying of a liquid solvent-saturated porous medium enabled by the flow of purge gas through it. Simulations show that drying progresses in three stages, and the solvent removal by drainage effects (evaporation effects) becomes increasingly weak (strong) as drying progresses through these stages. Interestingly, drainage can contribute considerably to solvent removal even after evaporation effects become very strong, especially when the applied pressure difference across the porous medium is low. We show that these phenomena are the results of the coupling between the drainage and evaporation effects and this coupling depends on the operating conditions and the stage of drying.

preprint · 2020 · arXiv

Machine learning in physics: The pitfalls of poisoned training sets

Known for their ability to identify hidden patterns in data, artificial neural networks are among the most powerful machine learning tools. Most notably, neural networks have played a central role in identifying states of matter and phase transitions across condensed matter physics. To date, most studies have focused on systems where different phases of matter and their phase transitions are known, and thus the performance of neural networks is well controlled. While neural networks present an exciting new tool to detect new phases of matter, here we demonstrate that when the training sets are poisoned (i.e., poor training data or mislabeled data) it is easy for neural networks to make misleading predictions.
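
The effect of mislabeled training data can be shown with a deliberately tiny stand-in. The sketch below does not use a neural network or the paper's data. It fits a one-dimensional threshold classifier (midpoint of the two class means) on a hypothetical dataset in which integer values 0 to 9 belong to phase 0 and 10 to 19 to phase 1, then repeats the fit after relabeling the phase-1 samples nearest the boundary.

```python
def fit_threshold(samples):
    """Fit a 1-D classifier: the threshold is the midpoint of the two class means."""
    class0 = [x for x, label in samples if label == 0]
    class1 = [x for x, label in samples if label == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def accuracy(threshold, samples):
    """Fraction of samples whose prediction (1 if x > threshold else 0) matches the label."""
    correct = sum(1 for x, label in samples if (1 if x > threshold else 0) == label)
    return correct / len(samples)

# Hypothetical ground truth: values 0-9 are phase 0, values 10-19 are phase 1.
clean = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
# Poisoned copy: phase-1 samples near the boundary (10-14) are mislabeled as phase 0.
poisoned = [(x, 0 if x < 15 else 1) for x in range(20)]

# Both models are scored against the clean (true) labels.
clean_acc = accuracy(fit_threshold(clean), clean)
poisoned_acc = accuracy(fit_threshold(poisoned), clean)
print(clean_acc, poisoned_acc)
```

The model trained on poisoned labels places the phase boundary at 12 instead of 9.5 and would score perfectly against its own (wrong) labels, which is the sense in which a poisoned training set can produce confident but misleading predictions.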

preprint · 2020 · arXiv

On Integrated Access and Backhaul Networks: Current Status and Potentials

In this paper, we introduce and study the potential and challenges of integrated access and backhaul (IAB) as one of the promising techniques for evolving 5G networks. We study IAB networks from different perspectives. We summarize the recent Rel-16 as well as the upcoming Rel-17 3GPP discussions on IAB, and highlight the main IAB-specific agreements on different protocol layers. Also, concentrating on millimeter wave-based communications, we evaluate the performance of IAB networks in both dense and suburban areas. Using a finite stochastic geometry model, with random distributions of IAB nodes as well as user equipments (UEs) in a finite region, we study the service coverage rate defined as the probability of the event that the UEs' minimum rate requirements are satisfied. We present comparisons between IAB and hybrid IAB/fiber-backhauled networks where a part or all of the small base stations are fiber-connected. Finally, we study the robustness of IAB networks to weather and various deployment conditions, and verify the effects of blockage, tree foliage, rain, and antenna height/gain on the coverage rate of IAB setups, as the key differences between fiber-connected and IAB networks. As we show, IAB is an attractive approach to enable the network densification required by 5G and beyond.

preprint · 2020 · arXiv

PillarFlow: End-to-end Birds-eye-view Flow Estimation for Autonomous Driving

In autonomous driving, accurately estimating the state of surrounding obstacles is critical for safe and robust path planning. However, this perception task is difficult, particularly for generic obstacles/objects, due to appearance and occlusion changes. To tackle this problem, we propose an end-to-end deep learning framework for LIDAR-based flow estimation in bird's eye view (BeV). Our method takes consecutive point cloud pairs as input and produces a 2-D BeV flow grid describing the dynamic state of each cell. The experimental results show that the proposed method not only estimates 2-D BeV flow accurately but also improves tracking performance of both dynamic and static objects.

preprint · 2020 · arXiv

Real-Time Panoptic Segmentation from Dense Detections

Panoptic segmentation is a complex full scene parsing task requiring simultaneous instance and semantic segmentation at high resolution. Current state-of-the-art approaches cannot run in real-time, and simplifying these architectures to improve efficiency severely degrades their accuracy. In this paper, we propose a new single-shot panoptic segmentation network that leverages dense detections and a global self-attention mechanism to operate in real-time with performance approaching the state of the art. We introduce a novel parameter-free mask construction method that substantially reduces computational complexity by efficiently reusing information from the object detection and semantic segmentation sub-tasks. The resulting network has a simple data flow that does not require feature map re-sampling or clustering post-processing, enabling significant hardware acceleration. Our experiments on the Cityscapes and COCO benchmarks show that our network works at 30 FPS on 1024x2048 resolution, trading a 3% relative performance degradation from the current state of the art for up to 440% faster inference.