Source author record

Chao Fang

Chao Fang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Catalog footprint

What is connected

15 works

20 topics

4 close collaborators

preprint, 2026, arXiv

A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems

Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition large expert parameters across multiple NDP units and compute them simultaneously, targeting low-batch edge scenarios. Second, a load-balancing-aware scheduling algorithm distributes expert computations across NDP units and the GPU to maximize resource utilization. Third, a dataset-free pre-fetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve a speedup of 2.41x on average and up to 2.56x in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.
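To make the load-balancing idea in this abstract concrete, here is a minimal sketch of a generic greedy balancer (longest-processing-time first). It is not the scheduling algorithm from the paper, which the abstract does not specify. The function name `assign_experts`, the expert IDs, and the cost values are all hypothetical, and every processing unit is assumed to be equally fast.

```python
import heapq


def assign_experts(expert_loads, num_units):
    """Greedy longest-processing-time assignment (illustrative only).

    expert_loads: dict mapping a hypothetical expert id to an estimated cost.
    num_units: number of identical processing units (an assumption).
    Returns (assignment, totals): the experts placed on each unit and the
    summed cost per unit.
    """
    # Min-heap of (current total load, unit index): the least-loaded unit pops first.
    heap = [(0.0, unit) for unit in range(num_units)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_units)]
    totals = [0.0] * num_units

    # Place the most expensive experts first, each on the least-loaded unit.
    for expert, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, unit = heapq.heappop(heap)
        assignment[unit].append(expert)
        totals[unit] = total + load
        heapq.heappush(heap, (totals[unit], unit))
    return assignment, totals


# Hypothetical per-expert costs, not measurements from the paper.
loads = {"e0": 8, "e1": 7, "e2": 6, "e3": 5, "e4": 4}
assignment, totals = assign_experts(loads, 2)
print(assignment, totals)
```

A scheduler for a real GPU-NDP system would also need to account for the different throughput of the GPU and the NDP units, which this sketch deliberately ignores.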

preprint, 2026, arXiv

CD-PIM: A High-Bandwidth and Compute-Efficient LPDDR5-Based PIM for Low-Batch LLM Acceleration on Edge-Device

Edge deployment of low-batch large language models (LLMs) faces critical memory bandwidth bottlenecks when executing memory-intensive general matrix-vector multiplication (GEMV) operations. While digital processing-in-memory (PIM) architectures promise to accelerate GEMV operations, existing PIM-equipped edge devices still suffer from three key limitations: limited bandwidth improvement, component under-utilization in mixed workloads, and low compute capacity of computing units (CUs). In this paper, we propose CD-PIM to address these challenges through three key innovations. First, we introduce a high-bandwidth compute-efficient mode (HBCEM) that enhances bandwidth by dividing each bank into four pseudo-banks through segmented global bitlines. Second, we propose a low-batch interleaving mode (LBIM) to improve component utilization by overlapping GEMV operations with GEMM operations. Third, we design a compute-efficient CU that performs enhanced GEMV operations in a pipelined manner by serially feeding weight data into the computing core. Fourth, we adopt a column-wise mapping for the key-cache matrix and a row-wise mapping for the value-cache matrix, which fully utilizes CU resources. Our evaluation shows that compared to a GPU-only baseline and state-of-the-art PIM designs, our CD-PIM achieves 11.42x and 4.25x speedup on average within a single batch in HBCEM mode, respectively. Moreover, for low-batch sizes, the CD-PIM achieves an average speedup of 1.12x in LBIM compared to HBCEM.

preprint, 2023, arXiv

Designing Filter Functions of Frequency-Modulated Pulses for High-Fidelity Two-Qubit Gates in Ion Chains

High-fidelity two-qubit gates in quantum computers are often hampered by fluctuating experimental parameters. The effects of time-varying parameter fluctuations lead to coherent noise on the qubits, which can be suppressed by designing control signals with appropriate filter functions. Here, we develop filter functions for Mølmer-Sørensen gates of trapped-ion quantum computers that accurately predict the change in gate error due to small parameter fluctuations at any frequency. We then design the filter functions of frequency-modulated laser pulses, and compare this method with pulses that are robust to static offsets of the motional-mode frequencies. Experimentally, we measure the noise spectrum of the motional modes and use it for designing the filter functions, which improves the gate fidelity from 99.23(7)% to 99.55(7)% in a five-ion chain.

preprint, 2023, arXiv

Realization of Scalable Cirac-Zoller Multi-Qubit Gates

The universality theorem in quantum computing states that any quantum computational task can be decomposed into a finite set of logic gates operating on one and two qubits. However, the process of such decomposition is generally inefficient, often leading to exponentially many gates to realize an arbitrary computational task. Practical processor designs benefit greatly from availability of multi-qubit gates that operate on more than two qubits to implement the desired circuit. In 1995, Cirac and Zoller proposed a method to realize native multi-qubit controlled-$Z$ gates in trapped ion systems, which has a stringent requirement on ground-state cooling of the motional modes utilized by the gate. An alternative approach, the Mølmer-Sørensen gate, is robust against residual motional excitation and has been a foundation for many high-fidelity gate demonstrations. This gate does not scale well beyond two qubits, incurring additional overhead when used to construct many target algorithms. Here, we take advantage of novel performance benefits of long ion chains to realize fully programmable and scalable high-fidelity Cirac-Zoller gates.

preprint, 2022, arXiv

Determination of Multi-mode Motional Quantum States in a Trapped Ion System

Trapped atomic ions are a versatile platform for studying interactions between spins and bosons by coupling the internal states of the ions to their motion. Measurement of complex motional states with multiple modes is challenging, because all motional state populations can only be measured indirectly through the spin state of ions. Here we present a general method to determine the Fock state distributions and to reconstruct the density matrix of an arbitrary multi-mode motional state. We experimentally verify the method using different entangled states of multiple radial modes in a 5-ion chain. This method can be extended to any system with Jaynes-Cummings type interactions.

preprint, 2022, arXiv

Hidden Inverses: Coherent Error Cancellation at the Circuit Level

Coherent gate errors are a concern in many proposed quantum computing architectures. These errors can be effectively handled through composite pulse sequences for single-qubit gates; however, such techniques are less feasible for entangling operations. In this work, we benchmark our coherent errors by comparing the actual performance of composite single-qubit gates to the predicted performance based on characterization of individual single-qubit rotations. We then propose a compilation technique, which we refer to as hidden inverses, that creates circuits robust to these coherent errors. We present experimental data showing that these circuits suppress both overrotation and phase misalignment errors in our trapped ion system.

preprint, 2020, arXiv

Drying of porous media by concurrent drainage and evaporation: A pore network modeling study

Drainage and evaporation can occur simultaneously during the drying of porous media, but the interactions between these processes and their effects on drying are rarely studied. In this work, we develop a pore network model that considers drainage, evaporation, and rarefied multi-component gas transport in porous media with nanoscale pores. Using this model, we investigate the drying of a liquid solvent-saturated porous medium enabled by the flow of purge gas through it. Simulations show that drying progresses in three stages, and the solvent removal by drainage effects (evaporation effects) becomes increasingly weak (strong) as drying progresses through these stages. Interestingly, drainage can contribute considerably to solvent removal even after evaporation effects become very strong, especially when the applied pressure difference across the porous medium is low. We show that these phenomena are the results of the coupling between the drainage and evaporation effects and this coupling depends on the operating conditions and the stage of drying.

preprint, 2020, arXiv

Machine learning in physics: The pitfalls of poisoned training sets

Known for their ability to identify hidden patterns in data, artificial neural networks are among the most powerful machine learning tools. Most notably, neural networks have played a central role in identifying states of matter and phase transitions across condensed matter physics. To date, most studies have focused on systems where different phases of matter and their phase transitions are known, and thus the performance of neural networks is well controlled. While neural networks present an exciting new tool to detect new phases of matter, here we demonstrate that when the training sets are poisoned (i.e., poor training data or mislabeled data) it is easy for neural networks to make misleading predictions.

preprint, 2020, arXiv

On Integrated Access and Backhaul Networks: Current Status and Potentials

In this paper, we introduce and study the potentials and challenges of integrated access and backhaul (IAB) as one of the promising techniques for evolving 5G networks. We study IAB networks from different perspectives. We summarize the recent Rel-16 as well as the upcoming Rel-17 3GPP discussions on IAB, and highlight the main IAB-specific agreements on different protocol layers. Also, concentrating on millimeter wave-based communications, we evaluate the performance of IAB networks in both dense and suburban areas. Using a finite stochastic geometry model, with random distributions of IAB nodes as well as user equipments (UEs) in a finite region, we study the service coverage rate defined as the probability of the event that the UEs' minimum rate requirements are satisfied. We present comparisons between IAB and hybrid IAB/fiber-backhauled networks where a part or all of the small base stations are fiber-connected. Finally, we study the robustness of IAB networks to weather and various deployment conditions and verify their effects, such as blockage, tree foliage, rain as well as antenna height/gain on the coverage rate of IAB setups, as the key differences between the fiber-connected and IAB networks. As we show, IAB is an attractive approach to enable the network densification required by 5G and beyond.

preprint, 2020, arXiv

PillarFlow: End-to-end Birds-eye-view Flow Estimation for Autonomous Driving

In autonomous driving, accurately estimating the state of surrounding obstacles is critical for safe and robust path planning. However, this perception task is difficult, particularly for generic obstacles/objects, due to appearance and occlusion changes. To tackle this problem, we propose an end-to-end deep learning framework for LIDAR-based flow estimation in bird's eye view (BeV). Our method takes consecutive point cloud pairs as input and produces a 2-D BeV flow grid describing the dynamic state of each cell. The experimental results show that the proposed method not only estimates 2-D BeV flow accurately but also improves tracking performance of both dynamic and static objects.

preprint, 2020, arXiv

Real-Time Panoptic Segmentation from Dense Detections

Panoptic segmentation is a complex full scene parsing task requiring simultaneous instance and semantic segmentation at high resolution. Current state-of-the-art approaches cannot run in real-time, and simplifying these architectures to improve efficiency severely degrades their accuracy. In this paper, we propose a new single-shot panoptic segmentation network that leverages dense detections and a global self-attention mechanism to operate in real-time with performance approaching the state of the art. We introduce a novel parameter-free mask construction method that substantially reduces computational complexity by efficiently reusing information from the object detection and semantic segmentation sub-tasks. The resulting network has a simple data flow that does not require feature map re-sampling or clustering post-processing, enabling significant hardware acceleration. Our experiments on the Cityscapes and COCO benchmarks show that our network works at 30 FPS on 1024x2048 resolution, trading a 3% relative performance degradation from the current state of the art for up to 440% faster inference.

preprint, 2016, arXiv

borealis - A generalized global update algorithm for Boolean optimization problems

Optimization problems with Boolean variables that fall into the nondeterministic polynomial (NP) class are of fundamental importance in computer science, mathematics, physics and industrial applications. Most notably, solving constraint-satisfaction problems, which are related to spin-glass-like Hamiltonians in physics, remains a difficult numerical task. As such, there has been great interest in designing efficient heuristics to solve these computationally difficult problems. Inspired by parallel tempering Monte Carlo in conjunction with the rejection-free isoenergetic cluster algorithm developed for Ising spin glasses, we present a generalized global update optimization heuristic that can be applied to different NP-complete problems with Boolean variables. The global cluster updates allow for a wide-spread sampling of phase space, thus considerably speeding up optimization. By carefully tuning the pseudo-temperature (needed to randomize the configurations) of the problem, we show that the method can efficiently tackle optimization problems with over-constraints or on topologies with a large site-percolation threshold. We illustrate the efficiency of the heuristic on paradigmatic optimization problems, such as the maximum satisfiability problem and the vertex cover problem.

preprint, 2016, arXiv

Content-Centric and Software-Defined Networking with Big Data

Many communities have researched the application of novel network architectures such as Content-Centric Networking (CCN) and Software-Defined Networking (SDN) to build the future Internet. Big data analysis, another emerging technology, has also attracted considerable attention from both academia and industry. Much research has been done on CCN, SDN, and big data, but the three have been addressed separately in the literature. In this paper, we propose a novel network paradigm that jointly considers CCN, SDN, and big data, and we describe its architecture, internal data flow, big data processing, and use cases that indicate its benefits and applicability. Simulation results show the potential benefits of the proposed network paradigm. We refer to this novel paradigm as Data-Driven Networking (DDN).

preprint, 2014, arXiv

Global quantum discord in infinite quantum spin chains

In this paper, we study global quantum discord (GQD) in infinite-size spin chains. For this purpose, in the framework of matrix product states (MPSs), we propose an effective procedure to calculate GQD (denoted as Gn) for consecutive n-site subchains in infinite chains. For a spin-1/2 three-body interaction model, whose ground state can be exactly expressed as MPSs, we use the procedure to study Gn with n up to $24$. Then, for a spin-1/2 XXZ chain, we first use the infinite time-evolving block decimation (iTEBD) algorithm to obtain the approximate wavefunction in the form of MPSs, and then compute Gn with n up to $18$. In both models, Gn grows linearly with n, that is, Gn = k*n+b. Moreover, in non-critical regions the slope $k$ of Gn converges very fast, while in critical regions it converges relatively slowly, and these behaviors are explained by a clear physical picture based on short-range and long-range correlations. Based on these results, we propose to use Gn/n to describe the global correlations in infinite chains. Gn/n has a twofold physical meaning. First, it can be regarded as "global discord per site", very similar to "energy per site" or "magnetization per site" in quantum magnetic systems. Second, Gn/n (when n is large enough) describes the quantum correlation between a single site and an (n-1)-site block. We then successfully apply our theory to an exactly soluble infinite-size spin XY chain, which is beyond the matrix product formula and whose Hamiltonian can reduce to the transverse-field Ising model and the XX model. The relation between GQD and quantum phase transitions in these models is discussed.
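The linear relation Gn = k*n+b quoted in this abstract implies that Gn/n = k + b/n, so Gn/n approaches the slope k as n grows. The short sketch below illustrates this with a plain least-squares fit. The values of `K_TRUE` and `B_TRUE` and the helper `fit_line` are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
def fit_line(ns, gs):
    """Ordinary least-squares fit of g = k*n + b; returns (k, b)."""
    count = len(ns)
    mean_n = sum(ns) / count
    mean_g = sum(gs) / count
    sxx = sum((n - mean_n) ** 2 for n in ns)
    sxy = sum((n - mean_n) * (g - mean_g) for n, g in zip(ns, gs))
    k = sxy / sxx
    b = mean_g - k * mean_n
    return k, b


# Hypothetical slope and intercept, used to generate synthetic Gn values.
K_TRUE, B_TRUE = 0.25, -0.1
ns = list(range(2, 25))
gs = [K_TRUE * n + B_TRUE for n in ns]

k, b = fit_line(ns, gs)

# Gn/n = k + b/n, so the per-site value drifts toward k as n increases.
per_site = [g / n for n, g in zip(ns, gs)]
print(k, b, per_site[0], per_site[-1])
```

With exact synthetic data the fit recovers the chosen slope and intercept, and the last entry of `per_site` lies closer to `k` than the first.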

preprint, 2012, arXiv

The trouble with asymptotically safe inflation

In this paper we investigate the perturbation theory of asymptotically safe inflation and find that all modes of the gravitational wave perturbations become ghosts when a large enough number of e-folds is required. Formally we can calculate the power spectrum of the gravitational wave perturbations, but we find that it is negative. This indicates that there is serious trouble with asymptotically safe inflation.
