Researcher profile

Han Cai

Han Cai contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
13 works
0 followers
12 topics
4 close collaborators

Published work

13 published item(s)

preprint, 2026, arXiv

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Few-step video generation has been significantly advanced by consistency distillation. However, the performance of consistency-distilled models often degrades as more sampling steps are allocated at test time, limiting their effectiveness for any-step video diffusion. This limitation arises because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, weakening the desirable test-time scaling behavior of ODE sampling. To address this limitation, we introduce AnyFlow, the first any-step video diffusion distillation framework based on flow maps. Instead of distilling a model for only a few fixed sampling steps, AnyFlow optimizes the full ODE sampling trajectory. To this end, we shift the distillation target from endpoint consistency mapping $(z_{t}\rightarrow z_{0})$ to flow-map transition learning $(z_{t}\rightarrow z_{r})$ over arbitrary time intervals. We further propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors (i.e., discretization error in few-step sampling and exposure bias in causal generation). Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters, demonstrate that AnyFlow matches or surpasses consistency-based counterparts in the few-step regime, while scaling with sampling step budgets.
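To make the difference between an endpoint mapping $(z_{t}\rightarrow z_{0})$ and a flow-map transition $(z_{t}\rightarrow z_{r})$ concrete, here is a toy sketch. It does not reproduce AnyFlow or any video model. It assumes a hypothetical one-dimensional ODE $dz/dt = z$, chosen only because its flow map is known in closed form, and the names `velocity`, `flow_map`, and `euler_rollout` are illustrative.

```python
import math


def velocity(z: float, t: float) -> float:
    """Hypothetical velocity field dz/dt = z (assumed for this toy example)."""
    return z


def flow_map(z_t: float, t: float, r: float) -> float:
    """Exact transition z_t -> z_r for dz/dt = z between times t and r."""
    return z_t * math.exp(r - t)


def euler_rollout(z_start: float, t_start: float, t_end: float, steps: int) -> float:
    """Integrate the ODE from t_start to t_end with fixed-size Euler steps."""
    z, t = z_start, t_start
    dt = (t_end - t_start) / steps
    for _ in range(steps):
        z += dt * velocity(z, t)
        t += dt
    return z


z1 = 1.0  # state at t = 1 in this toy setup

# The endpoint mapping is the special case r = 0 of the flow map.
exact_z0 = flow_map(z1, 1.0, 0.0)

# Flow-map transitions compose: 1 -> 0.5 -> 0 reaches the same endpoint.
two_hops = flow_map(flow_map(z1, 1.0, 0.5), 0.5, 0.0)

# The Euler discretization error shrinks as the step budget grows.
for steps in (1, 2, 4, 100):
    err = abs(euler_rollout(z1, 1.0, 0.0, steps) - exact_z0)
    print(f"{steps:>3} Euler steps: |error| = {err:.4f}")
```

In this toy setting the exact flow map covers any interval in one evaluation, whereas a plain Euler rollout needs many steps to approach the same endpoint. That gap is the few-step discretization error the abstract refers to.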

preprint, 2022, arXiv

A Bound on the Minimal Field Size of LRCs, and Cyclic MR Codes That Attain It

We prove a new lower bound on the field size of locally repairable codes (LRCs). Additionally, we construct maximally recoverable (MR) codes which are cyclic. While a known construction for MR codes has the same parameters, it produces non-cyclic codes. Furthermore, we prove both necessary conditions and sufficient conditions that specify when the known non-cyclic MR codes may be permuted to become cyclic, thus proving our construction produces cyclic MR codes with new parameters. Furthermore, using our new bound on the field size, we show that the new cyclic MR codes have optimal field size in certain cases. Other known LRCs are also shown to have optimal field size in certain cases.

preprint, 2022, arXiv

A New Cooperative Repair Scheme with k + 1 Helper Nodes for (n, k) Hadamard MSR codes with Small Sub-packetization

The cooperative repair model is a technique for handling multiple node failures in distributed storage systems. Recently, explicit constructions of cooperative MSR codes were given by Ye (IEEE Transactions on Information Theory, 2020) with sub-packetization level $(d-k+h)(d-k+1)^n$. Specifically, the sub-packetization level is $(h+1)2^n$ when $d=k+1$. In this paper, we propose a new cooperative repair scheme with $d=k+1$ helper nodes, based on the inter-instance and intra-instance pairing inherited from the perfect code. It reduces the sub-packetization to $2^n$ when $(h+1)|2^n$, and to $(2\ell+1)2^n$ when $h+1=(2\ell+1)2^m$ for $m\ge 0$, $\ell\ge 1$. That is, the sub-packetization is smaller than Ye's by a factor of $h+1$ or $2^m$, respectively. This is the best result known so far.
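The two cases in the abstract can be checked with a little arithmetic. The sketch below only evaluates the formulas quoted above for $d=k+1$ and does not implement the repair scheme. The function names are hypothetical, and inputs outside the two stated cases are rejected rather than guessed.

```python
def ye_subpacketization(n: int, h: int) -> int:
    """Level (h + 1) * 2**n quoted in the abstract for Ye's construction, d = k + 1."""
    return (h + 1) * 2**n


def new_subpacketization(n: int, h: int) -> int:
    """Level stated for the new scheme with d = k + 1, following the abstract's two cases."""
    full = 2**n
    if full % (h + 1) == 0:
        # Case (h + 1) | 2^n: the level drops to 2^n.
        return full
    # Otherwise write h + 1 = (2*l + 1) * 2**m by stripping factors of two.
    odd = h + 1
    while odd % 2 == 0:
        odd //= 2
    if odd == 1:
        # h + 1 is a power of two larger than 2^n, which the abstract does not cover.
        raise ValueError("case not covered by the abstract")
    # Case h + 1 = (2*l + 1) * 2^m with l >= 1: the level is (2*l + 1) * 2^n.
    return odd * full


n = 4
for h in (1, 2, 3, 5):
    print(h, ye_subpacketization(n, h), new_subpacketization(n, h))
```

For example, with $n=4$ and $h=5$ we have $h+1=6=3\cdot 2^1$, so the quoted levels are $96$ for Ye's construction and $48$ for the new scheme, a reduction by $2^m=2$.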

preprint, 2022, arXiv

Enable Deep Learning on Mobile Devices: Methods, Systems, and Applications

Deep neural networks (DNNs) have achieved unprecedented success in the field of artificial intelligence (AI), including computer vision, natural language processing and speech recognition. However, their superior performance comes at the considerable cost of computational complexity, which greatly hinders their application on many resource-constrained devices, such as mobile phones and Internet of Things (IoT) devices. Therefore, methods and techniques that are able to lift the efficiency bottleneck while preserving the high accuracy of DNNs are in great demand in order to enable numerous edge AI applications. This paper provides an overview of efficient deep learning methods, systems and applications. We start by introducing popular model compression methods, including pruning, factorization, quantization as well as compact model design. To reduce the large design cost of these manual solutions, we discuss the AutoML framework for each of them, such as neural architecture search (NAS) and automated pruning and quantization. We then cover efficient on-device training to enable user customization based on the local data on mobile devices. Apart from general acceleration techniques, we also showcase several task-specific accelerations for point cloud, video and natural language processing by exploiting their spatial sparsity and temporal/token redundancy. Finally, to support all these algorithmic advancements, we introduce efficient deep learning system design from both software and hardware perspectives.

preprint, 2022, arXiv

Floquet superradiance lattices in thermal atoms

Floquet modulation has been widely used in optical lattices for coherent control of quantum gases, in particular for synthesizing artificial gauge fields and simulating topological matter. However, such modulation induces heating which can overwhelm the signal of quantum dynamics in ultracold atoms. Here we report that the thermal motion, instead of being a noise source, provides a new control knob in Floquet-modulated superradiance lattices, which are momentum-space tight-binding lattices of collectively excited states of atoms. The Doppler shifts combined with Floquet modulation provide effective forces along arbitrary directions in a lattice in frequency and momentum dimensions. Dynamic localization, dynamic delocalization and chiral edge currents can be simultaneously observed from a single transport spectrum of superradiance lattices in thermal atoms. Our work paves the way for simulating Floquet topological matter in room-temperature atoms and facilitates its applications in photonic devices.

preprint, 2022, arXiv

Lite Pose: Efficient Architecture Design for 2D Human Pose Estimation

Pose estimation plays a critical role in human-centered vision applications. However, it is difficult to deploy state-of-the-art HRNet-based pose estimation models on resource-constrained edge devices due to the high computational cost (more than 150 GMACs per frame). In this paper, we study efficient architecture design for real-time multi-person pose estimation on edge. We reveal that HRNet's high-resolution branches are redundant for models at the low-computation region via our gradual shrinking experiments. Removing them improves both efficiency and performance. Inspired by this finding, we design LitePose, an efficient single-branch architecture for pose estimation, and introduce two simple approaches to enhance the capacity of LitePose, including Fusion Deconv Head and Large Kernel Convs. Fusion Deconv Head removes the redundancy in high-resolution branches, allowing scale-aware feature fusion with low overhead. Large Kernel Convs significantly improve the model's capacity and receptive field while maintaining a low computational cost. With only 25% computation increment, 7x7 kernels achieve +14.0 mAP better than 3x3 kernels on the CrowdPose dataset. On mobile platforms, LitePose reduces the latency by up to 5.0x without sacrificing performance, compared with prior state-of-the-art efficient pose estimation models, pushing the frontier of real-time multi-person pose estimation on edge. Our code and pre-trained models are released at https://github.com/mit-han-lab/litepose.

preprint, 2022, arXiv

Network Augmentation for Tiny Deep Learning

We introduce Network Augmentation (NetAug), a new training method for improving the performance of tiny neural networks. Existing regularization techniques (e.g., data augmentation, dropout) have shown much success on large neural networks by adding noise to overcome over-fitting. However, we found these techniques hurt the performance of tiny neural networks. We argue that training tiny models is different from training large models: rather than augmenting the data, we should augment the model, since tiny models tend to suffer from under-fitting rather than over-fitting due to limited capacity. To alleviate this issue, NetAug augments the network (reverse dropout) instead of inserting noise into the dataset or the network. It puts the tiny model into larger models and encourages it to work as a sub-model of larger models to get extra supervision, in addition to functioning as an independent model. At test time, only the tiny model is used for inference, incurring zero inference overhead. We demonstrate the effectiveness of NetAug on image classification and object detection. NetAug consistently improves the performance of tiny models, achieving up to 2.2% accuracy improvement on ImageNet. On object detection, at the same level of performance, NetAug requires 41% fewer MACs on Pascal VOC and 38% fewer MACs on COCO than the baseline.

preprint, 2020, arXiv

APQ: Joint Search for Network Architecture, Pruning and Quantization Policy

We present APQ for efficient deep learning inference on resource-constrained hardware. Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner. To deal with the larger design space it brings, a promising approach is to train a quantization-aware accuracy predictor to quickly get the accuracy of the quantized model and feed it to the search engine to select the best fit. However, training this quantization-aware accuracy predictor requires collecting a large number of quantized <model, accuracy> pairs, which involves quantization-aware finetuning and thus is highly time-consuming. To tackle this challenge, we propose to transfer the knowledge from a full-precision (i.e., fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy predictor, which greatly improves the sample efficiency. Besides, collecting the dataset for the fp32 accuracy predictor only requires evaluating neural networks sampled from a pretrained once-for-all network, without any training cost, which is highly efficient. Extensive experiments on ImageNet demonstrate the benefits of our joint optimization approach. With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ. Compared to the separate optimization approach (ProxylessNAS+AMC+HAQ), APQ achieves 2.3% higher ImageNet accuracy while reducing GPU hours and CO2 emissions by orders of magnitude, pushing the frontier for environmentally friendly green AI. The code and video are publicly available.

preprint, 2020, arXiv

Many-body chiral edge currents and sliding phases of atomic spinwaves in momentum-space lattice

Collective excitations (spinwaves) of long-lived atomic hyperfine states can be synthesized into a Bose-Hubbard model in momentum space. We explore many-body ground states and dynamics of a two-leg momentum-space lattice formed by two coupled hyperfine states. Essential ingredients of this setting are a staggered artificial magnetic field engineered by lasers that couple the spinwave states, and a state-dependent long-range interaction, which is induced by laser-dressing a hyperfine state to a Rydberg state. The Rydberg-dressed two-body interaction gives rise to a state-dependent blockade in momentum space, and can amplify staggered-flux-induced anti-chiral edge currents in the many-body ground state in the presence of magnetic flux. When the Rydberg dressing is applied to both hyperfine states, exotic sliding insulating and superfluid/supersolid phases emerge. Due to the Rydberg-dressed long-range interaction, spinwaves slide along a leg of the momentum-space lattice without costing energy. Our study paves a route to the quantum simulation of topological phases and exotic dynamics with interacting spinwaves of atomic hyperfine states in a momentum-space lattice.

preprint, 2020, arXiv

On Optimal Locally Repairable Codes and Generalized Sector-Disk Codes

Optimal locally repairable codes with information locality are considered. Optimal codes are constructed, whose length is also order-optimal with respect to a new bound on the code length derived in this paper. The length of the constructed codes is super-linear in the alphabet size, which improves upon the well known pyramid codes, whose length is only linear in the alphabet size. The recoverable erasure patterns are also analyzed for the new codes. Based on the recoverable erasure patterns, we construct generalized sector-disk (GSD) codes, which can recover from disk erasures mixed with sector erasures in a more general setting than known sector-disk (SD) codes. Additionally, the number of sectors in the constructed GSD codes is super-linear in the alphabet size, compared with known SD codes, whose number of sectors is only linear in the alphabet size.

preprint, 2020, arXiv

Once-for-All: Train One Network and Specialize it for Efficient Deployment

We address the challenging problem of efficient inference across many devices and resource constraints, especially on edge devices. Conventional approaches either manually design or use neural architecture search (NAS) to find a specialized neural network and train it from scratch for each case, which is computationally prohibitive (causing $CO_2$ emission as much as 5 cars' lifetime) and thus unscalable. In this work, we propose to train a once-for-all (OFA) network that supports diverse architectural settings by decoupling training and search, to reduce the cost. We can quickly get a specialized sub-network by selecting from the OFA network without additional training. To efficiently train OFA networks, we also propose a novel progressive shrinking algorithm, a generalized pruning method that reduces the model size across many more dimensions than pruning (depth, width, kernel size, and resolution). It can obtain a surprisingly large number of sub-networks ($> 10^{19}$) that can fit different hardware platforms and latency constraints while maintaining the same level of accuracy as training independently. On diverse edge devices, OFA consistently outperforms state-of-the-art (SOTA) NAS methods (up to 4.0% ImageNet top-1 accuracy improvement over MobileNetV3, or same accuracy but 1.5x faster than MobileNetV3, 2.6x faster than EfficientNet w.r.t. measured latency) while reducing GPU hours and $CO_2$ emission by many orders of magnitude. In particular, OFA achieves a new SOTA 80.0% ImageNet top-1 accuracy under the mobile setting ($<$600M MACs). OFA is the winning solution for the 3rd Low Power Computer Vision Challenge (LPCVC), DSP classification track and the 4th LPCVC, both classification track and detection track. Code and 50 pre-trained models (for many devices & many latency constraints) are released at https://github.com/mit-han-lab/once-for-all.
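The abstract does not say how the sub-network count is obtained. As a back-of-the-envelope sketch, the calculation below assumes a search space of the kind described in the OFA paper: 5 units, each with 2, 3, or 4 layers, and each layer choosing among 3 kernel sizes and 3 width expansion ratios. These specific values are assumptions not stated on this page, and input resolution is ignored.

```python
# Assumed search-space parameters (not given in the abstract above).
UNITS = 5
DEPTHS = (2, 3, 4)          # possible number of layers per unit
KERNEL_SIZES = (3, 5, 7)    # per-layer kernel size choices
EXPAND_RATIOS = (3, 4, 6)   # per-layer width expansion choices

# Each layer independently picks a kernel size and an expansion ratio.
per_layer = len(KERNEL_SIZES) * len(EXPAND_RATIOS)

# A unit with depth d has per_layer**d configurations; sum over allowed depths.
per_unit = sum(per_layer**d for d in DEPTHS)

# Units are chosen independently of each other.
total = per_unit**UNITS
print(f"sub-networks under these assumptions: {total:.2e}")
```

Under these assumptions the count is on the order of $10^{19}$, consistent with the $> 10^{19}$ figure quoted in the abstract.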

preprint, 2020, arXiv

Topological phases of quantized light

Topological photonics is an emerging research area that focuses on the topological states of classical light. Here we reveal the topological phases that are intrinsic to the particle nature of light, i.e., solely related to the quantized Fock states and the inhomogeneous coupling between them. The Hamiltonian of two cavities coupled with a two-level atom is an intrinsic one-dimensional Su-Schrieffer-Heeger model of Fock states. By adding another cavity, the Fock-state lattice is extended to two dimensions with a honeycomb structure, where the strain due to the inhomogeneity of the coupling strengths induces a Lifshitz topological phase transition between a semimetal and a band insulator. In the semimetallic phase, the strain is equivalent to a pseudomagnetic field, which results in the quantization of the Landau levels and the valley Hall effect. We further construct a Haldane model where the topological phases can be characterized by the topological markers. This study demonstrates a fundamental distinction between the topological phases of bosons and fermions and provides a novel platform for studying topological physics in dimensions higher than three.