Researcher profile

Liang Luo

Liang Luo contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
9 works
0 followers
9 topics
4 close collaborators

Published work

9 published items

preprint · 2026 · arXiv

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

preprint · 2023 · arXiv

Low-energy electrodynamics of infinite-layer nickelates: evidence for d-wave superconductivity in the dirty limit

The discovery of superconductivity in infinite-layer nickelates establishes a new category of unconventional superconductors that share structural and electronic similarities with cuprates. Despite exciting advances, such as the establishment of a cuprate-like phase diagram and the observation of charge order and short-range antiferromagnetic fluctuations, the key issues of superconducting pairing symmetry, gap amplitude, and superconducting fluctuations remain unresolved. In this work, we use static and ultrafast terahertz spectroscopy to address these outstanding problems. We demonstrate that the equilibrium terahertz conductivity and nonequilibrium terahertz responses of an optimally Sr-doped nickelate film ($T_\mathrm{c}$ = 17 K) are in line with the electrodynamics of $d$-wave superconductivity in the dirty limit. The gap-to-$T_\mathrm{c}$ ratio $2Δ/k_\mathrm{B}T_\mathrm{c}$ is extracted to be 3.4, indicating that the superconductivity falls in the weak-coupling regime. In addition, we observe significant superconducting fluctuations near $T_\mathrm{c}$, although they do not extend deep into the normal state as they do in optimally hole-doped cuprates. Our results highlight a new $d$-wave system that closely resembles the electron-doped cuprates, expanding the family of unconventional superconductors in oxides.
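
As a rough worked example of what the quoted ratio implies, the numbers in the abstract (a ratio of 3.4 and $T_\mathrm{c}$ = 17 K) can be converted into an energy gap and an equivalent frequency. This is only back-of-the-envelope arithmetic that takes the quoted ratio at face value. The constants $k_\mathrm{B}$ and $h$ are standard values and are not given in the abstract. For comparison, the familiar weak-coupling BCS value for an $s$-wave gap is about 3.53.

```latex
% Constants (not from the abstract): k_B = 8.617e-2 meV/K, h = 4.136 meV/THz.
2\Delta = 3.4\, k_\mathrm{B} T_\mathrm{c}
        = 3.4 \times \left(8.617\times10^{-2}\ \mathrm{meV/K}\right) \times 17\ \mathrm{K}
        \approx 4.98\ \mathrm{meV},
\qquad
\Delta \approx 2.5\ \mathrm{meV}.

% Equivalent frequency of the full gap 2\Delta:
\frac{2\Delta}{h} \approx \frac{4.98\ \mathrm{meV}}{4.136\ \mathrm{meV/THz}}
                  \approx 1.2\ \mathrm{THz}.
```

Under these assumptions the gap corresponds to a frequency of roughly 1.2 THz, which is consistent with terahertz spectroscopy being a natural probe for this energy scale.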

preprint · 2022 · arXiv

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the practical performance of those designs can vary from dataset to dataset, even when the order of interactions claimed to be captured is the same. This indicates that different designs may have different advantages and that the interactions they capture carry non-overlapping information. Motivated by this observation, we propose DHEN, a deep and hierarchical ensemble architecture that can leverage the strengths of heterogeneous interaction modules and learn a hierarchy of interactions of different orders. To overcome the training challenge posed by DHEN's deeper, multi-layer structure, we propose a novel co-designed training system that further improves the training efficiency of DHEN. Experiments with DHEN on a large-scale dataset from CTR prediction tasks attained a 0.27% improvement in the Normalized Entropy (NE) of prediction and 1.2x better training throughput than the state-of-the-art baseline, demonstrating their effectiveness in practice.
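
The abstract reports its gain in Normalized Entropy (NE) without defining it. The sketch below assumes the definition commonly used in CTR prediction: the average log loss divided by the entropy of the empirical click rate. The function name `normalized_entropy` and the `eps` clipping parameter are illustrative choices, not taken from the paper, and the paper's exact metric may differ in detail.

```python
import math


def normalized_entropy(labels, predictions, eps=1e-12):
    """Average log loss divided by the entropy of the empirical click rate.

    Assumed (commonly used) definition; lower values are better, and a model
    that always predicts the empirical click rate scores exactly 1.0.
    """
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")

    n = len(labels)
    log_loss = 0.0
    for y, p in zip(labels, predictions):
        # Clip predictions so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        log_loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    log_loss /= n

    base_rate = sum(labels) / n
    if base_rate <= 0.0 or base_rate >= 1.0:
        raise ValueError("labels must contain both clicks and non-clicks")
    base_entropy = -(
        base_rate * math.log(base_rate)
        + (1.0 - base_rate) * math.log(1.0 - base_rate)
    )
    return log_loss / base_entropy


if __name__ == "__main__":
    labels = [1, 0, 0, 0]
    # Predicting the base rate everywhere gives the reference score.
    print(normalized_entropy(labels, [0.25] * 4))
    # Sharper, correct predictions push the score below the reference.
    print(normalized_entropy(labels, [0.9, 0.1, 0.1, 0.1]))
```

Because lower NE is better under this definition, an "improvement" in NE means a reduction relative to the baseline's value.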

preprint · 2022 · arXiv

Srifty: Swift and Thrifty Distributed Training on the Cloud

Finding the best VM configuration is key to achieving lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training and present results of a comprehensive throughput and cost-efficiency study we conducted across a wide array of instances to prune the VM search space. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous setups and spot instances. We integrated Srifty with PyTorch and evaluated it on Amazon EC2. We conducted a large-scale generalization study of Srifty across more than 2K training setups on EC2. Our results show that Srifty achieves an iteration latency prediction error of 8%, and that its VM instance recommendations offer significant throughput gains and cost reductions compared to existing solutions, while satisfying user constraints in complex, real-world scenarios.

preprint · 2022 · arXiv

Stochastic threshold in cell size control

Classic models of cell size control assume that cells divide upon reaching a threshold, e.g., in size, age, or size extension. The molecular basis of the threshold involves multiple layers of regulation as well as gene noise. In this work, we study the cell cycle as a first-passage problem with a stochastic threshold and find that such stochasticity affects the inter-division statistics, which blurs the criteria used to distinguish the types of size control models. The analytic results show that autocorrelation in the threshold can drive a sizer model to adder-like and even timer-like inter-division statistics, which is supported by simulations. Following the picture that autocorrelation in the threshold can propagate to the inter-division statistics, we further show that the adder model can be driven to a timer-like one by a positively autocorrelated threshold, and even to a sizer-like one when the threshold is negatively autocorrelated. This work highlights the importance of examining gene noise in size control.

preprint · 2020 · arXiv

Ergodicity recovery of random walk in heterogeneous disordered media

Significant and persistent trajectory-to-trajectory variance is commonly observed in particle tracking experiments and has become a major challenge for experimental data analysis. In this theoretical paper, we investigate the ergodicity recovery behavior, which helps to clarify the origin and the convergence of trajectory-to-trajectory fluctuations in various heterogeneous disordered media. The concepts of self-averaging and ergodicity are revisited in the context of trajectory analysis. The slow ergodicity recovery and the non-Gaussian diffusion in annealed disordered media are shown to be consequences of the central limit theorem in different situations. A strange ergodicity recovery behavior is reported in the quenched disordered case, which arises from a localization mechanism. The first-passage approach is introduced to the ergodicity analysis for this case, in which the central limit theorem can be employed and ergodicity is recovered on the length scale of the diffusivity correlation.

preprint · 2020 · arXiv

Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show that existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks. We propose PBox, a scalable central parameter server (PS) hardware design that balances compute and communication resources, and PHub, a high performance PS software design that provides an optimized network stack and a streamlined gradient processing pipeline, allowing common PS setups to make use of PBox. We show that in a typical cloud environment, PBox can achieve up to 3.8x speedup over state-of-the-art designs when training ImageNet. We discuss future directions of integrating PBox with programmable switches for in-network aggregation during training, leveraging the datacenter network topology to reduce bandwidth usage and localize data movement.

preprint · 2020 · arXiv

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high performance multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7x compared to state-of-the-art distributed training techniques for cloud-based ImageNet workloads, with 25% better throughput per dollar.

preprint · 2020 · arXiv

Quenched trap model on the extreme landscape: the rise of sub-diffusion and non-Gaussian diffusion

Non-Gaussian diffusion, which reflects the dynamic heterogeneity in disordered media, has been intensively studied in recent years. A recent study of non-Gaussian diffusion in a static disordered landscape suggests novel phenomena due to the quenched disorder. In this work, we further investigate the random walk in this landscape at various values of the effective temperature $μ$, which continuously modulates the dynamic heterogeneity. We show that, in the long-time limit, the trap dynamics on the landscape is equivalent to the quenched trap model, in which sub-diffusion appears for $μ<1$. The non-Gaussian distribution of displacement is estimated analytically for short $t$, for which a stretched exponential tail is expected for $μ\neq1$. Due to localization in the ensemble of trajectory segments, an additional peak arises in $P(x,t)$ around $x=0$ even for $μ>1$. Evolving on different time scales, the peak and the tail of $P(x,t)$ are well separated for a wide range of $t$. This theoretical study reveals the connections among sub-diffusion, non-Gaussian diffusion, and dynamic heterogeneity in a static disordered medium. It also offers insight into how the cell could benefit from quasi-static disordered structures.
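
To illustrate why $μ<1$ marks the onset of sub-diffusion in a trap model, the following sketch uses the standard textbook construction: exponentially distributed trap depths with thermally activated escape. This setup is an assumption made for illustration. The symbols $E_0$, $T$ and $\tau_0$ do not appear in the abstract, and the paper's "extreme landscape" may be constructed differently.

```latex
% Assumed setup (textbook trap model, not taken from the paper):
% trap depths E are exponentially distributed with scale E_0,
% and escape is thermally activated with attempt time \tau_0.
\rho(E) = \frac{1}{E_0}\, e^{-E/E_0}, \qquad
\tau(E) = \tau_0\, e^{E/k_\mathrm{B}T}, \qquad
\mu \equiv \frac{k_\mathrm{B}T}{E_0}.

% Survival probability of the waiting time, for t \ge \tau_0:
\Pr(\tau > t) = \Pr\!\left(E > k_\mathrm{B}T \ln\frac{t}{\tau_0}\right)
              = \left(\frac{t}{\tau_0}\right)^{-\mu}.

% Waiting-time density and its mean (cases requires amsmath):
\psi(\tau) = \frac{\mu\,\tau_0^{\mu}}{\tau^{1+\mu}}, \qquad
\langle\tau\rangle = \int_{\tau_0}^{\infty} \tau\,\psi(\tau)\,d\tau
= \begin{cases}
    \dfrac{\mu}{\mu-1}\,\tau_0, & \mu > 1,\\[4pt]
    \infty, & \mu \le 1.
  \end{cases}
```

Under these assumptions the mean waiting time diverges for $μ\le1$, so no finite time scale per step exists and the walker spreads more slowly than in normal diffusion.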