Source author record

Qichen Liao

Qichen Liao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Artificial Intelligence Machine Learning math.OC

Catalog footprint

What is connected

2works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step-converting low-bit weights back to high-precision for matrix multiplication has become a critical bottleneck on modern AI accelerators. On architectures with decoupled compute units (e.g., Ascend NPUs), dequantization operations can consume more cycles than the matrix multiplication itself, leaving the high-throughput tensor cores underutilized. This paper presents Multi-Scale Dequant (MSD), a quantization framework that removes weight/KV dequantization from the GEMM critical path. Instead of lifting low-bit weights to BF16 precision, MSD decomposes high-precision BF16 activations into multiple low-precision components, each of which can be multiplied directly with quantized weights via native hardware-accelerated GEMM. This approach shifts the computational paradigm from precision conversion to multi-scale approximation, avoiding INT8-to-BF16 weight conversion before GEMM. We instantiate MSD for two weight formats and derive tight error bounds for each. For INT8 weights (W4A16), two-pass INT8 decomposition achieves near 16 effective bits. For MXFP4 weights (W4A16), two-pass MXFP4 decomposition yields near 6.6 effective bits with error bound 1/64 per block surpassing single-pass MXFP8(5.24 bits) while maintaining the same effective GEMM compute time. We further derive closed-form latency and HBM traffic models showing that MSD avoids the Vector-Cube pipeline stall caused by dequantization and reduces KV cache HBM traffic by up to 2.5 times in attention. Numerical simulations on matrix multiplication and Flash Attention kernels confirm that MSD does not degrade accuracy compared to dequantization baselines, and in many settings achieves lower L2 error.

preprint2022arXiv

Fast Sinkhorn II: Collinear Triangular Matrix and Linear Time Accurate Computation of Optimal Transport

In our previous work [arXiv:2202.10042], the complexity of Sinkhorn iteration is reduced from $O(N^2)$ to the optimal $O(N)$ by leveraging the special structure of the kernel matrix. In this paper, we explore the special structure of kernel matrices by defining and utilizing the properties of the Lower-ColLinear Triangular Matrix (L-CoLT matrix) and Upper-ColLinear Triangular Matrix (U-CoLT matrix). We prove that (1) L/U-CoLT matrix-vector multiplications can be carried out in $O(N)$ operations; (2) both families of matrices are closed under the Hadamard product and matrix scaling. These properties help to alleviate two key difficulties for reducing the complexity of the Inexact Proximal point method (IPOT), and allow us to significantly reduce the number of iterations to $O(N)$. This yields the Fast Sinkhorn II (FS-2) algorithm for accurate computation of optimal transport with low algorithm complexity and fast convergence. Numerical experiments are presented to show the effectiveness and efficiency of our approach.