Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
13 works
0 followers
16 topics
4 close collaborators

Published work

13 published items

preprint, 2026, arXiv

Forcing-KV: Hybrid KV Cache Compression for Efficient Autoregressive Video Diffusion Models

Autoregressive (AR) video diffusion models adopt a streaming generation framework, enabling long-horizon video generation with real-time responsiveness, as exemplified by the Self Forcing training paradigm. However, existing AR video diffusion models still suffer from significant attention complexity and severe memory overhead due to the redundant key-value (KV) caches across historical frames, which limits scalability. In this paper, we tackle this challenge by introducing KV cache compression into autoregressive video diffusion. We observe that attention heads in mainstream AR diffusion models exhibit markedly distinct attention patterns and functional roles that remain stable across samples and denoising steps. Building on our empirical study of head-wise functional specialization, we divide the attention heads into two categories: static heads, which focus on transitions across autoregressive chunks and intra-frame fidelity, and dynamic heads, which govern inter-frame motion and consistency. We then propose Forcing-KV, a hybrid KV cache compression strategy that performs structured static pruning for static heads and dynamic pruning based on segment-wise similarity for dynamic heads. While maintaining output quality, our method achieves a generation speed of over 29 frames per second on a single NVIDIA H200 GPU along with 30% cache memory reduction, delivering up to 1.35x and 1.50x speedups on LongLive and Self Forcing at 480P resolution, and further scaling to 2.82x speedup at 1080P resolution. Code and demo videos are provided at https://zju-jiyicheng.github.io/Forcing-KV-Page.

preprint, 2026, arXiv

The grip of grammar on meaning uncertainty: cross-linguistic evidence, neural correlates, and clinical relevance

Isolated word meanings are inherently uncertain. This uncertainty is reduced when words are combined and anchored in context. We propose that grammar compresses meaning uncertainty cross-linguistically, and that this compression is reflected in the brain and selectively disrupted in disorders. Compression was operationalized as the relative difference between non-contextual surprisal, estimated from lexical frequency, and contextual surprisal from grammar-sensitive models. In narratives from 20 languages, contextual surprisal was lower than frequency-based surprisal. This reduction closely tracked the surprisal cost of reversing word order, and scaled with richer, non-redundant lexis as organized by more complex but optimal dependency structure. During fMRI, surprisal and its reduction explained BOLD activity for comprehension and production in overlapping but distinct regions. Uncertainty reduction was significantly attenuated in aphasia, dementia, and schizophrenia, but remained intact where the primary deficit is not language. These findings position uncertainty reduction via grammar as a foundational concept that illuminates the principles, brain basis, and disruptions of language.
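The compression measure described in this abstract can be illustrated with a minimal sketch. It assumes surprisal is measured in bits as the negative base-2 logarithm of a probability, and that the "relative difference" is normalized by the frequency-based surprisal; the abstract does not state the normalization, so that choice, the function names, and the example probabilities are all hypothetical.

```python
import math


def surprisal(probability):
    """Surprisal in bits: -log2(p). Requires 0 < p <= 1."""
    return -math.log2(probability)


def relative_compression(p_frequency, p_contextual):
    """Relative reduction from frequency-based to contextual surprisal.

    Assumption: normalized by the frequency-based surprisal, which must be
    non-zero (p_frequency < 1). The paper may define this differently.
    """
    s_freq = surprisal(p_frequency)
    s_ctx = surprisal(p_contextual)
    return (s_freq - s_ctx) / s_freq


# Hypothetical word: unigram probability 1/1024 (10 bits),
# probability in context 1/4 (2 bits).
print(relative_compression(1 / 1024, 1 / 4))  # 0.8
```

A value near 1 means context removes almost all of the word's frequency-based uncertainty; a value near 0 means context helps little.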

preprint, 2022, arXiv

Bridging the Gap: Commonality and Differences between Online and Offline COVID-19 Data

With the onset of the COVID-19 pandemic, news outlets and social media have become central tools for disseminating and consuming information. Because of their ease of access, users seek COVID-19-related information from online social media (i.e., online news) and news outlets (i.e., offline news). Online and offline news are often connected, sharing common topics while each also covers topics of its own. A gap between these two news sources can lead to misinformation propagation. For instance, according to the Guardian, most COVID-19 misinformation comes from users on social media. Without fact-checking of social media news, misinformation can lead to health threats. In this paper, we focus on the novel problem of bridging the gap between online and offline data by monitoring their common and distinct topics generated over time. We employ Twitter (online) and local news (offline) data spanning two years. Using online matrix factorization, we analyze the differences and commonalities between online and offline COVID-19-related data. We design experiments to show how online and offline data are linked together and what trends they follow.

preprint, 2022, arXiv

Long-range transport of 2D excitons with acoustic waves

Excitons are elementary optical excitations in semiconductors. The ability to manipulate and transport these quasiparticles would enable excitonic circuits and devices for quantum photonic technologies. Recently, interlayer excitons in 2D semiconductors have emerged as a promising candidate for engineering excitonic devices due to their long lifetime, large exciton binding energy, and gate tunability. However, the charge-neutral nature of the excitons leads to a weak response to the in-plane electric field and thus inhibits transport beyond the diffusion length. Here, we demonstrate the directional transport of interlayer excitons in bilayer WSe2 driven by the propagating potential traps induced by surface acoustic waves (SAW). We show that at 100 K, the SAW-driven excitonic transport is activated above a threshold acoustic power and reaches 20 μm, a distance at least ten times longer than the diffusion length and only limited by the device size. Temperature-dependent measurements reveal the transition from the diffusion-limited regime at low temperature to the acoustic field-driven regime at elevated temperature. Our work shows that acoustic waves are an effective, contact-free means to control exciton dynamics and transport, promising for realizing 2D materials-based excitonic devices such as exciton transistors, switches, and transducers up to room temperature.

preprint, 2022, arXiv

On Regularity Lemma and Barriers in Streaming and Dynamic Matching

We present a new approach for finding matchings in dense graphs by building on Szemerédi's celebrated Regularity Lemma. This allows us to obtain non-trivial albeit slight improvements over longstanding bounds for matchings in streaming and dynamic graphs. In particular, we establish the following results for $n$-vertex graphs:

* A deterministic single-pass streaming algorithm that finds a $(1-o(1))$-approximate matching in $o(n^2)$ bits of space. This constitutes the first single-pass algorithm for this problem in sublinear space that improves over the $\frac{1}{2}$-approximation of the greedy algorithm.
* A randomized fully dynamic algorithm that with high probability maintains a $(1-o(1))$-approximate matching in $o(n)$ worst-case update time per edge insertion or deletion. The algorithm works even against an adaptive adversary. This is the first $o(n)$ update-time dynamic algorithm with approximation guarantee arbitrarily close to one.

Given the use of the Regularity Lemma, the improvement obtained by our algorithms over trivial bounds is only by some $(\log^*{n})^{\Theta(1)}$ factor. Nevertheless, in each case, they show that the "right" answer to the problem is not what is dictated by the previous bounds. Finally, in the streaming model, we also present a randomized $(1-o(1))$-approximation algorithm whose space can be upper bounded by the density of certain Ruzsa-Szemerédi (RS) graphs. While RS graphs by now have been used extensively to prove streaming lower bounds, ours is the first to use them as an upper bound tool for designing improved streaming algorithms.
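For context, the greedy baseline that this abstract compares against can be sketched as a single pass that keeps an edge whenever both endpoints are still unmatched. This is the textbook maximal-matching procedure, not the paper's algorithm; the function name and the example stream are illustrative.

```python
def greedy_streaming_matching(edge_stream):
    """Single-pass greedy maximal matching over a stream of (u, v) edges.

    Keeps an edge if neither endpoint is already matched. The result is a
    maximal matching, hence at least half the size of a maximum matching.
    """
    matched_vertices = set()
    matching = []
    for u, v in edge_stream:
        if u != v and u not in matched_vertices and v not in matched_vertices:
            matching.append((u, v))
            matched_vertices.update((u, v))
    return matching


# Path a-b-c-d: the maximum matching has two edges, but an unlucky arrival
# order makes greedy keep only the middle edge.
print(greedy_streaming_matching([("b", "c"), ("a", "b"), ("c", "d")]))
# [('b', 'c')]
print(greedy_streaming_matching([("a", "b"), ("b", "c"), ("c", "d")]))
# [('a', 'b'), ('c', 'd')]
```

The first stream shows the factor-of-two loss in the worst case: greedy only stores the matching itself, but the arrival order decides how close it gets to the optimum.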

preprint, 2022, arXiv

Sublinear Algorithms for Hierarchical Clustering

Hierarchical clustering over graphs is a fundamental task in data mining and machine learning with applications in domains such as phylogenetics, social network analysis, and information retrieval. Specifically, we consider the recently popularized objective function for hierarchical clustering due to Dasgupta. Previous algorithms for (approximately) minimizing this objective function require linear time/space complexity. In many applications the underlying graph can be massive in size making it computationally challenging to process the graph even using a linear time/space algorithm. As a result, there is a strong interest in designing algorithms that can perform global computation using only sublinear resources. The focus of this work is to study hierarchical clustering for massive graphs under three well-studied models of sublinear computation which focus on space, time, and communication, respectively, as the primary resources to optimize: (1) (dynamic) streaming model where edges are presented as a stream, (2) query model where the graph is queried using neighbor and degree queries, (3) MPC model where the graph edges are partitioned over several machines connected via a communication channel. We design sublinear algorithms for hierarchical clustering in all three models above. At the heart of our algorithmic results is a view of the objective in terms of cuts in the graph, which allows us to use a relaxed notion of cut sparsifiers to do hierarchical clustering while introducing only a small distortion in the objective function. Our main algorithmic contributions are then to show how cut sparsifiers of the desired form can be efficiently constructed in the query model and the MPC model. We complement our algorithmic results by establishing nearly matching lower bounds that rule out the possibility of designing better algorithms in each of these models.
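As an illustration of the objective mentioned in this abstract, Dasgupta's cost is commonly stated as the sum, over all edges, of the edge weight times the number of leaves under the lowest tree node that contains both endpoints. The sketch below assumes a tree encoded as nested tuples with non-tuple leaf labels and a weight dictionary keyed by `frozenset` pairs; these encodings and names are hypothetical, and the code evaluates the cost only (it is not one of the paper's sublinear algorithms).

```python
def leaves(tree):
    """Return the leaf labels of a tree encoded as nested tuples."""
    if not isinstance(tree, tuple):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def dasgupta_cost(tree, weights):
    """Dasgupta's cost of a hierarchy.

    weights maps frozenset({u, v}) to the edge weight; missing pairs count
    as weight 0. An edge is charged at the node where its endpoints first
    fall into different children, scaled by that node's leaf count.
    """
    if not isinstance(tree, tuple):
        return 0.0
    child_leaves = [leaves(child) for child in tree]
    size = sum(len(group) for group in child_leaves)
    cost = 0.0
    for i in range(len(child_leaves)):
        for j in range(i + 1, len(child_leaves)):
            for u in child_leaves[i]:
                for v in child_leaves[j]:
                    cost += size * weights.get(frozenset((u, v)), 0.0)
    return cost + sum(dasgupta_cost(child, weights) for child in tree)


# Unit-weight path a-b-c-d.
path = {frozenset(pair): 1.0 for pair in [("a", "b"), ("b", "c"), ("c", "d")]}
print(dasgupta_cost((("a", "b"), ("c", "d")), path))  # 8.0
print(dasgupta_cost((("a", "c"), ("b", "d")), path))  # 12.0
```

The first tree cuts only the middle edge at the root and therefore costs less than the second, which separates every edge at the top level.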

preprint, 2022, arXiv

Variance Reduced EXTRA and DIGing and Their Optimal Acceleration for Strongly Convex Decentralized Optimization

We study stochastic decentralized optimization for the problem of training machine learning models with large-scale distributed data. We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing. The proposed VR-EXTRA requires $O((κ_s+n)\log\frac{1}ε)$ stochastic gradient evaluations and $O((κ_b+κ_c)\log\frac{1}ε)$ communication rounds to reach precision $ε$, which are the best complexities among the non-accelerated gradient-type methods, where $κ_s$ and $κ_b$ are the stochastic condition number and batch condition number for strongly convex and smooth problems, respectively, $κ_c$ is the condition number of the communication network, and $n$ is the sample size on each distributed node. The proposed VR-DIGing has a slightly higher communication cost of $O((κ_b+κ_c^2)\log\frac{1}ε)$. Our stochastic gradient computation complexities are the same as those of single-machine VR methods, such as SAG, SAGA, and SVRG, and our communication complexities remain the same as those of EXTRA and DIGing, respectively. To further speed up the convergence, we also propose the accelerated VR-EXTRA and VR-DIGing with both the optimal $O((\sqrt{nκ_s}+n)\log\frac{1}ε)$ stochastic gradient computation complexity and $O(\sqrt{κ_bκ_c}\log\frac{1}ε)$ communication complexity. Our stochastic gradient computation complexity is also the same as that of single-machine accelerated VR methods, such as Katyusha, and our communication complexity remains the same as that of accelerated full batch decentralized methods, such as MSDA.

preprint, 2020, arXiv

AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite

Domain-specific software and hardware co-design is encouraging as it is much easier to achieve efficiency for fewer tasks. Agile domain-specific benchmarking speeds up the process as it provides not only relevant design inputs but also relevant metrics and tools. Unfortunately, modern workloads like big data, AI, and Internet services dwarf the traditional ones in terms of code size, deployment scale, and execution path, and hence raise serious benchmarking challenges. This paper proposes an agile domain-specific benchmarking methodology. Together with seventeen industry partners, we identify ten important end-to-end application scenarios, among which sixteen representative AI tasks are distilled as the AI component benchmarks. We propose the permutations of essential AI and non-AI component benchmarks as end-to-end benchmarks. An end-to-end benchmark is a distillation of the essential attributes of an industry-scale application. We design and implement a highly extensible, configurable, and flexible benchmark framework, on the basis of which we propose the guideline for building end-to-end benchmarks, and present the first end-to-end Internet service AI benchmark. The preliminary evaluation shows the value of our benchmark suite, AIBench, against MLPerf and TailBench for hardware and software designers, micro-architectural researchers, and code developers. The specifications, source code, testbed, and results are publicly available from the web site http://www.benchcouncil.org/AIBench/index.html.

preprint, 2020, arXiv

An Efficient PTAS for Stochastic Load Balancing with Poisson Jobs

We give the first polynomial-time approximation scheme (PTAS) for the stochastic load balancing problem when the job sizes follow Poisson distributions. This improves upon the 2-approximation algorithm due to Goel and Indyk (FOCS'99). Moreover, our approximation scheme is an efficient PTAS whose running time is doubly exponential in $1/ε$ but nearly linear in $n$, where $n$ is the number of jobs and $ε$ is the target error. Previously, a PTAS (not efficient) was only known for jobs that obey exponential distributions (Goel and Indyk, FOCS'99). Our algorithm relies on several probabilistic ingredients, including some (seemingly) new results on scaling and the so-called "focusing effect" of the maximum of Poisson random variables, which might be of independent interest.

preprint, 2020, arXiv

Decentralized Accelerated Gradient Methods With Increasing Penalty Parameters

In this paper, we study the communication and (sub)gradient computation costs in distributed optimization and give a sharp complexity analysis for the proposed distributed accelerated gradient methods. We present two algorithms based on the framework of the accelerated penalty method with increasing penalty parameters. Our first algorithm is for smooth distributed optimization and it obtains the near optimal $O\left(\sqrt{\frac{L}{ε(1-σ_2(W))}}\log\frac{1}ε\right)$ communication complexity and the optimal $O\left(\sqrt{\frac{L}ε}\right)$ gradient computation complexity for $L$-smooth convex problems, where $σ_2(W)$ denotes the second largest singular value of the weight matrix $W$ associated to the network and $ε$ is the target accuracy. When the problem is $μ$-strongly convex and $L$-smooth, our algorithm has the near optimal $O\left(\sqrt{\frac{L}{μ(1-σ_2(W))}}\log^2\frac{1}ε\right)$ complexity for communications and the optimal $O\left(\sqrt{\frac{L}μ}\log\frac{1}ε\right)$ complexity for gradient computations. Our communication complexities are only worse by a factor of $\left(\log\frac{1}ε\right)$ than the lower bounds for the smooth distributed optimization. Our second algorithm is designed for non-smooth distributed optimization and it achieves both the optimal $O\left(\frac{1}{ε\sqrt{1-σ_2(W)}}\right)$ communication complexity and $O\left(\frac{1}{ε^2}\right)$ subgradient computation complexity, which match the communication and subgradient computation complexity lower bounds for non-smooth distributed optimization.

preprint, 2020, arXiv

Revisiting EXTRA for Smooth Distributed Optimization

EXTRA is a popular method for decentralized distributed optimization and has broad applications. This paper revisits EXTRA. First, we give a sharp complexity analysis for EXTRA with the improved $O\left(\left(\frac{L}μ+\frac{1}{1-σ_2(W)}\right)\log\frac{1}{ε(1-σ_2(W))}\right)$ communication and computation complexities for $μ$-strongly convex and $L$-smooth problems, where $σ_2(W)$ is the second largest singular value of the weight matrix $W$. When the strong convexity is absent, we prove the $O\left(\left(\frac{L}ε+\frac{1}{1-σ_2(W)}\right)\log\frac{1}{1-σ_2(W)}\right)$ complexities. Then, we use the Catalyst framework to accelerate EXTRA and obtain the $O\left(\sqrt{\frac{L}{μ(1-σ_2(W))}}\log\frac{ L}{μ(1-σ_2(W))}\log\frac{1}ε\right)$ communication and computation complexities for strongly convex and smooth problems and the $O\left(\sqrt{\frac{L}{ε(1-σ_2(W))}}\log\frac{1}{ε(1-σ_2(W))}\right)$ complexities for non-strongly convex ones. Our communication complexities of the accelerated EXTRA are only worse by the factors of $\left(\log\frac{L}{μ(1-σ_2(W))}\right)$ and $\left(\log\frac{1}{ε(1-σ_2(W))}\right)$ from the lower complexity bounds for strongly convex and non-strongly convex problems, respectively.
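For orientation, the basic (non-accelerated) EXTRA iteration can be sketched as below. The abstract does not state the update rule; the sketch assumes the commonly cited form x^{k+2} = (I + W) x^{k+1} - W̃ x^k - α(∇f(x^{k+1}) - ∇f(x^k)) with W̃ = (I + W)/2 and first step x^1 = W x^0 - α∇f(x^0). The scalar local objectives, the three-node mixing matrix, the step size, and all names are illustrative assumptions, not taken from the paper.

```python
def mat_vec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def extra(W, local_grads, x0, alpha, iterations):
    """Run EXTRA with one scalar variable per node (assumed update form).

    W is a symmetric doubly stochastic mixing matrix; local_grads[i] is the
    gradient of node i's private objective.
    """
    n = len(x0)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    I_plus_W = [[identity[i][j] + W[i][j] for j in range(n)] for i in range(n)]
    W_tilde = [[0.5 * I_plus_W[i][j] for j in range(n)] for i in range(n)]

    def grad(x):
        return [g(xi) for g, xi in zip(local_grads, x)]

    x_prev = list(x0)
    g_prev = grad(x_prev)
    mixed = mat_vec(W, x_prev)
    x_curr = [mixed[i] - alpha * g_prev[i] for i in range(n)]  # first step
    for _ in range(iterations):
        g_curr = grad(x_curr)
        a = mat_vec(I_plus_W, x_curr)
        b = mat_vec(W_tilde, x_prev)
        x_next = [a[i] - b[i] - alpha * (g_curr[i] - g_prev[i]) for i in range(n)]
        x_prev, g_prev, x_curr = x_curr, g_curr, x_next
    return x_curr


# Node i holds f_i(x) = 0.5 * (x - t_i)^2, so the network-wide minimizer
# is the mean of the targets.
targets = [1.0, 2.0, 6.0]
grads = [lambda x, t=t: x - t for t in targets]
W = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
print(extra(W, grads, [0.0, 0.0, 0.0], alpha=0.5, iterations=100))
```

Each node only combines its neighbors' values through W and its own gradient, yet all three entries should approach the mean of the targets (3.0 here); the gradient-difference term is what lets a constant step size reach the exact consensus optimum.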

preprint, 2020, arXiv

Structural transition, metallization and superconductivity in quasi 2D layered PdS$_2$ under compression

Based on first-principles simulations and calculations, we explore the evolution of the crystal structure, electronic structure and transport properties of quasi 2D layered PdS2 under uniaxial stress and hydrostatic pressure. The coordination of the Pd ions plays a crucial role in the structural transition, electronic structure and transport properties of PdS2. An interesting ferroelastic phase transition with lattice reorientation is revealed under uniaxial compressive stress, which originates from the bond reconstructions of the unusual PdS4 square-planar coordination. By contrast, the layered structure transforms to a 3D cubic pyrite-type structure under hydrostatic pressure. In contrast to the experimentally proposed coexistence of the layered PdS2-type structure with the cubic pyrite-type structure in the intermediate pressure range, we predict that the compression-induced intermediate phase has the same structural symmetry as the ambient phase, except for sharply contracted interlayer distances. The coordination environment of the Pd ions changes from square-planar to distorted octahedral in the intermediate phase, which results in bandwidth broadening and orbital-selective metallization. In addition, the superconductivity originates from topological nodal-line states protected by the cubic pyrite-type structure. The strong correlations between structural transition, electronic structure and transport properties in PdS2 provide a platform to study the fundamental physics of the interplay between crystal structure and transport behavior, and the competition between diverse phases.

preprint, 2019, arXiv

Valence transition in topological Kondo insulator

We investigate the valence transition in three-dimensional topological Kondo insulators through a slave-boson analysis of the periodic Anderson model. By including the effect of the intra-atomic Coulomb correlation $U_{fc}$ between conduction and local electrons, we find that, above a critical $U_{fc}$, a first-order valence transition from the Kondo region to mixed valence occurs as the local level rises, and this valence transition usually occurs very close to or simultaneously with a topological transition. Near the parameter region of the zero-temperature valence transition, a rise in temperature can generate a thermal valence transition from mixed valence to the Kondo region, accompanied by a first-order topological transition. Remarkably, above a critical $U_{fc}$ that is considerably smaller than the one generating the paramagnetic valence transition, the originally continuous antiferromagnetic transition becomes first order, and a discontinuous valence shift takes place at this transition. As $U_{fc}$ increases, the paramagnetic valence transition approaches and then converges with the first-order antiferromagnetic transition, leaving a significant valence shift on the magnetic boundary. The continuous antiferromagnetic transition, first-order antiferromagnetic transition, paramagnetic valence transition and topological transitions are all summarized in a global phase diagram. Our proposed exotic transition processes can help to understand the thermal valence variation as well as the valence shift around the pressure-induced magnetic transition in topological Kondo insulator candidates and in other heavy-fermion systems.