Researcher profile

Kunyang Li

Kunyang Li contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified | Verification L1 | Unclaimed author
5 works
0 followers
3 topics
4 close collaborators

Published work

5 published items

preprint | 2026 | arXiv

Aero-World: Action-Conditioned Aerial Video Generation from Inertial Controls

Foundation video models produce visually impressive results, but their use in embodied AI remains limited because they are primarily trained on natural language rather than low-level control signals. This limitation is especially pronounced for aerial flight, where motion occurs in unconstrained 6-DoF space and small errors in ego-motion can produce large trajectory drift. Generating aerial videos that follow fine-grained inertial actions can support scalable training and evaluation of aerial agents by providing a controllable proxy for real-world or expensive simulation data. To address this problem, we propose Aero-World, a method for converting a pretrained image-to-video diffusion model into a controllable aerial video generator. Aero-World injects sequences of translational acceleration and angular velocity into a pretrained latent diffusion transformer through an action-token stream. A frozen latent-space Physics Probe, trained independently on real video-IMU pairs, provides differentiable inertial-consistency supervision during LoRA finetuning while avoiding computationally expensive video decoding. We further propose AeroBench, a benchmark for evaluating whether generated drone videos adhere to low-level action signals. AeroBench uses Action Alignment Score (AAS) to measure agreement with commanded inertial actions and Physical Consistency Rate (PCR) to measure temporal motion stability. On AeroBench, Aero-World improves mean AAS from 57.7 to 63.6 over action-only finetuning and gives a stronger quality-control trade-off than AirScape, with lower FVD (596.5 vs. 1058.6), higher SSIM (0.595 vs. 0.505), and higher Flow-IMU correlation (0.44 vs. 0.20). These results suggest that frozen Physics Probe supervision is a practical mechanism for adapting pretrained video generators toward more action-aligned aerial motion.

preprint | 2026 | arXiv

Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and memory usage due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens share the same pre-update state rather than sequential updates. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to 2.26x wall-clock speedup and 54% memory reduction, while maintaining comparable quality with improved temporal consistency.
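The two-branch idea in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the single-head layout, the constant scalar `gate`, and the simple sum of the two branches are all assumptions made for illustration, and the denoising loop is omitted (the state is simply updated once per finished frame).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def intra_frame_softmax(qs, ks, vs):
    """Softmax attention restricted to the tokens of one frame."""
    scale = 1.0 / math.sqrt(len(qs[0]))
    out = []
    for q in qs:
        w = softmax([dot(q, k) * scale for k in ks])
        out.append([sum(wi * v[j] for wi, v in zip(w, vs))
                    for j in range(len(vs[0]))])
    return out

def read_state(q, state):
    """Linear-attention read q^T S, with S a fixed d_k x d_v matrix."""
    return [sum(q[i] * state[i][j] for i in range(len(q)))
            for j in range(len(state[0]))]

def update_state(state, ks, vs, gate):
    """Gated recurrent update: S <- gate * S + sum_t k_t v_t^T."""
    new = [[gate * s for s in row] for row in state]
    for k, v in zip(ks, vs):
        for i, ki in enumerate(k):
            for j, vj in enumerate(v):
                new[i][j] += ki * vj
    return new

def hybrid_attention(frames, d_k, d_v, gate=0.9):
    """frames: list of (queries, keys, values), one tuple per frame."""
    state = [[0.0] * d_v for _ in range(d_k)]  # size independent of frame count
    outputs = []
    for qs, ks, vs in frames:
        local = intra_frame_softmax(qs, ks, vs)
        # Every token of the frame reads the same pre-update state.
        memory = [read_state(q, state) for q in qs]
        outputs.append([[a + b for a, b in zip(l, m)]
                        for l, m in zip(local, memory)])
        # The state is written only after the frame is finished.
        state = update_state(state, ks, vs, gate)
    return outputs, state
```

In this sketch the memory footprint of `state` stays at `d_k * d_v` entries however many frames are processed, which is the property the abstract contrasts with a linearly growing KV cache.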

preprint | 2026 | arXiv

PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, on the final four frames (the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip), PackCache delivers 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.
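As a rough illustration of condition anchoring and distance-based budget allocation, the sketch below selects which cache entries to keep. The abstract does not specify the decay form or the per-token selection rule, so the exponential `decay`, the `min_keep` floor, the per-token importance scores, and the function name `compact_cache` are all hypothetical choices, not PackCache's actual procedure.

```python
def compact_cache(num_cond_tokens, frame_scores, decay=0.5, min_keep=1):
    """Return kept cache entries as ("cond", i) or (frame_index, position).

    frame_scores[f][p] is an assumed importance score (for example,
    accumulated attention) for token p of past frame f, oldest frame first.
    """
    # Condition anchoring: text and conditioning-image tokens are never evicted.
    kept = [("cond", i) for i in range(num_cond_tokens)]
    n = len(frame_scores)
    for f, scores in enumerate(frame_scores):
        distance = n - 1 - f  # 0 for the most recent frame
        # Assumed exponential decay of the per-frame budget with distance.
        budget = max(min_keep, int(len(scores) * decay ** distance))
        top = sorted(range(len(scores)),
                     key=lambda p: scores[p], reverse=True)[:budget]
        # Keep the original token positions so spatial layout is retained.
        kept.extend((f, p) for p in sorted(top))
    return kept
```

With three past frames of four tokens each and `decay=0.5`, this keeps 1, 2, and 4 tokens from the oldest to the newest frame, plus every condition token. Retaining the original `(frame_index, position)` pairs is only a stand-in for the spatially preserving position embedding mentioned in the abstract.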

preprint | 2022 | arXiv

Massive Black Hole Binaries from the TNG50-3 Simulation: I. Coalescence and LISA Detection Rates

We evaluate the cosmological coalescence and detection rates for massive black hole (MBH) binaries targeted by the gravitational wave observatory Laser Interferometer Space Antenna (LISA). Our calculation starts with a population of gravitationally unbound MBH pairs, drawn from the TNG50-3 cosmological simulation, and follows their orbital evolution from kpc scales all the way to coalescence using a semi-analytic model developed in our previous work. We find that, for a majority of MBH pairs that coalesce within a Hubble time, dynamical friction is the most important mechanism that determines their coalescence rate. Our model predicts an MBH coalescence rate < 0.45/yr and a LISA detection rate < 0.34/yr. Most LISA detections should originate from MBHs of 10^6 - 10^6.8 solar masses in gas-rich galaxies at redshifts 1.6 < z < 2.4, and have a characteristic signal-to-noise ratio SNR ~ 100. However, we find a dramatic reduction in the coalescence and detection rates, as well as the average SNR, if the effects of radiative feedback from accreting MBHs are taken into account. In this case, the MBH coalescence rate is reduced by 78% (to < 0.1/yr), and the LISA detection rate is reduced by 94% (to 0.02/yr), whereas the average SNR is ~ 10. We emphasize that our model provides a lower limit on the LISA detection rate, consistent with other works in the literature that draw their MBH pairs from cosmological simulations.
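The quoted percentage reductions can be checked against the stated rates with simple arithmetic. The check below applies the reductions to the bounding values 0.45/yr and 0.34/yr, which is an assumption about how the figures in the abstract relate to one another.

```latex
\begin{align*}
  0.45\ \mathrm{yr^{-1}} \times (1 - 0.78) &= 0.099\ \mathrm{yr^{-1}} < 0.1\ \mathrm{yr^{-1}}, \\
  0.34\ \mathrm{yr^{-1}} \times (1 - 0.94) &= 0.0204\ \mathrm{yr^{-1}} \approx 0.02\ \mathrm{yr^{-1}}.
\end{align*}
```

Both results agree with the values given in the abstract.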

preprint | 2022 | arXiv

Massive Black Hole Binaries from the TNG50-3 Simulation: II. Using Dual AGNs to Predict the Rate of Black Hole Mergers

Dual active galactic nuclei (dAGNs) trace the population of post-merger galaxies and are the precursors to massive black hole (MBH) mergers, an important source of gravitational waves that may be observed by LISA. In Paper I of this series, we used the population of nearly 2000 galaxy mergers predicted by the TNG50-3 simulation to seed semi-analytic models of the orbital evolution and coalescence of MBH pairs with initial separations of about 1 kpc. Here, we calculate the dAGN luminosities and separations of these pairs as they evolve in post-merger galaxies, and show how the coalescence fraction of dAGNs changes with redshift. We find that, because of the several-Gyr-long dynamical friction timescale for orbital evolution, the fraction of dAGNs that eventually end in an MBH merger grows with redshift and does not pass 50% until a redshift of 1. However, dAGNs in galaxies with bulge masses > 10^10 solar masses, or comprised of near-equal mass MBHs, evolve more quickly and have higher than average coalescence fractions. At any redshift, dAGNs observed with small separations (< 0.7 kpc) have a higher probability of merging in a Hubble time than more widely separated systems. As found in Paper I, radiation feedback effects can significantly reduce the number of MBH mergers, and this could be manifested as a larger than expected number of widely separated dAGNs. We present a method to estimate the MBH coalescence rate as well as the potential LISA detection rate given a survey of dAGNs. Comparing these rates to the eventual LISA measurements will help determine the efficiency of dynamical friction in post-merger galaxies.