Researcher profile

Zhixin Wang

Zhixin Wang contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) · Verification L1 · Unclaimed author
11 works
0 followers
8 topics
4 close collaborators

Published work

11 published items

preprint · 2026 · arXiv

Accelerating Rectified Flow Models via Trajectory-Aware Caching

Diffusion and rectified flow (RF) models generate high-fidelity images and videos, but their iterative velocity-field evaluations are computationally expensive. Existing caching methods accelerate sampling by skipping timesteps, yet their coarse approximations introduce accumulated errors over long skip intervals and degrade quality under aggressive acceleration. We propose TACache (Trajectory-Aware Cache), a training-free acceleration framework following a skip-then-compensate paradigm. TACache performs an orthogonal decomposition of discrete velocity acceleration along the RF trajectory into a parallel component and an orthogonal residual, isolating the magnitude and directional sources of per-step approximation error. The framework operates in two stages: offline, cumulative variation thresholds on the magnitude and direction indicators yield the skip schedule and bound how far each skip interval may extend; online, at each skipped step the offline statistics are combined with the sample's historical orthogonal direction to reconstruct the skipped velocity without additional model evaluations. Experiments on BAGEL, FLUX.1-dev, and Wan2.1-1.3B show that TACache achieves up to 4.14$\times$ speedup on text-to-image generation and 2.11$\times$ speedup on text-to-video generation, with consistent improvements over prior cache-based methods on all reference-based fidelity metrics. Code will be released soon.
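The decomposition described in this abstract can be illustrated with a small sketch. The abstract does not give the exact definitions, so two choices here are assumptions: the discrete acceleration is taken as the difference between consecutive velocities, and the previous velocity is used as the reference direction. The function name `decompose_acceleration` is hypothetical and is not taken from the TACache code.

```python
def decompose_acceleration(v_prev, v_next):
    """Split a = v_next - v_prev into a part parallel to v_prev
    and an orthogonal residual (assumed definitions, see lead-in)."""
    a = [n - p for p, n in zip(v_prev, v_next)]
    norm_sq = sum(p * p for p in v_prev)
    if norm_sq == 0.0:
        # No reference direction: treat the whole change as residual.
        return [0.0] * len(a), a
    # Projection coefficient of a onto v_prev.
    coeff = sum(ai * pi for ai, pi in zip(a, v_prev)) / norm_sq
    parallel = [coeff * pi for pi in v_prev]
    orthogonal = [ai - qi for ai, qi in zip(a, parallel)]
    return parallel, orthogonal


# Toy 2-D velocities; real velocity fields are high-dimensional tensors.
v_prev = [3.0, 0.0]
v_next = [4.0, 2.0]
parallel, orthogonal = decompose_acceleration(v_prev, v_next)
print(parallel, orthogonal)
```

In this reading, the parallel part captures the change in magnitude along the current direction, and the orthogonal residual captures the change in direction.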

preprint · 2026 · arXiv

Fast Image Super-Resolution via Consistency Rectified Flow

Diffusion models (DMs) have demonstrated remarkable success in real-world image super-resolution (SR), yet their reliance on time-consuming multi-step sampling largely hinders their practical applications. While recent efforts have introduced few- or single-step solutions, existing methods either inefficiently model the process from noisy input or fail to fully exploit iterative generative priors, compromising the fidelity and quality of the reconstructed images. To address this issue, we propose FlowSR, a novel approach that reformulates the SR problem as a rectified flow from low-resolution (LR) to high-resolution (HR) images. Our method leverages an improved consistency learning strategy to enable high-quality SR in a single step. Specifically, we refine the original consistency distillation process by incorporating HR regularization, ensuring that the learned SR flow not only enforces self-consistency but also converges precisely to the ground-truth HR target. Furthermore, we introduce a fast-slow scheduling strategy, where adjacent timesteps for consistency learning are sampled from two distinct schedulers: a fast scheduler with fewer timesteps to improve efficiency, and a slow scheduler with more timesteps to capture fine-grained texture details. Extensive experiments demonstrate that FlowSR achieves outstanding performance in both efficiency and image quality.

preprint · 2026 · arXiv

G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction for improving the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on attention scores, text-image similarity, and similar signals, implicitly assuming that the final objective is discriminative reasoning. This assumption does not hold for UMMs, where understanding-side visual tokens must also preserve the model's capabilities for editing images. We propose G$^2$TR, a generation-guided visual token reduction framework for separate-encoder UMMs. Our key insight is that the generation branch provides a task-agnostic signal for identifying understanding-side visual tokens that are not only semantically relevant but also important for latent-space image reconstruction and generation. G$^2$TR estimates token importance from consistency with the VAE latent, performs balanced token selection, and merges redundant tokens into retained representatives to reduce information loss. The method is training-free, plug-and-play, and applied only after the understanding encoding stage, making it compatible with existing UMM inference pipelines. Experiments on image understanding and editing benchmarks show that G$^2$TR substantially reduces visual tokens and prefill computation by 1.94x while maintaining both reasoning accuracy and editing quality, outperforming baselines on almost all benchmarks. Code is at: https://github.com/lijunxian111/G2TR.

preprint · 2026 · arXiv

How Much is Brain Data Worth for Machine Learning?

If a person can solve a task, can measuring their brain make it easier to train a model to solve that task too? Recent NeuroAI work suggests that supplementing task training with neural recordings can modestly improve model performance and robustness. However, it is unclear when there should be a benefit from using neural data and how much benefit to expect. We formulate this question mathematically, and begin to address it theoretically using a simple, analytically tractable linear Gaussian model of task targets and neural recordings. For a multimodal estimator trained on both brain data and task labels, we derive scaling laws for how performance scales with the numbers of brain and task samples. From these laws we derive relative value and exchange rates between brain samples and task samples, quantifying how many extra task samples neural data is worth as a function of task-brain alignment, neural and task noise, latent dimension, and brain data sample size. We also analyze test distribution shift, to identify conditions where brain-regularized learning can produce substantial robustness gains through learned invariances. Finally, under a fixed collection budget, we characterize the regimes in which brain data is worth collecting. Our results provide a foundation for understanding how valuable brain data could be for improving machine learning.

preprint · 2026 · arXiv

PermuQuant: Lowering Per-Group Quantization Error by Reordering Channels for Diffusion Models

Large-scale visual generative models have achieved remarkable performance. However, their high computational and memory costs make deployment challenging in resource-constrained scenarios, such as interactive applications and personal single-GPU usage. Post-training quantization (PTQ) offers a practical solution by compressing pretrained models without expensive retraining. However, existing PTQ methods still suffer from severe quality degradation under extremely low-bit settings. In this paper, we identify channel ordering as an important but underexplored factor in per-group quantization. In this setting, each contiguous group shares one quantization scale. When channels with very different statistics are placed in the same group, the scale can be dominated by outliers and cause large quantization errors. Based on this observation, we propose PermuQuant, a simple and effective PTQ framework for low-bit diffusion models. PermuQuant sorts channels by a joint second-moment criterion before per-group quantization, placing channels with similar activation and weight statistics into the same group. It further uses a calibration-based acceptance rule to apply reordering only when the selected permutation reduces quantization error on calibration data. The selected permutations are absorbed into adjacent modules or applied to weights offline, avoiding explicit runtime permutation operations. Extensive experiments on multiple large diffusion models show that PermuQuant consistently reduces quantization error and outperforms existing PTQ baselines. On FLUX.1-dev with an RTX 5090, PermuQuant achieves up to a 1.8$\times$ single-step speedup and reduces the DiT memory footprint by 3.5$\times$ under W4A4 NVFP4 quantization. Code will be available at https://github.com/yscheng04/PermuQuant.
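A toy sketch can show why channel order matters when each contiguous group shares one scale. It is not the PermuQuant implementation. It makes three assumptions: symmetric uniform quantization with 7 positive levels stands in for NVFP4, the squared channel value stands in for the joint second-moment criterion, and the squared error on the same toy data stands in for calibration data. All function names are hypothetical.

```python
def quantize_group(values, levels=7):
    """Symmetric uniform quantization with one shared scale per group."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def group_quant_error(values, group_size):
    """Sum of squared errors when each contiguous group shares one scale."""
    err = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        dequantized = quantize_group(group)
        err += sum((v - q) ** 2 for v, q in zip(group, dequantized))
    return err


def choose_order(values, group_size):
    """Sort channels by squared value, but accept the new order only
    if it lowers the error (a stand-in for the acceptance rule)."""
    identity = list(range(len(values)))
    by_magnitude = sorted(identity, key=lambda i: values[i] ** 2)
    err_identity = group_quant_error(values, group_size)
    err_sorted = group_quant_error([values[i] for i in by_magnitude], group_size)
    if err_sorted < err_identity:
        return by_magnitude, err_sorted
    return identity, err_identity


# Small channels interleaved with outliers, as in the failure case above.
channels = [1.0, 8.0, 1.2, 8.0, 0.8, 8.0, 1.1, 8.0]
order, err = choose_order(channels, group_size=4)
print(order, err, group_quant_error(channels, 4))
```

With the original order, the outliers set the scale of both groups and the small channels are rounded coarsely. After sorting, the small channels share a group with a much finer scale. The acceptance check keeps the original order whenever sorting does not reduce the error.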

preprint · 2026 · arXiv

YOSE: You Only Select Essential Tokens for Efficient DiT-based Video Object Removal

Recent advances in Diffusion Transformer (DiT)-based video generation technologies have shown impressive results for video object removal. However, these methods still suffer from substantial inference latency. For instance, although MiniMax Remover achieves state-of-the-art visual quality, it operates at only around 10 FPS, primarily due to dense computations over the entire spatiotemporal token space, even when only a small masked region actually requires processing. In this paper, we present YOSE (You Only Select Essential Tokens), an efficient fine-tuning framework. YOSE introduces two key components: Batch Variable-length Indexing (BVI) and a Diffusion Process Simulator (DiffSim) module. BVI is a differentiable dynamic indexing operator that adaptively selects essential tokens based on mask information, enabling variable-length token processing across samples. DiffSim provides a diffusion process approximation mechanism for unmasked tokens, which simulates the influence of unmasked regions within DiT self-attention to maintain semantic consistency for masked tokens. With these designs, YOSE achieves mask-aware acceleration, where the inference time scales approximately linearly with the masked regions, in contrast to full-token diffusion methods whose computation remains constant regardless of the mask size. Extensive experiments demonstrate that YOSE achieves up to 2.5x speedup in 70% of cases while maintaining visual quality comparable to the baseline. Code is available at: https://github.com/Wucy0519/YOSE-CVPR26.

preprint · 2021 · arXiv

Ultra-low threshold lasing through phase front engineering via a metallic circular aperture

Semiconductor lasers with ultra-low thresholds and minimal footprints are a topic of active research. Such devices require a combination of high quality factor laser cavities with small active region volumes, which drives the quest for novel cavity geometries exploiting nano-optic concepts. For high-reflectivity coated ridge lasers, where light is tightly confined in the waveguide, a low threshold can only be achieved by strongly reducing the diffraction losses arising at the laser facet. We show here that, somewhat counter-intuitively, opening a carefully designed aperture in a metallic facet coating can simultaneously enhance both its transmission and modal reflectivity by correcting the phase front at the subwavelength scale. Numerical simulations and experimental results demonstrate a reduction of optical mirror loss by up to 40% while the transmission is increased by four orders of magnitude. Applying this approach to both facets of a short cavity quantum cascade laser, we achieve laser operation at room temperature with an electrical dissipation of only 143 mW. Such light sources are especially suitable for portable and battery-operated chemical agent sensing applications operating in the mid-infrared wavelength range, where multiple greenhouse and pollutant gases have their fundamental absorption lines. Our work suggests possibilities for further applications including frequency comb dispersion engineering, and can be implemented in a broad range of optoelectronic systems.

preprint · 2020 · arXiv

A Changing Dichotomy: The Conception of the "Macroscopic" and "Microscopic" Worlds in the History of Physics

This short essay traces the conceptual history of micro- and macroscopicity in the context of physical science. By focusing on three distinct episodes spanning five centuries, we show that the scientific and philosophical meanings of this antonym pair, despite never being far from "the small" and "the large," have evolved as the frontier of science has advanced. We analyze the intellectual and material impetus for these movements, and conclude that this conceptual history reflects the changing interaction between the natural world and humankind.

preprint · 2020 · arXiv

Topological charge of finite-size photonic crystal modes

Topological charges are the winding numbers of polarization vectors around the vortex centers of far-field radiation. In this work, the topological charge of photonic crystal modes is theoretically analyzed using an envelope function approach. A group of modes is discovered with unique polarization properties dictated by their non-trivial envelope functions. Experimentally, lasing operation on such a mode is demonstrated in an electrically pumped mid-infrared photonic crystal surface-emitting laser with high slope efficiency. The topological charge is directly observed from the polarization properties of single-mode laser emission.
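The winding-number definition in the first sentence of this abstract can be made concrete with a short sketch. It is an illustration under stated assumptions and is not the paper's envelope function analysis. It treats the polarization as a directed 2-D vector sampled on a closed loop around the vortex center, so it does not handle fields defined only up to a sign. The names `winding_number` and `sample_field` are hypothetical.

```python
import math


def winding_number(vectors):
    """Count how many times the vectors rotate along a closed loop."""
    total = 0.0
    n = len(vectors)
    for k in range(n):
        x0, y0 = vectors[k]
        x1, y1 = vectors[(k + 1) % n]
        # Signed rotation angle from vector k to vector k + 1.
        total += math.atan2(x0 * y1 - y0 * x1, x0 * x1 + y0 * y1)
    return round(total / (2 * math.pi))


def sample_field(charge, n=64):
    """Toy field whose direction rotates `charge` times per loop."""
    angles = [2 * math.pi * k / n for k in range(n)]
    return [(math.cos(charge * t), math.sin(charge * t)) for t in angles]


print(winding_number(sample_field(-2)))  # prints -2
```

The loop must be sampled finely enough that consecutive vectors rotate by less than half a turn, otherwise the signed angle is ambiguous.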

preprint · 2019 · arXiv

Heralded Generation and Detection of Entangled Microwave-Optical Photon Pairs

Quantum state transfer between microwave and optical frequencies is essential for connecting superconducting quantum circuits to coherent optical systems and extending microwave quantum networks over long distances. To build such a hybrid "quantum Internet," an important experiment in the quantum regime is to entangle microwave and optical modes. Based on the model of a generic cavity electro-optomechanical system, we present a heralded scheme to generate entangled microwave-optical photon pairs, which can bypass the efficiency threshold for quantum channel capacity in direct transfer protocols. We identify a parameter regime for entanglement verification that is compatible with realistic experimental settings. Our scheme is feasible given the latest experimental progress on electro-optomechanics, and can potentially be generalized to various physical systems.