Researcher profile

Kaiwen Zheng

Kaiwen Zheng contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
7 works
0 followers
5 topics
4 close collaborators

Published work

7 published items

preprint, 2026, arXiv

Are Multimodal Embeddings Truly Beneficial for Recommendation? A Deep Dive into Whole vs. Individual Modalities

Multimodal recommendation has emerged as a mainstream paradigm, typically leveraging text and visual embeddings extracted from pre-trained models such as Sentence-BERT, Vision Transformers, and ResNet. This approach is founded on the intuitive assumption that incorporating multimodal embeddings can enhance recommendation performance. However, despite its popularity, this assumption lacks comprehensive empirical verification. This presents a critical research gap. To address it, we pose the central research question of this paper: Are multimodal embeddings truly beneficial for recommendation? To answer this question, we conduct a large-scale empirical study examining the role of text and visual embeddings in modern multimodal recommendation models, both as a whole and individually. Specifically, we pose two key research questions: (1) Do multimodal embeddings as a whole improve recommendation performance? (2) Is each individual modality - text and image - useful when used alone? To isolate the effect of individual modalities - text or visual - we employ a modality knockout strategy by setting the corresponding embeddings to either constant values or random noise. To ensure the scale and comprehensiveness of our study, we evaluate 14 widely used state-of-the-art multimodal recommendation models. Our findings reveal that: (1) multimodal embeddings generally enhance recommendation performance - particularly when integrated through more sophisticated graph-based fusion models. Surprisingly, commonly adopted baseline models with simple fusion schemes, such as VBPR and BM3, show only limited gains. (2) The text modality alone achieves performance comparable to the full multimodal setting in most cases, whereas the image modality alone does not. These results offer foundational insights and practical guidance for the multimodal recommendation community.
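The modality knockout idea described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `knock_out`, the toy embedding values, the zero constant, and the standard-normal noise are all assumptions made for illustration. The abstract only states that the embeddings of the knocked-out modality are set to constant values or random noise.

```python
import random


def knock_out(embeddings, mode="constant", value=0.0, seed=0):
    """Return a new embedding table with the modality's information removed.

    mode="constant": every entry becomes `value` (assumed default: 0.0).
    mode="noise":    every entry is drawn from a standard normal distribution.
    The input table is left unmodified.
    """
    if mode == "constant":
        return [[value] * len(row) for row in embeddings]
    if mode == "noise":
        rng = random.Random(seed)  # seeded so the knockout is reproducible
        return [[rng.gauss(0.0, 1.0) for _ in row] for row in embeddings]
    raise ValueError(f"unknown mode: {mode!r}")


# Toy item embeddings: 3 items, 4 dimensions per modality (made-up numbers).
text_emb = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
image_emb = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]

# Text-only setting: keep text, replace the image modality with a constant.
text_only_inputs = (text_emb, knock_out(image_emb, mode="constant"))

# Image-only setting: keep image, replace the text modality with noise.
image_only_inputs = (knock_out(text_emb, mode="noise", seed=42), image_emb)
```

In a setup like this, the same recommendation model would be trained once per setting, so that any performance difference can be attributed to the modality that was knocked out rather than to a change in architecture.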

preprint, 2026, arXiv

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by roughly 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM.

preprint, 2026, arXiv

Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions

In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis, which incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, FAU and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.

preprint, 2022, arXiv

Assembly development for the Simons Observatory focal plane readout module

The Simons Observatory (SO) is a suite of instruments sensitive to temperature and polarization of the cosmic microwave background (CMB) to be located at Cerro Toco in the Atacama Desert in Chile. Five telescopes, one large aperture telescope and four small aperture telescopes, will host roughly 70,000 highly multiplexed transition edge sensor (TES) detectors operated at 100 mK. Each SO focal plane module (UFM) couples 1,764 TESes to microwave resonators in a microwave multiplexing (uMux) readout circuit. Before detector integration, the 100 mK uMux components are packaged into multiplexing modules (UMMs), which are independently validated to ensure they meet SO performance specifications. Here we present the assembly developments of these UMM readout packages for mid frequency (90/150 GHz) and ultra high frequency (220/280 GHz) UFMs.

preprint, 2022, arXiv

Maximum Likelihood Training for Score-Based Diffusion ODEs by High-Order Denoising Score Matching

Score-based generative models have excellent performance in terms of generation quality and likelihood. They model the data distribution by matching a parameterized score network with first-order data score functions. The score network can be used to define an ODE ("score-based diffusion ODE") for exact likelihood evaluation. However, the relationship between the likelihood of the ODE and the score matching objective is unclear. In this work, we prove that matching the first-order score is not sufficient to maximize the likelihood of the ODE, by showing a gap between the maximum likelihood and score matching objectives. To fill up this gap, we show that the negative likelihood of the ODE can be bounded by controlling the first, second, and third-order score matching errors; and we further present a novel high-order denoising score matching method to enable maximum likelihood training of score-based diffusion ODEs. Our algorithm guarantees that the higher-order matching error is bounded by the training error and the lower-order errors. We empirically observe that by high-order score matching, score-based diffusion ODEs achieve better likelihood on both synthetic data and CIFAR-10, while retaining the high generation quality.

preprint, 2022, arXiv

Simons Observatory Focal-Plane Module: Detector Re-biasing With Bias-step Measurements

The Simons Observatory is a ground-based cosmic microwave background survey experiment that consists of three 0.5 m small-aperture telescopes and one 6 m large-aperture telescope, sited at an elevation of 5200 m in the Atacama Desert in Chile. SO will deploy 60,000 transition-edge sensor (TES) bolometers in 49 separate focal-plane modules across a suite of four telescopes covering 30/40 GHz low frequency (LF), 90/150 GHz mid frequency (MF), and 220/280 GHz ultra-high frequency (UHF). Each MF and UHF focal-plane module packages 1720 optical detectors spreading across 12 detector bias lines that provide voltage biasing to the detectors. During observation, detectors are subject to varying atmospheric emission and hence need to be re-biased accordingly. The re-biasing process includes measuring the detector properties such as the TES resistance and responsivity in a fast manner. Based on the result, detectors within one bias line then are biased with suitable voltage. Here we describe a technique for re-biasing detectors in the modules using the result from bias-step measurement.

preprint, 2022, arXiv

The Simons Observatory 220 and 280 GHz Focal-Plane Module: Design and Initial Characterization

The Simons Observatory (SO) will detect and map the temperature and polarization of the millimeter-wavelength sky from Cerro Toco, Chile across a range of angular scales, providing rich data sets for cosmological and astrophysical analysis. The SO focal planes will be tiled with compact hexagonal packages, called Universal Focal-plane Modules (UFMs), in which the transition-edge sensor (TES) detectors are coupled to 100 mK microwave-multiplexing electronics. Three different types of dichroic TES detector arrays with bands centered at 30/40, 90/150, and 220/280 GHz will be implemented across the 49 planned UFMs. The 90/150 GHz and 220/280 GHz arrays each contain 1,764 TESes, which are read out with two 910x multiplexer circuits. The modules contain a series of densely routed silicon chips, which are packaged together in a controlled electromagnetic environment with robust heat-sinking to 100 mK. Following an overview of the module design, we report on early results from the first 220/280 GHz UFM, including detector yield, as well as readout and detector noise levels.