Source author record

Kaiwen Zheng

Kaiwen Zheng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

astro-ph.IM, Computer Vision, astro-ph.CO, Information Retrieval, Machine Learning

Catalog footprint

What is connected

7 works

5 topics

4 close collaborators

preprint · 2026 · arXiv

Are Multimodal Embeddings Truly Beneficial for Recommendation? A Deep Dive into Whole vs. Individual Modalities

Multimodal recommendation has emerged as a mainstream paradigm, typically leveraging text and visual embeddings extracted from pre-trained models such as Sentence-BERT, Vision Transformers, and ResNet. This approach is founded on the intuitive assumption that incorporating multimodal embeddings can enhance recommendation performance. However, despite its popularity, this assumption lacks comprehensive empirical verification. This presents a critical research gap. To address it, we pose the central research question of this paper: Are multimodal embeddings truly beneficial for recommendation? To answer this question, we conduct a large-scale empirical study examining the role of text and visual embeddings in modern multimodal recommendation models, both as a whole and individually. Specifically, we pose two key research questions: (1) Do multimodal embeddings as a whole improve recommendation performance? (2) Is each individual modality - text and image - useful when used alone? To isolate the effect of individual modalities - text or visual - we employ a modality knockout strategy by setting the corresponding embeddings to either constant values or random noise. To ensure the scale and comprehensiveness of our study, we evaluate 14 widely used state-of-the-art multimodal recommendation models. Our findings reveal that: (1) multimodal embeddings generally enhance recommendation performance - particularly when integrated through more sophisticated graph-based fusion models. Surprisingly, commonly adopted baseline models with simple fusion schemes, such as VBPR and BM3, show only limited gains. (2) The text modality alone achieves performance comparable to the full multimodal setting in most cases, whereas the image modality alone does not. These results offer foundational insights and practical guidance for the multimodal recommendation community.
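The modality knockout described in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the helper name `knock_out`, the list-of-lists embedding layout, and the toy values are all assumptions made for illustration. The idea is only that the knocked-out modality keeps its shape but carries no item-specific information, because it is replaced by a constant or by random noise.

```python
import random

def knock_out(embeddings, mode="constant", constant=0.0, seed=0):
    """Return a same-shaped replacement for `embeddings` that carries no item information.

    Hypothetical helper: `mode` is "constant" (every entry set to `constant`)
    or "noise" (every entry drawn from a standard normal distribution).
    The input is left unmodified.
    """
    if mode == "constant":
        return [[constant] * len(row) for row in embeddings]
    if mode == "noise":
        rng = random.Random(seed)  # seeded so repeated runs are comparable
        return [[rng.gauss(0.0, 1.0) for _ in row] for row in embeddings]
    raise ValueError(f"unknown knockout mode: {mode!r}")

# Toy per-item features (made-up values): 2 items, 3-dim text, 2-dim image.
text = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
image = [[1.0, 2.0], [3.0, 4.0]]

# "Text alone": keep text, knock out image with a constant.
text_only = {"text": text, "image": knock_out(image, mode="constant")}
# "Image alone": keep image, knock out text with random noise.
image_only = {"text": knock_out(text, mode="noise"), "image": image}
```

Either dictionary would then be passed to an unchanged recommendation model, so any difference in performance can be attributed to the missing modality rather than to a change in architecture.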

preprint · 2026 · arXiv

Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by about 4x. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project pages: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM

preprint · 2026 · arXiv

Focal-RegionFace: Generating Fine-Grained Multi-attribute Descriptions for Arbitrarily Selected Face Focal Regions

In this paper, we introduce an underexplored problem in facial analysis: generating and recognizing multi-attribute natural language descriptions, containing facial action units (AUs), emotional states, and age estimation, for arbitrarily selected face regions (termed FaceFocalDesc). We argue that the system's ability to focus on individual facial areas leads to better understanding and control. To achieve this capability, we construct a new multi-attribute description dataset for arbitrarily selected face regions, providing rich region-level annotations and natural language descriptions. Further, we propose a fine-tuned vision-language model based on Qwen2.5-VL, called Focal-RegionFace, for facial state analysis. It incrementally refines its focus on localized facial features through multiple progressive fine-tuning stages, resulting in interpretable age estimation, AU detection, and emotion detection. Experimental results show that Focal-RegionFace achieves the best performance on the new benchmark in terms of traditional and widely used metrics, as well as newly proposed metrics. This fully verifies its effectiveness and versatility in fine-grained multi-attribute face region-focal analysis scenarios.

preprint · 2022 · arXiv

Assembly development for the Simons Observatory focal plane readout module

The Simons Observatory (SO) is a suite of instruments sensitive to temperature and polarization of the cosmic microwave background (CMB) to be located at Cerro Toco in the Atacama Desert in Chile. Five telescopes, one large aperture telescope and four small aperture telescopes, will host roughly 70,000 highly multiplexed transition edge sensor (TES) detectors operated at 100 mK. Each SO focal plane module (UFM) couples 1,764 TESes to microwave resonators in a microwave multiplexing (uMux) readout circuit. Before detector integration, the 100 mK uMux components are packaged into multiplexing modules (UMMs), which are independently validated to ensure they meet SO performance specifications. Here we present the assembly developments of these UMM readout packages for mid frequency (90/150 GHz) and ultra high frequency (220/280 GHz) UFMs.

preprint · 2022 · arXiv

Maximum Likelihood Training for Score-Based Diffusion ODEs by High-Order Denoising Score Matching

Score-based generative models have excellent performance in terms of generation quality and likelihood. They model the data distribution by matching a parameterized score network with first-order data score functions. The score network can be used to define an ODE ("score-based diffusion ODE") for exact likelihood evaluation. However, the relationship between the likelihood of the ODE and the score matching objective is unclear. In this work, we prove that matching the first-order score is not sufficient to maximize the likelihood of the ODE, by showing a gap between the maximum likelihood and score matching objectives. To fill up this gap, we show that the negative likelihood of the ODE can be bounded by controlling the first, second, and third-order score matching errors; and we further present a novel high-order denoising score matching method to enable maximum likelihood training of score-based diffusion ODEs. Our algorithm guarantees that the higher-order matching error is bounded by the training error and the lower-order errors. We empirically observe that by high-order score matching, score-based diffusion ODEs achieve better likelihood on both synthetic data and CIFAR-10, while retaining the high generation quality.
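For orientation, the "score-based diffusion ODE" and its exact likelihood can be written out as follows. This is a sketch in the standard probability-flow notation, not text taken from the paper: the forward SDE form $\mathrm{d}x_t = f(t)\,x_t\,\mathrm{d}t + g(t)\,\mathrm{d}w_t$, the symbols $f$, $g$, $h_\theta$, $s_\theta$, and the superscript ODE are assumptions introduced here.

```latex
\begin{align}
\frac{\mathrm{d}x_t}{\mathrm{d}t}
  &= h_\theta(x_t, t)
  := f(t)\,x_t - \tfrac{1}{2}\,g(t)^2\, s_\theta(x_t, t), \\
\log p_0^{\mathrm{ODE}}(x_0)
  &= \log p_T(x_T)
   + \int_0^T \operatorname{tr}\!\big(\nabla_{x} h_\theta(x_t, t)\big)\,\mathrm{d}t .
\end{align}
```

The second line is the instantaneous change-of-variables formula. Because the trace term involves the Jacobian of the score network $s_\theta$, the ODE likelihood depends on derivatives of the score as well as the score itself. That is consistent with the abstract's claim that first-order score matching alone does not control this likelihood.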

preprint · 2022 · arXiv

Simons Observatory Focal-Plane Module: Detector Re-biasing With Bias-step Measurements

The Simons Observatory is a ground-based cosmic microwave background survey experiment that consists of three 0.5 m small-aperture telescopes and one 6 m large-aperture telescope, sited at an elevation of 5200 m in the Atacama Desert in Chile. SO will deploy 60,000 transition-edge sensor (TES) bolometers in 49 separate focal-plane modules across a suite of four telescopes covering 30/40 GHz low frequency (LF), 90/150 GHz mid frequency (MF), and 220/280 GHz ultra-high frequency (UHF). Each MF and UHF focal-plane module packages 1720 optical detectors spread across 12 detector bias lines that provide voltage biasing to the detectors. During observation, detectors are subject to varying atmospheric emission and hence need to be re-biased accordingly. The re-biasing process includes quickly measuring detector properties such as the TES resistance and responsivity. Based on the result, the detectors on each bias line are then biased with a suitable voltage. Here we describe a technique for re-biasing detectors in the modules using the results of bias-step measurements.

preprint · 2022 · arXiv

The Simons Observatory 220 and 280 GHz Focal-Plane Module: Design and Initial Characterization

The Simons Observatory (SO) will detect and map the temperature and polarization of the millimeter-wavelength sky from Cerro Toco, Chile across a range of angular scales, providing rich data sets for cosmological and astrophysical analysis. The SO focal planes will be tiled with compact hexagonal packages, called Universal Focal-plane Modules (UFMs), in which the transition-edge sensor (TES) detectors are coupled to 100 mK microwave-multiplexing electronics. Three different types of dichroic TES detector arrays with bands centered at 30/40, 90/150, and 220/280 GHz will be implemented across the 49 planned UFMs. The 90/150 GHz and 220/280 GHz arrays each contain 1,764 TESes, which are read out with two 910x multiplexer circuits. The modules contain a series of densely routed silicon chips, which are packaged together in a controlled electromagnetic environment with robust heat-sinking to 100 mK. Following an overview of the module design, we report on early results from the first 220/280 GHz UFM, including detector yield, as well as readout and detector noise levels.
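The readout figures in this abstract can be checked with a short calculation. This is a sketch only: it reads "910x" as 910 readout channels per multiplexer circuit, which is an interpretation and not something the abstract states, and the function name `readout_headroom` is hypothetical.

```python
# Figures quoted in the abstract above.
TES_PER_ARRAY = 1764         # TESes in each 90/150 GHz or 220/280 GHz array
MUX_CIRCUITS_PER_MODULE = 2  # multiplexer circuits reading out one array
# Assumption: "910x" means 910 readout channels per multiplexer circuit.
CHANNELS_PER_CIRCUIT = 910

def readout_headroom(detectors, circuits, channels_per_circuit):
    """Return (total readout channels, channels not assigned to a detector)."""
    capacity = circuits * channels_per_circuit
    return capacity, capacity - detectors

capacity, unused = readout_headroom(
    TES_PER_ARRAY, MUX_CIRCUITS_PER_MODULE, CHANNELS_PER_CIRCUIT
)
print(capacity, unused)  # 1820 56
```

Under that reading, two circuits provide 1,820 channels, enough for the 1,764 TESes with 56 channels left over. The abstract does not say how any remaining channels are used.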

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Source provenance

Where this author record came from

arXiv · confidence 95%

external id: arxiv:2601.00156:author:1:kaiwen-zheng

Imported May 21, 2026 · Synced May 21, 2026

arXiv · confidence 95%

external id: arxiv:2605.15141:author:3:kaiwen-zheng

Imported May 20, 2026 · Synced May 20, 2026

3 works

Erin Healy

3 works

Steve K. Choi

3 works

Yuhan Wang

2 works

Bradley R. Johnson
