Source author record

Zhuo Chen

Zhuo Chen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Catalog footprint

What is connected

68 works

39 topics

4 close collaborators

preprint, 2026, arXiv

A three-dimensional multimode lumped-element resonator for collective spin manipulation and dispersive readout

We report a three-dimensional lumped-element multimode microwave resonator that enables homogeneous collective manipulation and dispersive readout of a macroscopic spin ensemble. By exploiting geometric symmetry, two antisymmetric modes with strongly suppressed cross-talk are engineered to spatially overlap and couple to the same ensemble at distinct frequencies. Using negatively charged nitrogen-vacancy centers in diamond at 28 mK, we observe collective strong coupling with a coupling strength of 5.0 MHz and demonstrate non-destructive dispersive readout via a detuned mode. The compact design, tunable coupling, and high field homogeneity make this resonator a versatile device for hybrid spin-photon systems and multimode solid-state quantum technologies.

preprint, 2026, arXiv

Argus: Evidence Assembly for Scalable Deep Research Agents

Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

preprint, 2026, arXiv

Binarisation-loophole-free observation of high-dimensional quantum nonlocality

Bell inequality tests based on high-dimensional entanglement usually require measurements that can resolve multiple possible outcomes. However, the implementation of high-dimensional multi-outcome measurements is often only emulated via a collection of "click or no-click" measurements. This reduction of multi-outcome measurements to binary-outcome measurements opens a loophole in high-dimensional tests of Bell inequalities which can be exploited by local hidden variable models [Tavakoli et al., Phys. Rev. A 111, 042433 (2025)]. Here, we close this loophole by using four-dimensional photonic path-mode entanglement and multi-outcome detection. We test both the well-known Collins-Gisin-Linden-Massar-Popescu inequality and a related Bell inequality tailored for maximally entangled states in high dimension. We observe violations that are large enough to also rule out any quantum model based on entanglement of lower dimension, thereby demonstrating genuinely high-dimensional nonlocality free of the binarisation loophole.

preprint, 2026, arXiv

Hint Tuning: Less Data Makes Better Reasoners

Large reasoning models achieve high accuracy through extended chain-of-thought but generate 5-8× more tokens than necessary, applying verbose reasoning uniformly regardless of problem difficulty. We propose Hint Tuning, a data-efficient approach that teaches models to calibrate reasoning depth. Our key insight: the corresponding instruct model serves as an ideal difficulty probe. By testing what the instruct model can solve with varying guidance, we automatically construct training data across three states: No-Hint (direct answer), Sparse-Hint (minimal prefix), and Full-Hint (complete reasoning). This converts the abstract challenge of difficulty labeling into a measurable consistency check between the instruct and reasoning models. With only 1K self-annotated samples, Hint Tuning achieves 24-66% token reduction (31.5% average) across mainstream reasoning models (Qwen3-Thinking, DeepSeek-R1-Distill) at multiple scales (4B-32B) while maintaining competitive accuracy on five benchmarks. Unlike methods requiring massive distillation datasets or expensive RL, we achieve superior efficiency through simple alignment with the instruct model's capabilities.

preprint, 2026, arXiv

LORE: A Large Generative Model for Search Relevance

Achievement. We introduce LORE, a systematic framework for Large Generative Model-based relevance in e-commerce search. Deployed and iterated over three years, LORE achieves a cumulative +27% improvement in online GoodRate metrics. This report shares the valuable experience gained throughout its development lifecycle, spanning data, features, training, evaluation, and deployment. Insight. While existing works apply Chain-of-Thought (CoT) to enhance relevance, they often hit a performance ceiling. We argue this stems from treating relevance as a monolithic task, lacking principled deconstruction. Our key insight is that relevance comprises distinct capabilities: knowledge and reasoning, multi-modal matching, and rule adherence. We contend that a qualitative-driven decomposition is essential for breaking through current performance bottlenecks. Contributions. LORE provides a complete blueprint for the LLM relevance lifecycle. Key contributions include: (1) A two-stage training paradigm combining progressive CoT synthesis via SFT with human preference alignment via RL. (2) A comprehensive benchmark, RAIR, designed to evaluate these core capabilities. (3) A query frequency-stratified deployment strategy that efficiently transfers offline LLM capabilities to the online system. LORE serves as both a practical solution and a methodological reference for other vertical domains.

preprint, 2026, arXiv

MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

Autoregressive (AR) models can generate high-quality low-poly meshes from point clouds, but they still operate in an all-or-nothing manner: when a local region is unsatisfactory, the entire mesh must be regenerated, wasting computation and destroying satisfactory mesh structure elsewhere. We introduce MeshFIM, a Fill-in-the-Middle (FIM) framework that regenerates a target region of a low-poly mesh conditioned on the surrounding context. MeshFIM addresses three mesh-specific challenges: enforcing exact attachment along the exposed boundary, preserving topological order in the context, and suppressing overflow beyond the intended region. It does so with five complementary design choices: boundary vertex markers, context positional embeddings, expanded context width, context augmentation, and a low-poly geometry encoder whose gated subtraction mechanism focuses generation on the missing region by leveraging the difference between the reference surface and the existing mesh. Detailed ablation studies are presented to show the effectiveness of every introduced component. Based on MeshFIM, we demonstrate two applications: interactive brush-based editing and automatic defect repair on low-poly mesh (see Figure 1). Last but not least, experiments show that MeshFIM outperforms a range of baselines in mesh refinement, mesh repair and whole mesh generation plus stitch-back scheme.

preprint, 2026, arXiv

Optimal Confidence Band for Kernel Gradient Flow Estimator

In this paper, we investigate the supremum-norm generalization error and the uniform inference for a specific class of kernel regression methods, namely the kernel gradient flows. Under the widely adopted capacity-source condition framework in the kernel regression literature, we first establish convergence rates for the supremum-norm generalization error of both continuous and discrete kernel gradient flows under the source condition $s>α_0$, where $α_0\in(0,1)$ denotes the embedding index of the kernel function. Moreover, we show that these rates match the minimax optimal rates. Building on this result, we then construct simultaneous confidence bands for both continuous and discrete kernel gradient flows. Notably, the widths of the proposed confidence bands are also optimal, in the sense that their shrinkage rates are greater than, yet can be arbitrarily close to, the minimax optimal rates.

preprint, 2026, arXiv

PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

Synthesizing physics-grounded 3D assets is a critical bottleneck for interactive virtual worlds and embodied AI. Existing methods predominantly focus on static geometry, overlooking the functional properties essential for interaction. We propose that interactive asset generation must be rooted in functional logic and hierarchical physics. To bridge this gap, we introduce PhysForge, a decoupled two-stage framework supported by PhysDB, a large-scale dataset of 150,000 assets with four-tier physical annotations. First, a VLM acts as a "physical architect" to plan a "Hierarchical Physical Blueprint" defining material, functional, and kinematic constraints. Second, a physics-grounded diffusion model realizes this blueprint by synthesizing high-fidelity geometry alongside precise kinematic parameters via a novel KineVoxel Injection (KVI) mechanism. Experiments demonstrate that PhysForge produces functionally plausible, simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents.

preprint, 2026, arXiv

Temporal Knowledge Graph Question Answering: A Survey

Knowledge Base Question Answering (KBQA) has been a long-standing field to answer questions based on knowledge bases. Recently, the evolving dynamics of knowledge have attracted a growing interest in Temporal Knowledge Graph Question Answering (TKGQA), an emerging task to answer temporal questions. However, this field grapples with ambiguities in defining temporal questions and lacks a systematic categorization of existing methods for TKGQA. In response, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a detailed taxonomy of temporal questions engaged in prior studies. Subsequently, we provide a comprehensive review of TKGQA techniques of two categories: semantic parsing-based and TKG embedding-based. Building on this review, the paper outlines potential research directions aimed at advancing the field of TKGQA. This work aims to serve as a comprehensive reference for TKGQA and to stimulate further research.

preprint, 2026, arXiv

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains 90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.

preprint, 2023, arXiv

Dipolar Spin Liquid Ending with Quantum Critical Point in a Gd-based Triangular Magnet

By performing experimental and model studies on a triangular-lattice dipolar magnet KBaGd(BO$_3$)$_2$ (KBGB), we find that this highly frustrated magnet with a planar anisotropy hosts a strongly fluctuating dipolar spin liquid (DSL), which originates from the intriguing interplay between dipolar and Heisenberg interactions. The DSL constitutes an extended regime in the field-temperature phase diagram, which gets lowered in temperature as field increases and eventually ends with an unconventional quantum critical point (QCP) at $B_c\simeq 0.75$ T. Based on dipolar Heisenberg model calculations, we identify the DSL as a Berezinskii-Kosterlitz-Thouless (BKT) phase with emergent U(1) symmetry. Due to the tremendous entropy accumulation that can be related to the strong BKT and quantum fluctuations, unprecedented magnetic cooling effects are observed in the DSL regime and particularly near the QCP, making KBGB a superior dipolar coolant to commercial Gd-based refrigerants. We establish the phase diagram for triangular-lattice dipolar quantum magnets where emergent symmetry plays an essential role, and provide a basis and open an avenue for their applications in sub-Kelvin refrigeration.

preprint, 2023, arXiv

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

preprint, 2022, arXiv

A GLSM view on Homological Projective Duality

Given a gauged linear sigma model (GLSM) $\mathcal{T}_{X}$ realizing a projective variety $X$ in one of its phases, i.e. its quantum Kähler moduli has a maximally unipotent point, we propose an extended GLSM $\mathcal{T}_{\mathcal{X}}$ realizing the homological projective dual category $\mathcal{C}$ to $D^{b}Coh(X)$ as the category of B-branes of the Higgs branch of one of its phases. In most of the cases, the models $\mathcal{T}_{X}$ and $\mathcal{T}_{\mathcal{X}}$ are anomalous and the analysis of their Coulomb and mixed Coulomb-Higgs branches gives information on the semiorthogonal/Lefschetz decompositions of $\mathcal{C}$ and $D^{b}Coh(X)$. We also study the models $\mathcal{T}_{X_{L}}$ and $\mathcal{T}_{\mathcal{X}_{L}}$ that correspond to homological projective duality of linear sections $X_{L}$ of $X$. This explains why, in many cases, two phases of a GLSM are related by homological projective duality. We study mostly abelian examples: linear and Veronese embeddings of $\mathbb{P}^{n}$ and Fano complete intersections in $\mathbb{P}^{n}$. In such cases, we are able to reproduce known results as well as produce some new conjectures. In addition, we comment on the construction of the HPD to a nonabelian GLSM for the Plücker embedding of the Grassmannian $G(k,N)$.

preprint, 2022, arXiv

Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs

In this paper, we propose a Collaboration of Experts (CoE) framework to pool together the expertise of multiple networks towards a common aim. Each expert is an individual network with expertise on a unique portion of the dataset, which enhances the collective capacity. Given a sample, an expert is selected by the delegator, which simultaneously outputs a rough prediction to support early termination. To fulfill this framework, we propose three modules to impel each model to play its role, namely weight generation module (WGM), label generation module (LGM) and variance calculation module (VCM). Our method achieves the state-of-the-art performance on ImageNet, 80.7% top-1 accuracy with 194M FLOPs. Combined with PWLU activation function and CondConv, CoE further achieves the accuracy of 80.0% with only 100M FLOPs for the first time. More importantly, our method is hardware friendly and achieves a 3-6x speedup compared with some existing conditional computation approaches.

preprint, 2022, arXiv

Continuous Streaming Multi-Talker ASR with Dual-path Transducers

Streaming recognition of multi-talker conversations has so far been evaluated only for 2-speaker single-turn sessions. In this paper, we investigate it for multi-turn meetings containing multiple speakers using the Streaming Unmixing and Recognition Transducer (SURT) model, and show that naively extending the single-turn model to this harder setting incurs a performance penalty. As a solution, we propose the dual-path (DP) modeling strategy first used for time-domain speech separation. We experiment with LSTM and Transformer based DP models, and show that they improve word error rate (WER) performance while yielding faster convergence. We also explore training strategies such as chunk width randomization and curriculum learning for these models, and demonstrate their importance through ablation studies. Finally, we evaluate our models on the LibriCSS meeting data, where they perform competitively with offline separation-based methods.

preprint, 2022, arXiv

Dirac generating operators of split Courant algebroids

Given a vector bundle $A$ over a smooth manifold $M$ such that the square root $\mathcal{L}$ of the line bundle $\wedge^{\mathrm{top}}A^\ast \otimes \wedge^{\mathrm{top}}T^\ast M$ exists, the Clifford bundle associated to the split pseudo-Euclidean vector bundle $(E = A \oplus A^\ast, \langle \cdot, \cdot \rangle)$, admits a spinor bundle $\wedge^\bullet A \otimes \mathcal{L}$, whose section space can be thought of as that of Berezinian half-densities of the graded manifold $A^\ast[1]$. We give an explicit construction of Dirac generating operators of split Courant algebroid (or proto-bialgebroid) structures on $A \oplus A^\ast$ introduced by Alekseev and Xu. We also prove that the square of the Dirac generating operator gives rise to an invariant of the split Courant algebroid.

preprint, 2022, arXiv

Disentangled Ontology Embedding for Zero-shot Learning

Knowledge Graph (KG) and its variant of ontology have been widely used for knowledge representation, and have been shown to be quite effective in augmenting Zero-shot Learning (ZSL). However, existing ZSL methods that utilize KGs all neglect the intrinsic complexity of inter-class relationships represented in KGs. One typical feature is that a class is often related to other classes in different semantic aspects. In this paper, we focus on ontologies for augmenting ZSL, and propose to learn disentangled ontology embeddings guided by ontology properties to capture and utilize more fine-grained class relationships in different aspects. We also contribute a new ZSL framework named DOZSL, which contains two new ZSL solutions based on generative models and graph propagation models, respectively, for effectively utilizing the disentangled ontology embeddings. Extensive evaluations have been conducted on five benchmarks across zero-shot image classification (ZS-IMGC) and zero-shot KG completion (ZS-KGC). DOZSL often achieves better performance than the state-of-the-art, and its components have been verified by ablation studies and case studies. Our codes and datasets are available at https://github.com/zjukg/DOZSL.

preprint, 2022, arXiv

Molecular Contrastive Learning with Chemical Element Knowledge Graph

Molecular representation learning contributes to multiple downstream tasks such as molecular property prediction and drug design. To properly represent molecules, graph contrastive learning is a promising paradigm as it utilizes self-supervision signals and has no requirements for human annotations. However, prior works fail to incorporate fundamental domain knowledge into graph semantics and thus ignore the correlations between atoms that have common attributes but are not directly connected by bonds. To address these issues, we construct a Chemical Element Knowledge Graph (KG) to summarize microscopic associations between elements and propose a novel Knowledge-enhanced Contrastive Learning (KCL) framework for molecular representation learning. KCL framework consists of three modules. The first module, knowledge-guided graph augmentation, augments the original molecular graph based on the Chemical Element KG. The second module, knowledge-aware graph representation, extracts molecular representations with a common graph encoder for the original molecular graph and a Knowledge-aware Message Passing Neural Network (KMPNN) to encode complex information in the augmented molecular graph. The final module is a contrastive objective, where we maximize agreement between these two views of molecular graphs. Extensive experiments demonstrated that KCL obtained superior performances against state-of-the-art baselines on eight molecular datasets. Visualization experiments properly interpret what KCL has learned from atoms and attributes in the augmented molecular graphs. Our codes and data are available at https://github.com/ZJU-Fangyin/KCL.

preprint, 2022, arXiv

Planetary Accretion Shocks with a Realistic Equation of State

The final stage of gas giant formation involves accreting gas from the parent protoplanetary disk. In general, the infalling gas likely approaches a free-fall velocity, creating an accretion shock, leading to strong shock heating and radiation. We investigate the kinematics and energetics of such accretion shocks using 1D radiation hydrodynamic simulations. Our simulations feature the first self-consistent treatment of hydrogen dissociation and ionization, radiation transport, and realistic grey opacity. By exploring a broad range of giant planet masses (0.1-3 M$_{J}$) and accretion rates ($10^{-3}$-$10^{-2}$M$_{\oplus}\cdot\rm{yr}^{-1}$), we focus on global shock efficiency and the final entropy of the accreted gas. We find that radiation from the accretion shock can fully dissociate the molecular hydrogen of the incoming gas when the shock luminosity is above a critical luminosity. Meanwhile, the post-shock entropy generally falls into "cold" ($<12k_{\rm{B}}/m_{\rm H}$) and "hot" ($>16k_{\rm{B}}/m_{\rm H}$) groups, depending on the extent of the endothermic process of $\rm{H}_2$ dissociation. While 2D or 3D simulations are needed for a more realistic understanding of the accretion process, this distinction likely carries over and sheds light on the interpretation of young direct imaging planets.

preprint, 2022, arXiv

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models using multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates the change of "virtual" output channels is introduced to keep track of the overlapping utterances. Compared to the prior streaming multi-talker ASR models, the t-SOT model has the advantages of less inference cost and a simpler model architecture. Moreover, in our experiments with LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves the state-of-the-art word error rates by a significant margin to the prior results. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door for deploying one model for both single- and multi-talker scenarios.

preprint, 2022, arXiv

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency. We evaluate the proposed model for a joint task of ASR and SID/SD by using LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.

preprint, 2022, arXiv

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

This paper presents Transcribe-to-Diarize, a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR). The E2E SA-ASR is a joint model that was recently proposed for speaker counting, multi-talker speech recognition, and speaker identification from monaural audio that contains overlapping speech. Although the E2E SA-ASR model originally does not estimate any time-related information, we show that the start and end times of each word can be estimated with sufficient accuracy from the internal state of the E2E SA-ASR by adding a small number of learnable parameters. Similar to the target-speaker voice activity detection (TS-VAD)-based diarization method, the E2E SA-ASR model is applied to estimate speech activity of each speaker while it has the advantages of (i) handling unlimited number of speakers, (ii) leveraging linguistic information for speaker diarization, and (iii) simultaneously generating speaker-attributed transcriptions. Experimental results on the LibriCSS and AMI corpora show that the proposed method achieves significantly better diarization error rate than various existing speaker diarization methods when the number of speakers is unknown, and achieves a comparable performance to TS-VAD when the number of speakers is given in advance. The proposed method simultaneously generates speaker-attributed transcription with state-of-the-art accuracy.

preprint, 2022, arXiv

Ultra Fast Speech Separation Model with Teacher Student Learning

Transformer has been successfully applied to speech separation recently with its strong long-dependency modeling capacity using a self-attention mechanism. However, Transformer tends to have heavy run-time costs due to the deep encoder layers, which hinders its deployment on edge devices. A small Transformer model with fewer encoder layers is preferred for computational efficiency, but it is prone to performance degradation. In this paper, an ultra fast speech separation Transformer model is proposed to achieve both better performance and efficiency with teacher student learning (T-S learning). We introduce layer-wise T-S learning and objective shifting mechanisms to guide the small student model to learn intermediate representations from the large teacher model. Compared with the small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation on LibriCSS dataset. Utilizing more unlabeled speech data, our ultra fast speech separation models achieve more than 10% relative WER reduction.

preprint, 2022, arXiv

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Recently, self-supervised learning (SSL) has demonstrated strong performance in speaker recognition, even if the pre-training objective is designed for speech recognition. In this paper, we study which factor leads to the success of self-supervised learning on speaker-related tasks, e.g. speaker verification (SV), through a series of carefully designed experiments. Our empirical results on the Voxceleb-1 dataset suggest that the benefit of SSL to SV task is from a combination of mask speech prediction loss, data scale, and model size, while the SSL quantizer has a minor impact. We further employ the integrated gradients attribution method and loss landscape visualization to understand the effectiveness of self-supervised learning for speaker recognition performance.

preprint, 2021, arXiv

Continuous Speech Separation with Ad Hoc Microphone Arrays

Speech separation has been shown to be effective for multi-talker speech recognition. Under the ad hoc microphone array setup where the array consists of spatially distributed asynchronous microphones, additional challenges must be overcome as the geometry and number of microphones are unknown beforehand. Prior studies show that, with a spatial-temporal interleaving structure, neural networks can efficiently utilize the multi-channel signals of the ad hoc array. In this paper, we further extend this approach to continuous speech separation. Several techniques are introduced to enable speech separation for real continuous recordings. First, we apply a transformer-based network for spatio-temporal modeling of the ad hoc array signals. In addition, two methods are proposed to mitigate a speech duplication problem during single talker segments, which seems more severe in the ad hoc array scenarios. One method is device distortion simulation for reducing the acoustic mismatch between simulated training data and real recordings. The other is speaker counting to detect the single speaker segments and merge the output signal channels. Experimental results for AdHoc-LibiCSS, a new dataset consisting of continuous recordings of concatenated LibriSpeech utterances obtained by multiple different devices, show the proposed separation method can significantly improve the ASR accuracy for overlapped speech with little performance degradation for single talker segments.

preprint 2021 arXiv

Dual-Path Modeling for Long Recording Speech Separation in Meetings

Continuous speech separation (CSS) is the task of separating the speech sources from a long, partially overlapped recording that involves a varying number of speakers. A straightforward extension of conventional utterance-level speech separation to the CSS task is to segment the long recording with a fixed-size window and process each window separately. Though effective, this extension fails to model long-range dependencies in speech and thus leads to suboptimal performance. The recently proposed dual-path modeling could be a remedy to this problem, thanks to its capability to jointly model the cross-window dependency and the local-window processing. In this work, we further extend the dual-path modeling framework to the CSS task. A transformer-based dual-path system is proposed, which integrates transformer layers for global modeling. The proposed models are applied to LibriCSS, a real-recorded multi-talker dataset, and a consistent WER reduction is observed in the ASR evaluation of the separated speech. In addition, a dual-path transformer equipped with convolutional layers is proposed. It reduces the amount of computation by 30% while achieving a better WER. Furthermore, online-processing dual-path models are investigated, which show a 10% relative WER reduction compared to the baseline.

preprint 2021 arXiv

Measuring the alpha-abundance of subsolar-metallicity stars in the Milky Way's central half-parsec: testing globular cluster and dwarf galaxy infall scenarios

While the Milky Way nuclear star cluster has been studied extensively, how it formed is uncertain. Studies have shown that it contains a solar and supersolar metallicity population that may have formed in situ, along with a subsolar-metallicity population that may have formed via mergers of globular clusters and dwarf galaxies. Stellar abundance measurements are critical to differentiate between formation scenarios. We present new measurements of [$M/H$] and $\alpha$-element abundances [$\alpha/Fe$] of two subsolar-metallicity stars in the Galactic Center. These observations were taken with the adaptive-optics-assisted high-resolution (R=24,000) spectrograph NIRSPEC in the K-band (1.8 - 2.6 micron). These are the first $\alpha$-element abundance measurements of subsolar-metallicity stars in the Milky Way nuclear star cluster. We measure [$M/H$]=$-0.59\pm 0.11$, [$\alpha/Fe$]=$0.05\pm 0.15$ and [$M/H$]=$-0.81\pm 0.12$, [$\alpha/Fe$]=$0.15\pm 0.16$ for the two stars at the Galactic Center; the uncertainties are dominated by systematic uncertainties in the spectral templates. The stars have an [$\alpha/Fe$] between the [$\alpha/Fe$] of globular clusters and dwarf galaxies at similar [$M/H$] values. Their abundances are very different from those of the bulk of the stars in the nuclear star cluster. These results indicate that the subsolar-metallicity population in the Milky Way nuclear star cluster likely originated from infalling dwarf galaxies or globular clusters and is unlikely to have formed in situ.

preprint 2021 arXiv

OntoZSL: Ontology-enhanced Zero-shot Learning

Zero-shot Learning (ZSL), which aims to predict classes that have never appeared in the training data, has attracted considerable research interest. The key to implementing ZSL is to leverage prior knowledge of classes, which builds the semantic relationship between classes and enables the transfer of learned models (e.g., features) from training classes (i.e., seen classes) to unseen classes. However, the priors adopted by existing methods are relatively limited and have incomplete semantics. In this paper, we explore richer and more competitive prior knowledge to model the inter-class relationship for ZSL via ontology-based knowledge representation and semantic embedding. Meanwhile, to address the data imbalance between seen classes and unseen classes, we develop a generative ZSL framework with Generative Adversarial Networks (GANs). Our main contributions include: (i) an ontology-enhanced ZSL framework that can be applied to different domains, such as image classification (IMGC) and knowledge graph completion (KGC); (ii) a comprehensive evaluation with multiple zero-shot datasets from different domains, where our method often achieves better performance than the state-of-the-art models. In particular, on four representative ZSL baselines of IMGC, the ontology-based class semantics outperform the previous priors, e.g., the word embeddings of classes, by an average of 12.4 accuracy points in the standard ZSL setting across two example datasets (see Figure 4).

68 published item(s)

A three-dimensional multimode lumped-element resonator for collective spin manipulation and dispersive readout

Argus: Evidence Assembly for Scalable Deep Research Agents

Binarisation-loophole-free observation of high-dimensional quantum nonlocality

Hint Tuning: Less Data Makes Better Reasoners

LORE: A Large Generative Model for Search Relevance

MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

Optimal Confidence Band for Kernel Gradient Flow Estimator

PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

Temporal Knowledge Graph Question Answering: A Survey

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

Dipolar Spin Liquid Ending with Quantum Critical Point in a Gd-based Triangular Magnet

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

A GLSM view on Homological Projective Duality

Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs

Continuous Streaming Multi-Talker ASR with Dual-path Transducers

Dirac generating operators of split Courant algebroids

Disentangled Ontology Embedding for Zero-shot Learning

Molecular Contrastive Learning with Chemical Element Knowledge Graph

Planetary Accretion Shocks with a Realistic Equation of State

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

Ultra Fast Speech Separation Model with Teacher Student Learning

Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?

Continuous Speech Separation with Ad Hoc Microphone Arrays

Dual-Path Modeling for Long Recording Speech Separation in Meetings

Measuring the alpha-abundance of subsolar-metallicity stars in the Milky Way's central half-parsec: testing globular cluster and dwarf galaxy infall scenarios

OntoZSL: Ontology-enhanced Zero-shot Learning

A 3D radiation-hydrodynamic AGB binary model

A Modular Interpretation of BBGS Towers

An End-to-end Architecture of Online Multi-channel Speech Separation

Bipolar Planetary Nebulae from Outflow Collimation by Common Envelope Evolution

Continuous speech separation: dataset and analysis

Deep Multi-Task Learning for Cooperative NOMA: System Design and Principles

Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation

End-to-end Microphone Permutation and Number Invariant Multi-channel Speech Separation

Generative Adversarial Zero-shot Learning via Knowledge Graphs

Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings

Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

Justifications for Goal-Directed Constraint Answer Set Programming

Neural Speech Separation Using Spatially Distributed Microphones

Spectrum Intelligent Radio: Technology, Development, and Future Trends

Task Offloading for Large-Scale Asynchronous Mobile Edge Computing: An Index Policy Approach

ViP: Virtual Pooling for Accelerating CNN-based Image Classification and Object Detection

Kapranov's construction of sh Leibniz algebras

Accretion in Common Envelope Evolution

Shifted derived Poisson manifolds associated with Lie pairs

End-to-End Attention based Text-Dependent Speaker Verification

A Physician Advisory System for Chronic Heart Failure Management Based on Knowledge Patterns

Holomorphic Poisson Structures and its Cohomology on Nilmanifolds

Single-Channel Multi-Speaker Separation using Deep Clustering

The Creation of AGB Fallback Shells

Toda-like (0,2) mirrors to products of projective spaces

A Novel Approach for Clone Group Mapping by using Topic Modeling

Deep clustering: Discriminative embeddings for segmentation and separation

A Hopf algebra associated to a Lie pair

Holomorphic Poisson Cohomology

Weak Lie 2-bialgebra

Advanced Asymmetrical Supercapacitors Based on Graphene Hybrid Materials

Dirac structures of omni-Lie algebroids

E-Courant algebroids

Evolution of Cooperation among Mobile Agents

Near-Infrared Fluorescence Enhanced (NIR-FE) Molecular Imaging of Live Cells on Gold Substrates

On Double Vector Bundles

A Novel Clustering Algorithm Based Upon Games on Evolving Network

Evolutionary Prisoner's Dilemma Game in Flocks

Multiplexed five-color molecular imaging of cancer cells and tumor tissues with carbon nanotube Raman tags in the near-infrared

TiO2 Nanocrystals Grown on Graphene as Advanced Photocatalytic Hybrid Materials