Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
21 works
0 followers
20 topics
4 close collaborators

Published work

21 published items

preprint, 2026, arXiv

A microscopic origin for the breakdown of the Stokes Einstein relation in ion transport

Ion transport underlies the operation of biological ion channels and governs the performance of electrochemical energy-storage devices. A long-standing anomaly is that smaller alkali metal ions, such as Li$^+$, migrate more slowly in water than larger ions, in apparent violation of the Stokes-Einstein relation. This breakdown is conventionally attributed to dielectric friction, a collective drag force arising from electrostatic interactions between a drifting ion and its surrounding solvent. Here, combining nanopore transport measurements over electric fields spanning several orders of magnitude with molecular dynamics simulations, we show that the time-averaged electrostatic force on a migrating ion is not a drag force but a net driving force. By contrasting charged ions with neutral particles, we reveal that ionic charge introduces additional Lorentzian peaks in the frequency-dependent friction coefficient. These peaks originate predominantly from short-range Lennard-Jones (LJ) interactions within the first hydration layer and represent additional channels for energy dissipation, strongest for Li$^+$ and progressively weaker for Na$^+$ and K$^+$. Our results demonstrate that electrostatic interactions primarily act to tighten the local hydration structure, thereby amplifying short-range LJ interactions rather than directly opposing ion motion. This microscopic mechanism provides a unified physical explanation for the breakdown of the Stokes-Einstein relation in aqueous ion transport.
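
For reference, the Stokes-Einstein relation that the abstract refers to can be written in its standard textbook form. The symbols are not taken from the preprint: $D$ is the diffusion coefficient, $k_B$ the Boltzmann constant, $T$ the temperature, $\eta$ the solvent viscosity, and $r$ the hydrodynamic radius of a sphere with stick boundary conditions.

```latex
D = \frac{k_B T}{6 \pi \eta r}
```

Under this relation a smaller ion would be expected to diffuse faster, which is why the slower migration of Li$^+$ compared with larger alkali ions counts as an anomaly.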

preprint, 2026, arXiv

Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives

Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nevertheless, the robustness of these learning-based solvers has remained insufficiently explored, especially across diverse and complex problem distributions. In this paper, we propose a unified robustness-oriented framework for preference-conditioned DRL solvers for MOCOPs. Within this framework, we develop a preference-based adversarial attack to generate hard instances that expose solver weaknesses, and quantify the attack impact by the resulting degradation on Pareto-front quality. We further introduce a defense strategy that integrates hardness-aware preference selection into adversarial training to reduce overfitting to restricted preference regions and improve out-of-distribution performance. The experimental results on multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) verify that our attack method successfully learns hard instances for different solvers. Furthermore, our defense method significantly strengthens the robustness and generalizability of neural solvers, delivering superior performance on hard or out-of-distribution instances.

preprint, 2026, arXiv

Beyond Seen Bounds: Class-Centric Polarization for Single-Domain Generalized Deep Metric Learning

Single-domain generalized deep metric learning (SDG-DML) faces the dual challenge of category and domain shifts during testing, which limits its real-world applications. Learning to generalize to both unseen categories and unseen domains is therefore a realistic goal for the SDG-DML task. To this end, existing SDG-DML methods employ a domain expansion-equalization strategy to expand the source data and generate out-of-distribution samples. However, these methods rely on proxy-based expansion, which tends to generate samples clustered near class proxies, failing to simulate the broad and distant domain shifts encountered in practice. To alleviate the problem, we propose CenterPolar, a novel SDG-DML framework that dynamically expands and constrains domain distributions to learn a generalizable DML model for wider target domain distributions. Specifically, CenterPolar contains two collaborative class-centric polarization phases: (1) Class-Centric Centrifugal Expansion ($C^3E$) and (2) Class-Centric Centripetal Constraint ($C^4$). In the first phase, $C^3E$ drives the source domain distribution by shifting the source data away from class centroids using centrifugal expansion to generalize to more unseen domains. In the second phase, to consolidate domain-invariant class information for the generalization ability to unseen categories, $C^4$ pulls all seen and unseen samples toward their class centroids while enforcing inter-class separation via centripetal constraint. Extensive experimental results on the widely used CUB-200-2011 Ext., Cars196 Ext., DomainNet, PACS, and Office-Home datasets demonstrate the superiority and effectiveness of CenterPolar over existing state-of-the-art methods. The code will be released after acceptance.

preprint, 2026, arXiv

Controllable Video Generation: A Survey

With the rapid development of AI-generated content (AIGC), video generation has emerged as one of its most dynamic and impactful subfields. In particular, the advancement of video generation foundation models has led to growing demand for controllable video generation methods that can more accurately reflect user intent. Most existing foundation models are designed for text-to-video generation, where text prompts alone are often insufficient to express complex, multi-modal, and fine-grained user requirements. This limitation makes it challenging for users to generate videos with precise control using current models. To address this issue, recent research has explored the integration of additional non-textual conditions, such as camera motion, depth maps, and human pose, to extend pretrained video generation models and enable more controllable video synthesis. These approaches aim to enhance the flexibility and practical applicability of AIGC-driven video generation systems. In this survey, we provide a systematic review of controllable video generation, covering both theoretical foundations and recent advances in the field. We begin by introducing the key concepts and commonly used open-source video generation models. We then focus on control mechanisms in video diffusion models, analyzing how different types of conditions can be incorporated into the denoising process to guide generation. Finally, we categorize existing methods based on the types of control signals they leverage, including single-condition generation, multi-condition generation, and universal controllable generation. For a complete list of the literature on controllable video generation reviewed, please visit our curated repository at https://github.com/mayuelala/Awesome-Controllable-Video-Generation.

preprint, 2026, arXiv

Docs2Synth: A Synthetic Data Trained Retriever Framework for Scanned Visually Rich Documents Understanding

Visually rich document understanding (VRDU) in regulated domains is particularly challenging, since scanned documents often contain sensitive, evolving, and domain-specific knowledge. This leads to two major challenges: the lack of manual annotations for model adaptation and the difficulty for pretrained models to stay up-to-date with domain-specific facts. While Multimodal Large Language Models (MLLMs) show strong zero-shot abilities, they still suffer from hallucination and limited domain grounding. In contrast, discriminative Vision-Language Pre-trained Models (VLPMs) provide reliable grounding but require costly annotations to cover new domains. We introduce Docs2Synth, a synthetic-supervision framework that enables retrieval-guided inference for private and low-resource domains. Docs2Synth automatically processes raw document collections, generates and verifies diverse QA pairs via an agent-based system, and trains a lightweight visual retriever to extract domain-relevant evidence. During inference, the retriever collaborates with an MLLM through an iterative retrieval-generation loop, reducing hallucination and improving response consistency. We further deliver Docs2Synth as an easy-to-use Python package, enabling plug-and-play deployment across diverse real-world scenarios. Experiments on multiple VRDU benchmarks show that Docs2Synth substantially enhances grounding and domain generalization without requiring human annotations.

preprint, 2026, arXiv

Dose-LET Interactions Predict Capsular Contracture After Proton Postmastectomy Radiation Therapy

Pencil beam scanning (PBS) proton therapy provides highly conformal dose distributions that are increasingly leveraged for postmastectomy radiation therapy (PMRT) to reduce cardiopulmonary exposure. However, implant-based reconstruction in the setting of PMRT remains vulnerable to capsular contracture, and biological mechanisms of possible high linear energy transfer (LET) in PBS have not been well characterized. A retrospective case-control study was conducted on consecutive breast cancer patients who underwent mastectomy followed by implant-based reconstruction and proton PMRT (50 Gy in 25 fractions) between 2015 and 2021. Dose-LET volume histograms (DLVHs) were calculated for peri-implant tissue (5-mm shell around the implant). Generalized linear mixed-effects regression (GLMER) was employed to identify DLVH indices significantly associated with capsular contracture. Spearman correlation analysis was used to eliminate redundancy. Dose-LET volume constraints (DLVCs) were derived from receiver operating characteristic (ROC) analysis and validated using a support vector machine (SVM)-based normal tissue complication probability (NTCP) model. Eight capsular contracture patients and 16 matched control patients were analyzed. Three independent and significant DLVH indices were identified (p<0.01). The corresponding DLVCs were: V(55.8 Gy[RBE=1.1], 2.2 keV/μm) < 0.0033%, V(50.3 Gy[RBE=1.1], 5.4 keV/μm) < 0.0017%, and V(32.8 Gy[RBE=1.1], 0.9 keV/μm) > 96.98%. The SVM-based NTCP model achieved an area under the ROC curve (AUROC) of 0.867, with 91.7% accuracy, 87.5% sensitivity, and 93.8% specificity. Capsular contracture following proton PMRT is significantly associated with the synergistic interplay between dose and LETd in peri-implant tissue. The derived DLVCs provide actionable dosimetric constraints that can be integrated into treatment planning to minimize capsular contracture risk in proton PMRT.
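
As a rough consistency check on the reported figures, the sketch below recomputes accuracy, sensitivity, and specificity for a cohort of 8 cases and 16 controls. The confusion-matrix counts (`tp`, `tn`) are hypothetical: the abstract gives only the cohort sizes and the percentages, and these counts are simply one choice that is consistent with them.

```python
# Cohort sizes as stated in the abstract.
cases, controls = 8, 16

# Hypothetical counts (not given in the abstract), chosen to match the percentages.
tp = 7            # cases correctly flagged by the model
tn = 15           # controls correctly cleared by the model
fn = cases - tp   # missed cases
fp = controls - tn  # controls incorrectly flagged

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (cases + controls)

print(f"sensitivity={sensitivity:.3f} specificity={specificity:.4f} accuracy={accuracy:.3f}")
```

With a cohort this small, a single reclassified patient shifts sensitivity by 12.5 percentage points, which is worth keeping in mind when reading the percentages.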

preprint, 2026, arXiv

Enhancing Imbalanced Electrocardiogram Classification: A Novel Approach Integrating Data Augmentation through Wavelet Transform and Interclass Fusion

Imbalanced electrocardiogram (ECG) data hampers the efficacy and resilience of algorithms in the automated processing and interpretation of cardiovascular diagnostic information, which in turn impedes deep learning-based ECG classification. Notably, certain cardiac conditions that are infrequently encountered are disproportionately underrepresented in these datasets. Although algorithmic generation and oversampling of specific ECG signal types can mitigate class skew, there is a lack of consensus regarding the effectiveness of such techniques in ECG classification. Furthermore, the methodologies and scenarios of ECG acquisition introduce noise, further complicating the processing of ECG data. This paper presents a significantly enhanced ECG classifier that simultaneously addresses both class imbalance and noise-related challenges in ECG analysis, as observed in the CPSC 2018 dataset. Specifically, we propose the application of feature fusion based on the wavelet transform, with a focus on wavelet transform-based interclass fusion, to generate the training feature library and the test set feature library. Subsequently, the original training and test data are amalgamated with their respective feature databases, resulting in more balanced training and test datasets. Employing this approach, our ECG model achieves recognition accuracies of up to 99%, 98%, 97%, 98%, 96%, 92%, and 93% for Normal, AF, I-AVB, LBBB, RBBB, PAC, PVC, STD, and STE, respectively. Furthermore, the average recognition accuracy for these categories ranges between 92% and 98%. Notably, our proposed data fusion methodology surpasses any known algorithms in terms of ECG classification accuracy in the CPSC 2018 dataset.

preprint, 2026, arXiv

Fabry-Pérot Metacavities with Single-Layered Dielectric Metamirrors

The Fabry-Pérot resonator is a cornerstone of photonics and wave physics, providing a universal mechanism for spectral confinement and resonant enhancement of wave-matter interactions. In this work, we establish an analytically tractable class of Fabry-Pérot metacavities in which the reflecting elements are realized by single-layer periodic arrays of circular dielectric cylinders acting as metamirrors. Both the reflection efficiency and reflection phase of such metamirrors are obtained in closed form and shown to be widely and independently tunable, encompassing ideal electric and magnetic mirror limits with unit reflectivity. Building on these results, we derive explicit analytical expressions that fully describe the optical responses of Fabry-Pérot cavities composed of two such parallel metamirrors. Our combined analytical and numerical investigations reveal that these metamirrors provide exceptional flexibility for tailoring Fabry-Pérot resonances across a broad spectral range, enabling precise control over resonance positions and quality factors. In particular, the framework naturally predicts the emergence of Fabry-Pérot bound states in the continuum with formally infinite Q-factors. These results establish dielectric-metamirror-based Fabry-Pérot cavities as a versatile and fundamentally transparent platform for engineering high-Q optical resonances.

preprint, 2026, arXiv

ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.

preprint, 2026, arXiv

Implicitly Restarted Lanczos Enables Chemically-Accurate Shallow Neural Quantum States

The variational optimization of high-dimensional neural network models, such as those used in neural quantum states (NQS), presents a significant challenge in machine intelligence. Conventional first-order stochastic methods (e.g., Adam) are plagued by slow convergence, sensitivity to hyperparameters, and numerical instability, preventing NQS from reaching the high accuracy required for fundamental science. We address this fundamental optimization bottleneck by introducing the implicitly restarted Lanczos (IRL) method as the core engine for NQS training. Our key innovation is an inherently stable second-order optimization framework that recasts the ill-conditioned parameter update problem into a small, well-posed Hermitian eigenvalue problem. By solving this problem efficiently and robustly with IRL, our approach automatically determines the optimal descent direction and step size, circumventing the need for demanding hyperparameter tuning and eliminating the numerical instabilities common in standard iterative solvers. We demonstrate that IRL enables shallow NQS architectures (with orders of magnitude fewer parameters) to consistently achieve extreme precision (1e-12 kcal/mol) in just 3 to 5 optimization steps. For the F2 molecule, this translates to an approximate 17,900-fold speed-up in total runtime compared to Adam. This work establishes IRL as a superior, robust, and efficient second-order optimization strategy for variational quantum models, paving the way for the practical, high-fidelity application of neural networks in quantum physics and chemistry.

preprint, 2026, arXiv

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD), both aimed at compressing LLMs while preserving their advanced reasoning capabilities and linguistic diversity. We first examine key methodologies in KD, such as task-specific alignment, rationale-based training, and multi-teacher frameworks, alongside DD techniques that synthesize compact, high-impact datasets through optimization-based gradient matching, latent space regularization, and generative synthesis. Building on these foundations, we explore how integrating KD and DD can produce more effective and scalable compression strategies. Together, these approaches address persistent challenges in model scalability, architectural heterogeneity, and the preservation of emergent LLM abilities. We further highlight applications across domains such as healthcare and education, where distillation enables efficient deployment without sacrificing performance. Despite substantial progress, open challenges remain in preserving emergent reasoning and linguistic diversity, enabling efficient adaptation to continually evolving teacher models and datasets, and establishing comprehensive evaluation protocols. By synthesizing methodological innovations, theoretical foundations, and practical insights, our survey charts a path toward sustainable, resource-efficient LLMs through the tighter integration of KD and DD principles.

preprint, 2026, arXiv

KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.

preprint, 2026, arXiv

Lamperti Operators, Dilation Theory, and Applications in Noncommutative Ergodic Theory

In this paper, we develop a novel framework for quantitative mean ergodic theorems in the noncommutative setting, with a focus on actions of amenable groups and semigroups. We prove square function inequalities for ergodic averages arising from actions of groups of polynomial volume growth on a fixed noncommutative $L_p$-space for $1<p<\infty$. To achieve this, we establish two endpoint estimates for a noncommutative square function on non-homogeneous spaces. Our approach relies on semi-commutative non-homogeneous harmonic analysis, including the non-doubling Calderón-Zygmund arguments for non-smooth kernels and $\mathrm{BMO}$ space theory, operator-valued inequalities related to balls and cubes in groups equipped with non-doubling measures, and a noncommutative generalization of the classical transference method for amenable group actions. As an application, we establish a quantitative ergodic theorem for the ergodic averages associated with the positive power of modulus representation arising from a Lamperti representation on noncommutative $L_p$-spaces, extending some results in \cite{Templeman2015}. To obtain a quantitative ergodic theorem for semigroups of operators, we address the open question of extending the dilation theorem of Fackler-Glück from single operators to commuting tuples on Banach spaces, including noncommutative $L_p$-spaces. Indeed, our approach provides genuine joint $N$-dilations for commuting families, unifying and extending the classical dilation theorems of Sz.-Nagy-Foiaş and Akçoglu-Sucheston for a natural class of commuting tuples of contractions and extending the abstract dilation theorem of Fackler-Glück to commuting tuples of contractions. This enables us to obtain a quantitative ergodic theorem for a large class of semigroups of operators on $\mathbb{R}^d_{+}$.

preprint, 2026, arXiv

Pulse thermal imaging of FUHAO bronze artifact

The accurate identification of historical restoration traces and material degradation is essential for the scientific preservation of ancient bronzes. In this study, the prestigious FUHAO bronze artifact (late Shang period, 13th-11th century BCE) was non-destructively examined using pulsed thermal imaging (PT). By combining single- and double-layer heat conduction models with Thermal Tomography (TT), this approach allowed for precise spatial localization of repair crevices, patches, and filler materials, while also distinguishing restorative interventions from the original bronze substrate. The artifact was revealed to have been assembled from multiple fragments, exhibiting uneven surface corrosion and clear evidence of prior conservation. The results not only provide direct insights for conservation strategy and historical interpretation but also demonstrate the capability of pulsed thermal imaging as an effective diagnostic tool for the integrated surface and subsurface assessment of cultural heritage objects.

preprint, 2026, arXiv

Revealing Neutrino Mass Ordering at CEPC and FCC-ee

The neutrino mass ordering remains one of the most important open questions in neutrino physics. While upcoming oscillation experiments aim to resolve this problem at low energies, complementary approaches are highly desirable. In this Letter, we show that the neutrino mass ordering can be probed at high-energy colliders through the lepton-flavor structure of heavy neutral lepton (HNL) interactions. In the minimal Type-I seesaw scenario with two nearly degenerate HNLs, the heavy-light neutrino mixings are strongly correlated with the light-neutrino mass spectrum, leading to distinct flavor patterns for the normal and inverted hierarchies. We demonstrate that future $Z$ factories, such as CEPC and FCC-ee, can probe the neutrino mass ordering for total HNL mixings as small as $U_{\rm tot}^2 \gtrsim 4 \times 10^{-9}$, and discriminate between the two hierarchies for $U_{\rm tot}^2 \gtrsim 10^{-6}$. Our results establish collider searches for HNLs as a powerful and complementary probe of the neutrino mass ordering.

preprint, 2026, arXiv

Sparse Convex Biclustering

Biclustering is an essential unsupervised machine learning technique for simultaneously clustering rows and columns of a data matrix, with widespread applications in genomics, transcriptomics, and other high-dimensional omics data. Despite its importance, existing biclustering methods struggle to meet the demands of modern large-scale datasets. The challenges stem from the accumulation of noise in high-dimensional features, the limitations of non-convex optimization formulations, and the computational complexity of identifying meaningful biclusters. These issues often result in reduced accuracy and stability as the size of the dataset increases. To overcome these challenges, we propose Sparse Convex Biclustering (SpaCoBi), a novel method that penalizes noise during the biclustering process to improve both accuracy and robustness. By adopting a convex optimization framework and introducing a stability-based tuning criterion, SpaCoBi achieves an optimal balance between cluster fidelity and sparsity. Comprehensive numerical studies, including simulations and an application to mouse olfactory bulb data, demonstrate that SpaCoBi significantly outperforms state-of-the-art methods in accuracy. These results highlight SpaCoBi as a robust and efficient solution for biclustering in high-dimensional and large-scale datasets.

preprint, 2026, arXiv

Spectral Complex Autoencoder Pruning: A Fidelity-Guided Criterion for Extreme Structured Channel Compression

We propose Spectral Complex Autoencoder Pruning (SCAP), a reconstruction-based criterion that measures functional redundancy at the level of individual output channels. For each convolutional layer, we construct a complex interaction field by pairing the full multi-channel input activation as the real part with a single output-channel activation (spatially aligned and broadcast across input channels) as the imaginary part. We transform this complex field to the frequency domain and train a low-capacity autoencoder to reconstruct normalized spectra. Channels whose spectra are reconstructed with high fidelity are interpreted as lying close to a low-dimensional manifold captured by the autoencoder and are therefore more compressible; conversely, channels with low fidelity are retained as they encode information that cannot be compactly represented by the learned manifold. This yields an importance score (optionally fused with the filter L1 norm) that supports simple threshold-based pruning and produces a structurally consistent pruned network. On VGG16 trained on CIFAR-10, at a fixed threshold of 0.6, we obtain 90.11% FLOP reduction and 96.30% parameter reduction with an absolute Top-1 accuracy drop of 1.67% from a 93.44% baseline after fine-tuning, demonstrating that spectral reconstruction fidelity of complex interaction fields is an effective proxy for channel-level redundancy under aggressive compression.
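
The construction of the complex interaction field described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the output-channel activation is assumed to already share the input's spatial size, the frequency transform is assumed to be a 2D FFT over the spatial axes, and the normalization (magnitude divided by its maximum) is a guess, since the abstract does not specify any of these. The autoencoder that scores reconstruction fidelity is omitted.

```python
import numpy as np

def complex_interaction_spectrum(x_in: np.ndarray, y_out: np.ndarray) -> np.ndarray:
    """Build a normalized spectrum for one output channel (hypothetical helper).

    x_in:  input activation of shape (C_in, H, W), used as the real part.
    y_out: one output-channel activation of shape (H, W), assumed to be
           spatially aligned with x_in; used as the imaginary part.
    """
    # Broadcast the single output channel across all input channels.
    imag = np.broadcast_to(y_out, x_in.shape)
    field = x_in + 1j * imag
    # Assumed transform: 2D FFT over the spatial axes of each channel.
    spectrum = np.fft.fft2(field, axes=(-2, -1))
    magnitude = np.abs(spectrum)
    # Assumed normalization: scale magnitudes into [0, 1).
    return magnitude / (magnitude.max() + 1e-12)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))   # toy input activation
y = rng.standard_normal((8, 8))      # toy output-channel activation
spec = complex_interaction_spectrum(x, y)
print(spec.shape)
```

In the method as described, one such spectrum would be produced per output channel, and channels whose spectra the autoencoder reconstructs with high fidelity would be the pruning candidates.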

preprint, 2026, arXiv

Think-J: Learning to Think for Generative LLM-as-a-Judge

LLM-as-a-Judge refers to the automatic modeling of preferences for responses generated by Large Language Models (LLMs), which is of significant importance for both LLM evaluation and reward modeling. Although generative LLMs have made substantial progress in various tasks, their performance as LLM-Judge still falls short of expectations. In this work, we propose Think-J, which improves generative LLM-as-a-Judge by learning how to think. We first use a small amount of curated data to equip the model with initial judgment thinking capabilities. Subsequently, we optimize the judgment thinking traces based on reinforcement learning (RL). We propose two methods for judgment thinking optimization, based on offline and online RL, respectively. The offline method requires training a critic model to construct positive and negative examples for learning. The online method defines a rule-based reward as feedback for optimization. Experimental results show that our approach can significantly enhance the evaluation capability of generative LLM-Judge, surpassing both generative and classifier-based LLM-Judge without requiring extra human annotations.

preprint, 2026, arXiv

Time Travel Engine: A Shared Latent Chronological Manifold Enables Historical Navigation in Large Language Models

Time functions as a fundamental dimension of human cognition, yet the mechanisms by which Large Language Models (LLMs) encode chronological progression remain opaque. We demonstrate that temporal information in their latent space is organized not as discrete clusters but as a continuous, traversable geometry. We introduce the Time Travel Engine (TTE), an interpretability-driven framework that projects diachronic linguistic patterns onto a shared chronological manifold. Unlike surface-level prompting, TTE directly modulates latent representations to induce coherent stylistic, lexical, and conceptual shifts aligned with target eras. By parameterizing diachronic evolution as a continuous manifold within the residual stream, TTE enables fluid navigation through period-specific "zeitgeists" while restricting access to future knowledge. Furthermore, experiments across diverse architectures reveal topological isomorphism between the temporal subspaces of Chinese and English, indicating that distinct languages share a universal geometric logic of historical evolution. These findings bridge historical linguistics with mechanistic interpretability, offering a novel paradigm for controlling temporal reasoning in neural networks.

preprint, 2025, arXiv

AZeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning

Extending large language models (LLMs) to the speech domain has recently gained significant attention. A typical approach connects a pretrained LLM with an audio encoder through a projection module and trains the resulting model on large-scale, task-specific instruction-tuning datasets. However, curating such instruction-tuning data for specific requirements is time-consuming, and models trained in this manner often generalize poorly to unseen tasks. In this work, we first formulate that the strongest generalization of a speech-LLM is achieved when it is trained with Self-Generated Instruction-Free Tuning (SIFT), in which supervision signals are generated by a frozen LLM using textual representations of speech as input. Our proposed SIFT paradigm eliminates the need for collecting task-specific question-answer pairs and yields the theoretically best generalization to unseen tasks. Building upon this paradigm, we introduce AZeroS (Auden Zero-instruction-tuned Speech-LLM), which is trained on speech-text pairs derived from publicly available corpora, including approximately 25,000 hours of speech with ASR transcripts and 3,000 hours of speech with paralinguistic labels. Built upon Qwen2.5-7B-Instruct, the model updates only two lightweight projection modules (23.8 million parameters each), while keeping both the LLM and audio encoders frozen. Despite the minimal training cost and modest data scale, AZeroS achieves state-of-the-art performance on both semantic and paralinguistic benchmarks, including VoiceBench, AIR-Bench Foundation (Speech), and AIR-Bench Chat (Speech).

preprint, 2021, arXiv

Quantitative ergodic theorems for actions of groups of polynomial growth

We strengthen the maximal ergodic theorem for actions of groups of polynomial growth to a form involving the jump quantity, which is the sharpest result among the family of variational or maximal ergodic theorems. As a consequence, we deduce in this setting the quantitative ergodic theorem, in particular, the upcrossing inequalities with exponential decay. The ideas and techniques involve probability theory, non-doubling Calderón-Zygmund theory, an almost orthogonality argument, and some delicate geometric arguments involving the balls and the cubes on the group equipped with a not necessarily doubling measure.