Source author record

Yizhou Lu

Yizhou Lu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


gr-qc, eess.AS, Sound, astro-ph.CO, hep-th, Artificial Intelligence, Computation and Language, Machine Learning

Catalog footprint

What is connected

11 works

8 topics

4 close collaborators

preprint, 2026, arXiv

Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20 to 30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, achieving substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.

preprint, 2022, arXiv

Effective reflected entropy and entanglement negativity for general 2D eternal black holes

Both reflected entropy and entanglement negativity provide valid measures of entanglement between subsystems of a mixed state. For general 2D eternal black holes coupled with CFT matter in the large $c$ limit, we perform the replica-trick computation and find that both the effective Rényi reflected entropy and the effective entanglement negativity can be expressed in terms of a combination of modified backreacting cosmic branes in the $\mathrm{AdS}_3$ bulk. We then develop a holographic scheme to calculate the effective reflected entropy and entanglement negativity for general 2D eternal black holes coupled with CFT matter in the large $c$ limit. Using the scheme, we check the consistency condition of the island formulae for entanglement negativity and reflected entropy. We find that the combinations of modified backreacting cosmic branes in the $\mathrm{AdS}_3$ bulk from the two island proposals for entanglement negativity exactly match each other. Finally, we study the saturation of the reflected entropy inequality.

preprint, 2022, arXiv

Islands in Kaluza-Klein black holes

The newly proposed island formula for the entanglement entropy of Hawking radiation is applied to spherically symmetric 4-dimensional eternal Kaluza-Klein (KK) black holes. The "charge" $Q$ of a KK black hole quantifies its deviation from a Schwarzschild black hole. The impact of $Q$ on the island is studied at late times. The late-time island, whose boundary is located outside but within a Planckian distance of the horizon, is slightly extended by $Q$. While the no-island entropy grows linearly, the late-time entanglement entropy is given by the island configuration with twice the Bekenstein-Hawking entropy. Thus we reproduce the Page curve for the eternal KK black holes. Compared with the Schwarzschild results, the Page time is delayed by a factor $(1+Q/r_h)$ and the scrambling time is prolonged by a factor $(1+Q/r_h)^{1/2}$. Moreover, the higher-dimensional generalization is presented. As a caveat, Planck-length scales are involved, at which a semi-classical description may break down.
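
The two rescaling factors quoted in this abstract can be evaluated directly. The snippet below is only an illustrative sketch: the function name `kk_time_factors` and the sample value of $Q/r_h$ are hypothetical, and nothing here is taken from the paper beyond the two factors $(1+Q/r_h)$ and $(1+Q/r_h)^{1/2}$.

```python
import math


def kk_time_factors(q_over_rh):
    """Return (page_factor, scrambling_factor) relative to Schwarzschild.

    q_over_rh is the dimensionless ratio Q / r_h (hypothetical input).
    The factors are the ones quoted in the abstract:
      Page time        -> multiplied by (1 + Q/r_h)
      scrambling time  -> multiplied by (1 + Q/r_h) ** 0.5
    """
    page_factor = 1.0 + q_over_rh
    scrambling_factor = math.sqrt(1.0 + q_over_rh)
    return page_factor, scrambling_factor


# Illustrative value only: Q/r_h = 0.21 is not a number from the paper.
page, scrambling = kk_time_factors(0.21)
print(f"Page time factor: {page:.3f}, scrambling time factor: {scrambling:.3f}")
```

Setting $Q = 0$ gives both factors equal to 1, which matches the statement that $Q$ measures the deviation from the Schwarzschild case.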

preprint, 2022, arXiv

Language Adaptive Cross-lingual Speech Representation Learning with Sparse Sharing Sub-networks

Unsupervised cross-lingual speech representation learning (XLSR) has recently shown promising results in speech recognition by leveraging vast amounts of unlabeled data across multiple languages. However, the standard XLSR model suffers from a language interference problem due to its lack of language-specific modeling ability. In this work, we investigate language adaptive training on XLSR models. More importantly, we propose a novel language adaptive pre-training approach based on sparse sharing sub-networks. It makes room for language-specific modeling by pruning out unimportant parameters for each language, without requiring any manually designed language-specific component. After pruning, each language only maintains a sparse sub-network, while the sub-networks are partially shared with each other. Experimental results on a downstream multilingual speech recognition task show that our proposed method significantly outperforms baseline XLSR models on both high-resource and low-resource languages. Besides, our proposed method consistently outperforms other adaptation methods and requires fewer parameters.
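
The idea of per-language sparse sub-networks that partially overlap can be illustrated with binary masks. This is a toy sketch, not the paper's method: the abstract does not say how "unimportant" parameters are identified, so magnitude-based pruning is an assumption here, and the function names, weights and keep ratio are all hypothetical.

```python
def prune_mask(weights, keep_ratio):
    """Binary mask keeping the largest-magnitude fraction of weights.

    ASSUMPTION: importance is measured by absolute value; the abstract
    does not specify the pruning criterion.
    """
    k = max(1, int(round(len(weights) * keep_ratio)))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(weights))]


def shared_fraction(mask_a, mask_b):
    """Fraction of parameters kept by both masks, out of those kept by either."""
    both = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    either = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    return both / either if either else 0.0


# Hypothetical weights of the same six parameters after adapting to two languages.
weights_lang_a = [0.9, -0.1, 0.5, 0.05, -0.7, 0.2]
weights_lang_b = [0.1, 0.8, -0.6, 0.02, -0.9, 0.3]

mask_a = prune_mask(weights_lang_a, keep_ratio=0.5)
mask_b = prune_mask(weights_lang_b, keep_ratio=0.5)
print(mask_a, mask_b, shared_fraction(mask_a, mask_b))
```

In this toy case each language keeps three of six parameters, two of which are common to both masks, so the sub-networks are partially shared rather than identical or disjoint.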

preprint, 2022, arXiv

Layer-wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition

Accent variability has posed a huge challenge to automatic speech recognition (ASR) modeling. Although one-hot accent vector based adaptation systems are commonly used, they require prior knowledge about the target accent and cannot handle unseen accents. Furthermore, simply concatenating accent embeddings does not make good use of accent knowledge, which has limited improvements. In this work, we aim to tackle these problems with a novel layer-wise adaptation structure injected into the E2E ASR model encoder. The adapter layer encodes an arbitrary accent in the accent space and assists the ASR model in recognizing accented speech. Given an utterance, the adaptation structure extracts the corresponding accent information and transforms the input acoustic feature into an accent-related feature through the linear combination of all accent bases. We further explore the injection position of the adaptation layer, the number of accent bases, and different types of accent bases to achieve better accent adaptation. Experimental results show that the proposed adaptation structure brings 12% and 10% relative word error rate (WER) reduction on the AESRC2020 accent dataset and the Librispeech dataset, respectively, compared to the baseline.
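
For readers unfamiliar with the metric, a relative WER reduction is measured against the baseline's own error rate, not in absolute percentage points. The sketch below shows the arithmetic; the function name and the WER values are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, system_wer):
    """Relative WER reduction: (baseline - system) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical example: a baseline at 10.0% WER and a system at 8.8% WER
# differ by 1.2 absolute points, which is a 12% relative reduction.
reduction = relative_wer_reduction(10.0, 8.8)
print(f"{reduction:.0%}")
```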

preprint, 2021, arXiv

Gauge transformation of scalar induced tensor perturbation during matter domination

We study the scalar induced tensor perturbations at second order during matter domination in seven different gauges. Considering the obtained solution from the Newtonian gauge, we use the gauge transformation law of the scalar induced tensor perturbation to derive the solution in six other gauges. After identifying and eliminating the residual gauge modes in the synchronous and comoving orthogonal gauges, we obtain the same analytical results of the kernel function $I_χ$ for these two gauges as those obtained from the gauge transformation. For the scalar induced gravitational waves oscillating as $\sin x$ and $\cos x$, we find that $ρ_{\text{GW}}\propto a^{-4}$, and $Ω_{\text{GW}}\propto 1/a$ in the matter dominated era, so the oscillating gravitational waves behave as radiation.
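
The step from $ρ_{\text{GW}}\propto a^{-4}$ to $Ω_{\text{GW}}\propto 1/a$ can be made explicit. The sketch below assumes the textbook background scaling $ρ_{\text{m}}\propto a^{-3}$ for the matter dominated era, which is not stated in the abstract itself:

```latex
\rho_{\text{GW}} \propto a^{-4}, \qquad
\rho_{\text{crit}} \simeq \rho_{\text{m}} \propto a^{-3}
\quad\Longrightarrow\quad
\Omega_{\text{GW}} = \frac{\rho_{\text{GW}}}{\rho_{\text{crit}}}
\propto \frac{a^{-4}}{a^{-3}} = \frac{1}{a}
```

The $a^{-4}$ dilution is the same as that of radiation, which is the sense in which the oscillating gravitational waves "behave as radiation".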

preprint, 2021, arXiv

The Accented English Speech Recognition Challenge 2020: Open Datasets, Tracks, Baselines, Results and Methods

The variety of accents has posed a big challenge to speech recognition. The Accented English Speech Recognition Challenge (AESRC2020) is designed for providing a common testbed and promoting accent-related research. Two tracks are set in the challenge: English accent recognition (track 1) and accented English speech recognition (track 2). A set of 160 hours of accented English speech collected from 8 countries is released with labels as the training set. Another 20 hours of speech without labels is later released as the test set, including two unseen accents from another two countries used to test the model generalization ability in track 2. We also provide baseline systems for the participants. This paper first reviews the released dataset, track setups, baselines and then summarizes the challenge results and major techniques used in the submissions.

preprint, 2020, arXiv

$α$-attractor from superconformal E-models in brane inflation

In large extra dimensional braneworld inflation, the Friedmann equation is modified to include, in addition to the usual linear term, a term quadratic in the energy density with an additional parameter $λ$ called the brane tension. The high energy brane corrections modify the slow-roll parameters and affect the behaviour of inflation. We analyse the superconformal inflation for E-models and find that there exist $α$-attractors in brane inflation. The predictions for the scalar spectral index $n_s$ and the tensor-to-scalar ratio $r$ are computed numerically, and approximate analytic formulas in the high energy limit are given for the observables $n_s$ and $r$. The constraints on the model parameters are obtained by using Planck 2018 and BICEP2 observational data.
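
The modification described in this abstract is commonly written in the Randall-Sundrum-type form below. The abstract does not quote the equation, so the normalization and the reduced Planck mass $M_{\text{pl}}$ are assumptions made for illustration:

```latex
H^2 = \frac{\rho}{3 M_{\text{pl}}^2}\left(1 + \frac{\rho}{2\lambda}\right)
\quad\xrightarrow{\ \rho \gg \lambda\ }\quad
H^2 \simeq \frac{\rho^2}{6 \lambda M_{\text{pl}}^2}
```

In the opposite limit $\rho \ll \lambda$ the quadratic term is negligible and the usual linear Friedmann equation is recovered.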

preprint, 2020, arXiv

Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model

End-to-end (E2E) systems have played an increasingly important role in automatic speech recognition (ASR) and achieved great performance. However, E2E systems recognize output word sequences directly from the input acoustic features, so they can only be trained on limited acoustic data. Extra text data is widely used to improve the results of traditional artificial neural network-hidden Markov model (ANN-HMM) hybrid systems. Incorporating extra text data into standard E2E ASR systems may break the E2E property during decoding. In this paper, a novel modular E2E ASR system is proposed. The modular E2E ASR system consists of two parts: an acoustic-to-phoneme (A2P) model and a phoneme-to-word (P2W) model. The A2P model is trained on acoustic data, while extra data including large scale text data can be used to train the P2W model. This additional data enables the modular E2E ASR system to model not only the acoustic part but also the language part. During the decoding phase, the two models will be integrated and act as a standard acoustic-to-word (A2W) model. In other words, the proposed modular E2E ASR system can be easily trained with extra text data and decoded in the same way as a standard E2E ASR system. Experimental results on the Switchboard corpus show that the modular E2E model achieves better word error rate (WER) than standard A2W models.

preprint, 2020, arXiv

On the waveform of the scalar induced gravitational waves

Scalar induced gravitational waves (SIGWs) are a useful tool to probe the physics of the early universe. To study inflationary models with this tool, we need to know how the waveform of SIGWs is related to the shape of the scalar power spectrum. We propose two parameterizations to approximate the scalar power spectrum with either a sharp or a broad spike at small scales, and then use these two parameterizations to study the relation between the shapes of $Ω_{GW}$ and the scalar power spectrum. We find that the waveform of SIGWs has a similar shape to the power spectrum. Away from the peak of the spike, the frequency relation $Ω_{GW}(k)\sim \mathcal{P}_ζ^2(k)$ holds independent of the functional form of the scalar power spectrum. We also give a physical explanation for this general relationship. The general relation is useful for determining the scalar power spectrum and probing inflationary physics with the waveform of SIGWs.

preprint, 2020, arXiv

Primordial black holes and secondary gravitational waves from k/G inflation

The possibility that in the mass range around $10^{-12}\ M_\odot$ most of dark matter consists of primordial black holes (PBHs) is a very interesting topic. To produce PBHs with this mass, the primordial scalar power spectrum needs to be enhanced to the order of 0.01 at the scale $k\sim 10^{12}\ \text{Mpc}^{-1}$. The enhanced power spectrum also produces large secondary gravitational waves in the mHz band. A phenomenological delta function power spectrum is usually used to discuss the production of PBHs and secondary gravitational waves. Based on G and k inflation, we propose a new mechanism to enhance the power spectrum at small scales by introducing a non-canonical kinetic term $[1-2G(ϕ)]X$ with the function $G(ϕ)$ having a peak. Away from the peak, $G(ϕ)$ is negligible and we recover the usual slow-roll inflation, which is constrained by the cosmic microwave background anisotropy observations. Around the peak, the slow-roll inflation transiently turns into ultra slow-roll inflation. The enhancement of the power spectrum can be obtained with generic potentials, and there is no need to fine tune the parameters in $G(ϕ)$. The energy spectrum $Ω_{GW}(f)$ of secondary gravitational waves has the characteristic power law behaviour $Ω_{GW}(f)\sim f^{n}$ and is testable by pulsar timing arrays and space based gravitational wave detectors.
