Source author record

Zhifu Gao

Zhifu Gao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Sound, eess.AS, astro-ph.HE, Computation and Language, Multimedia

Catalog footprint

What is connected

4 works

5 topics

4 close collaborators

preprint, 2026, arXiv

Does the radio-active phase of XTE~J1810$-$197 recur following the same evolutionary pattern?

Magnetars are the most strongly magnetized compact objects known in the Universe and are regarded as one of the primary engines powering a variety of enigmatic, high-energy transients. However, our understanding of magnetars remains highly limited, constrained by observational sample size and radiative variability. XTE~J1810$-$197, which re-entered a radio-active phase in 2018, is one of only six known radio-pulsating magnetars. Leveraging the distinctive capability for simultaneous dual-frequency observations, we utilized the Shanghai Tianma Radio Telescope (TMRT) to monitor this magnetar continuously at both 2.25 and 8.60~GHz, capturing its entire evolution from radio activation to quenching. This enabled precise characterization of the evolution in its integrated profile, spin frequency, flux density, and spectral index ($\alpha$, defined by $S \propto f^{\alpha}$). The first time derivative of its spin frequency, $\dot{\nu}$, passed through four distinct phases -- rapid decrease, violent oscillation, steady decline, and stable recovery -- before returning to its pre-outburst value concomitant with the cessation of radio emission. Remarkably, both the amplitudes and the characteristic time-scales of these $\dot{\nu}$ variations match those observed during the previous outburst that began in 2003, providing the first demonstration that post-outburst rotational evolution and radiative behavior in a magnetar are repeatable. A twisted-magnetosphere model can qualitatively account for this repeatability as well as for the progressive narrowing and abrupt disappearance of the radio pulse radiation, thereby receiving strong observational support.
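The spectral index defined in the abstract, $S \propto f^{\alpha}$, can be estimated from one pair of simultaneous flux densities as $\alpha = \ln(S_2/S_1) / \ln(f_2/f_1)$. The following is a minimal sketch of that two-point estimate, not the authors' analysis pipeline. The function name and the flux density values are hypothetical; only the two observing frequencies, 2.25 and 8.60 GHz, come from the abstract.

```python
import math


def spectral_index(s_low, s_high, f_low=2.25, f_high=8.60):
    """Two-point spectral index alpha for a power law S ∝ f**alpha.

    s_low, s_high: flux densities at f_low and f_high (any common unit).
    f_low, f_high: observing frequencies in GHz (defaults from the abstract).
    """
    if min(s_low, s_high, f_low, f_high) <= 0:
        raise ValueError("flux densities and frequencies must be positive")
    if f_low == f_high:
        raise ValueError("the two frequencies must differ")
    return math.log(s_high / s_low) / math.log(f_high / f_low)


# Hypothetical flux densities in mJy, chosen for illustration only.
alpha = spectral_index(s_low=10.0, s_high=5.0)
print(f"alpha = {alpha:.3f}")
```

A negative value means the source is fainter at the higher frequency. With only two frequencies the estimate has no redundancy, so measurement errors on either flux density propagate directly into $\alpha$.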

preprint, 2026, arXiv

SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing

The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for speech, audio, and music. This situation hinders the development of audio-language models and forces researchers to spend considerable effort on writing code and tuning hyperparameters. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes, along with high-performance checkpoints, for mainstream tasks such as LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and some of the relevant techniques have also been accepted in academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to LLM-based speech, audio, and music processing.

preprint, 2020, arXiv

SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition

End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as superior. One example is the Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by the Transformer is the use of self-attention instead of recurrent mechanisms, enabling both encoder and decoder to capture long-range dependencies with lower computational complexity. In this work, we propose boosting the self-attention ability with a DFSMN memory block, forming the proposed memory equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons have been made to demonstrate the relevancy and complementarity between self-attention and the DFSMN memory block. Furthermore, the proposed SAN-M provides an efficient mechanism to integrate these two modules. We have evaluated our approach on the public AISHELL-1 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. On both tasks, SAN-M systems achieved much better performance than the self-attention based Transformer baseline system. Specifically, it can achieve a CER of 6.46% on the AISHELL-1 task even without using any external LM, comfortably outperforming other state-of-the-art systems.
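The character error rate (CER) quoted above is conventionally the character-level edit distance between hypothesis and reference, summed over all utterances and divided by the total number of reference characters. The sketch below illustrates that standard definition; it is not the scoring script used in the paper, which may normalize text differently. The function names and the sample sentence pair are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character errors / total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    if total == 0:
        raise ValueError("references contain no characters")
    return errors / total


# Illustrative pair: one substituted character out of six.
score = cer(["今天天气很好"], ["今天天汽很好"])
print(f"CER = {score:.2%}")
```

Because Mandarin text has no word delimiters, scoring at the character level avoids depending on a word segmenter, which is why AISHELL-1 results are reported as CER rather than word error rate.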

preprint, 2020, arXiv

Streaming Chunk-Aware Multihead Attention for Online End-to-End Speech Recognition

Recently, streaming end-to-end automatic speech recognition (E2E-ASR) has gained more and more attention. Much effort has been devoted to turning non-streaming attention-based E2E-ASR systems into streaming architectures. In this work, we propose a novel online E2E-ASR system using Streaming Chunk-Aware Multihead Attention (SCAMA) and a latency control memory equipped self-attention network (LC-SAN-M). LC-SAN-M uses chunk-level input to control the latency of the encoder. As for SCAMA, a jointly trained predictor is used to control the encoder output fed to the decoder, which enables the decoder to generate output in a streaming manner. Experimental results on the open 170-hour AISHELL-1 task and an industrial-level 20,000-hour Mandarin speech recognition task show that our approach can significantly outperform the MoChA-based baseline system under a comparable setup. On the AISHELL-1 task, our proposed method achieves a character error rate (CER) of 7.39%, which, to the best of our knowledge, is the best published performance for online ASR.
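The abstract says that LC-SAN-M uses chunk-level input to control encoder latency. One generic way to express chunk-level processing is an attention mask in which each frame sees only its own chunk plus a fixed number of preceding chunks. The sketch below shows that general pattern under stated assumptions; it is not the paper's LC-SAN-M or SCAMA implementation, and `chunk_size` and `left_chunks` are hypothetical parameters.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks=1):
    """Return mask[q][k] == True if query frame q may attend to key frame k.

    Frames are grouped into consecutive chunks of `chunk_size`. A query may
    attend to every frame in its own chunk and in the `left_chunks` chunks
    before it, but never to later chunks, so latency is bounded by one chunk.
    """
    if chunk_size <= 0 or left_chunks < 0:
        raise ValueError("chunk_size must be positive and left_chunks non-negative")
    chunk_of = [t // chunk_size for t in range(num_frames)]
    return [
        [chunk_of[q] - left_chunks <= chunk_of[k] <= chunk_of[q]
         for k in range(num_frames)]
        for q in range(num_frames)
    ]


# Six frames in chunks of two, with one chunk of left context.
mask = chunk_attention_mask(num_frames=6, chunk_size=2, left_chunks=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

A larger `chunk_size` gives each frame more within-chunk future context at the cost of higher latency, which is the trade-off a chunk-based streaming encoder has to tune.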
