Source author record

Bowen Shi

Bowen Shi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

eess.AS, Sound, Computer Vision, cond-mat.str-el, hep-th, Machine Learning, quant-ph, Computation and Language, Artificial Intelligence, Software Engineering, astro-ph.CO, cs.CY, hep-ph, math.QA, Social and Information Networks

Catalog footprint

What is connected

19 works

15 topics

4 close collaborators


Preprint, 2026, arXiv

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional flexibility. To address these bottlenecks, we propose JASTIN, a generalizable, instruction-driven audio evaluation framework that formulates audio assessment as a self-instructed reasoning task. JASTIN bridges a frozen high-performance audio encoder with a fine-tuned LLM backbone via a trainable audio adapter. To ensure robust zero-shot generalization, we introduce a comprehensive instruction-following data preparation pipeline, incorporating Multi-Source, Multi-Task, Multi-Calibration, and Multi-Description data. Experimental results demonstrate that JASTIN achieves state-of-the-art Pearson and Spearman correlations with human subjective ratings. It consistently outperforms general MLLMs across speech, sound, music, and out-of-domain evaluation tasks without the need for task-specific retraining.

Preprint, 2026, arXiv

Mosaic: Unlocking Long-Context Inference for Diffusion LLMs via Global Memory Planning and Dynamic Peak Taming

Diffusion-based large language models (dLLMs) have emerged as a promising paradigm, utilizing simultaneous denoising to enable global planning and iterative refinement. While these capabilities are particularly advantageous for long-context generation, deploying such models faces a prohibitive memory capacity barrier stemming from severe system inefficiencies. We identify that existing inference systems are ill-suited for this paradigm: unlike autoregressive models constrained by the cumulative KV-cache, dLLMs are bottlenecked by transient activations recomputed at every step. Furthermore, general-purpose memory reuse mechanisms lack the global visibility to adapt to dLLMs' dynamic memory peaks, which toggle between logits and FFNs. To address these mismatches, we propose Mosaic, a memory-efficient inference system that shifts from local, static management to a global, dynamic paradigm. Mosaic integrates a mask-only logits kernel to eliminate redundancy, a lazy chunking optimizer driven by an online heuristic search to adaptively mitigate dynamic peaks, and a global memory manager to resolve fragmentation via virtual addressing. Extensive evaluations demonstrate that Mosaic achieves an average 2.71$\times$ reduction in the memory peak-to-average ratio and increases the maximum inference sequence length supportable on identical hardware by 15.89-32.98$\times$. This scalability comes without compromising accuracy or speed; latency is in fact reduced by 4.12%-23.26%.

Preprint, 2023, arXiv

Visual Story Generation Based on Emotion and Keywords

Automated visual story generation aims to produce stories with corresponding illustrations that exhibit coherence, progression, and adherence to characters' emotional development. This work proposes a story generation pipeline for co-creating visual stories with users. The pipeline allows the user to control events and emotions in the generated content. It includes two parts: narrative generation and image generation. For narrative generation, the system generates the next sentence using user-specified keywords and emotion labels. For image generation, diffusion models are used to create a visually appealing image corresponding to each generated sentence. Further, object recognition is applied to the generated images so that objects in these images can be mentioned in future story development.

Preprint, 2022, arXiv

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert

Preprint, 2022, arXiv

Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs. Our study focuses on the Audio-Visual Hidden Unit BERT (AV-HuBERT) approach, a recently developed general-purpose audio-visual speech pre-training framework. We conducted extensive experiments probing the effectiveness of pre-training and the visual modality. Experimental results suggest that AV-HuBERT generalizes decently to speaker-related downstream tasks, improving label efficiency roughly tenfold for both audio-only and audio-visual speaker verification. We also show that incorporating visual information, even just the lip area, greatly improves performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.

Preprint, 2022, arXiv

Robust Self-Supervised Audio-Visual Speech Recognition

Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with the visual information that is invariant to noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the progress was hindered by the amount of labeled data available. In this work, we present a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model. On the largest available AVSR benchmark dataset LRS3, our approach outperforms prior state-of-the-art by ~50% (28.0% vs. 14.1%) using less than 10% of labeled data (433hr vs. 30hr) in the presence of babble noise, while reducing the WER of an audio-based model by over 75% (25.8% vs. 5.8%) on average.
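The relative improvements quoted in this abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `relative_reduction` is an assumption made here, and the only inputs are the WER pairs stated in the abstract.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# WER pairs (in percent) as quoted in the abstract.
babble = relative_reduction(28.0, 14.1)      # prior state of the art vs. this approach
audio_only = relative_reduction(25.8, 5.8)   # audio-based model vs. audio-visual model

print(f"babble noise: {babble:.1%}, audio-only baseline: {audio_only:.1%}")
```

Under these inputs the two reductions come out at roughly 49.6% and 77.5%, consistent with the "~50%" and "over 75%" figures in the text.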

Preprint, 2022, arXiv

Searching for fingerspelled content in American Sign Language

Natural language processing for sign language video - including tasks like recognition, translation, and search - is crucial for making artificial intelligence technologies accessible to deaf individuals, and has been gaining research interest in recent years. In this paper, we address the problem of searching for fingerspelled keywords or key phrases in raw sign language videos. This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before. We propose an end-to-end model for this task, FSS-Net, that jointly detects fingerspelling and matches it to a text sequence. Our experiments, done on a large public dataset of ASL fingerspelling in the wild, show the importance of fingerspelling detection as a component of a search and retrieval model. Our model significantly outperforms baseline methods adapted from prior work on related tasks.

Preprint, 2021, arXiv

Chiral central charge from a single bulk wave function

A $(2+1)$-dimensional gapped quantum many-body system can have a topologically protected energy current at its edge. The magnitude of this current is determined entirely by the temperature and the chiral central charge, a quantity associated with the effective field theory of the edge. We derive a formula for the chiral central charge that, akin to the topological entanglement entropy, is completely determined by the many-body ground state wave function in the bulk. According to our formula, nonzero chiral central charge gives rise to a topological obstruction that prevents the ground state wave function from being real-valued in any local product basis.

Preprint, 2021, arXiv

Modular commutator in gapped quantum many-body systems

In arXiv:2110.06932, we argued that the chiral central charge -- a topologically protected quantity characterizing the edge theory of a gapped (2+1)-dimensional system -- can be extracted from the bulk by using an order parameter called the modular commutator. In this paper, we reveal general properties of the modular commutator and strengthen its relationship with the chiral central charge. First, we identify connections between the modular commutator and conditional mutual information, time reversal, and modular flow. Second, we prove, within the framework of the entanglement bootstrap program, that two topologically ordered media connected by a gapped domain wall must have the same modular commutator in their respective bulk. Third, we numerically calculate the value of the modular commutator for a bosonic lattice Laughlin state for finite sizes and extrapolate to the infinite-volume limit. The result of this extrapolation is consistent with the proposed formula up to an error of about 0.7%.

Preprint, 2020, arXiv

A Cross-Task Analysis of Text Span Representations

Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution. While extensive research has focused on functional architectures for representing words and sentences, there is less work on representing arbitrary spans of text within sentences. In this paper, we conduct a comprehensive empirical evaluation of six span representation methods using eight pretrained language representation models across six tasks, including two tasks that we introduce. We find that, although some simple span representations are fairly reliable across tasks, in general the optimal span representation varies by task, and can also vary within different facets of individual tasks. We also find that the choice of span representation has a bigger impact with a fixed pretrained encoder than with a fine-tuned encoder.

Preprint, 2020, arXiv

A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling

This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a well-known framework using GAP to localize visual objects given image-level labels. While most of the previous works on weakly supervised AED used recurrent layers with attention-based mechanism to localize acoustic events, the proposed network directly localizes events using the feature map extracted by DenseNet without any recurrent layers. In the audio tagging task of DCASE 2017, our method significantly outperforms the state-of-the-art method in F1 score by 5.3% on the dev set, and 6.0% on the eval set in terms of absolute values. For weakly supervised AED task in DCASE 2018, our model outperforms the state-of-the-art method in event-based F1 by 8.1% on the dev set, and 0.5% on the eval set in terms of absolute values, by using data augmentation and tri-training to leverage unlabeled data.
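The idea of using global average pooling to get both a clip-level tag and a frame-level localization can be sketched in a few lines. This is a minimal class-activation-style illustration in the spirit of Zhou et al., not the network described in the abstract: the random `feature_map` stands in for the DenseNet output, and the names `class_weights` and `gap_tagging_and_localization` are assumptions made here.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gap_tagging_and_localization(feature_map, class_weights):
    """feature_map: (T, C) per-frame features; class_weights: (C, K) linear classifier.

    Returns clip-level probabilities (K,) and frame-level probabilities (T, K).
    """
    # Applying the classifier to every frame gives a per-frame activation map.
    frame_logits = feature_map @ class_weights            # (T, K)
    # Pooling the features over time, then classifying, gives the clip-level tag.
    clip_logits = feature_map.mean(axis=0) @ class_weights  # (K,)
    return sigmoid(clip_logits), sigmoid(frame_logits)


rng = np.random.default_rng(0)
feature_map = rng.normal(size=(50, 8))   # hypothetical: 50 frames, 8 channels
class_weights = rng.normal(size=(8, 3))  # hypothetical: 3 event classes

clip_probs, frame_probs = gap_tagging_and_localization(feature_map, class_weights)
print(clip_probs.shape, frame_probs.shape)
```

Because the classifier is linear, the clip-level logit equals the time average of the frame-level logits, which is why training with clip-level labels alone can still yield a usable frame-level map without recurrent layers.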

Preprint, 2020, arXiv

An Empirical Study of Usages, Updates and Risks of Third-Party Libraries in Java Projects

Third-party libraries are a central building block to develop software systems. However, outdated third-party libraries are commonly used, and developers are usually less aware of the potential risks. Therefore, a quantitative and holistic study on usages, updates and risks of third-party libraries can provide practical insights to improve the ecosystem sustainably. In this paper, we conduct such a study in the Java ecosystem. Specifically, we conduct a library usage analysis (e.g., usage intensity and outdatedness) and a library update analysis (e.g., update intensity and delay) using 806 open-source projects. The two analyses aim to quantify usage and update practices holistically from the perspective of both open-source projects and third-party libraries. Then, we conduct a library risk analysis (e.g., potential risk and developer response) in terms of bugs with 15 popularly-used third-party libraries. This analysis aims to quantify the potential risk of using outdated libraries and the developer response to the risk. Our findings from the three analyses provide practical insights to developers and researchers on problems and potential solutions in maintaining third-party libraries (e.g., smart alerting and automated updating of outdated libraries). To demonstrate the usefulness of our findings, we propose a bug-driven alerting system for assisting developers to make confident decisions in updating third-party library versions. We have released our dataset to foster valuable applications and improve the ecosystem.

Preprint, 2020, arXiv

Behavior variations and their implications for popularity promotions: From elites to mass in Weibo

The boom in social media, where information is produced and consumed simultaneously, implies a crucial role for online user influence in determining content popularity. In particular, understanding behavior variations between the influential elites and the mass grassroots is an important issue in communication. However, how their behavior varies across user categories and content domains, and how these differences influence content popularity, are rarely addressed. From a novel view of seven content domains, a detailed picture of behavior variations among five user groups, from the views of both elites and the mass, is drawn in Weibo, one of the most popular Twitter-like services in China. Interestingly, elites post more diverse content with video links, while the mass possess retweeters of higher loyalty. Based on these variations, user-oriented actions for enhancing content popularity are discussed and tested. The most surprising finding is that content diversity does not always bring more retweets, and the mass and elites should promote content popularity by increasing their retweeter counts and loyalty, respectively. Our results demonstrate for the first time the possibility of highly individualized strategies for popularity promotion in social media, instead of a universal principle.

Preprint, 2020, arXiv

Few-shot acoustic event detection via meta-learning

We study few-shot acoustic event detection (AED) in this paper. Few-shot learning enables detection of new events with very limited labeled data. Compared to other research areas like computer vision, few-shot learning for audio recognition has been under-studied. We formulate the few-shot AED problem and explore different ways of utilizing traditional supervised methods for this setting, as well as a variety of meta-learning approaches, which are conventionally used to solve few-shot classification problems. Compared to supervised baselines, meta-learning models achieve superior performance, showing their effectiveness in generalizing to new audio events. Our analysis, including the impact of initialization and domain discrepancy, further validates the advantage of meta-learning approaches in few-shot AED.

Preprint, 2020, arXiv

Fusion rules from entanglement

We derive some of the axioms of the algebraic theory of anyons [A. Kitaev, Ann. Phys., 321, 2 (2006)] from a conjectured form of entanglement area law for two-dimensional gapped systems. We derive the fusion rules of topological charges and show that the multiplicities of the fusion rules satisfy these axioms. Moreover, even though we make no assumption about the exact value of the constant sub-leading term of the entanglement entropy of a disk-like region, this term is shown to be equal to $\ln \mathcal{D}$, where $\mathcal{D}$ is the total quantum dimension of the underlying anyon theory. These derivations are rigorous and follow from the entanglement area law alone. More precisely, our framework starts from two local entropic constraints, which are implied by the area law. From these constraints, we prove what we refer to as the "isomorphism theorem." The existence of superselection sectors and fusion multiplicities follows from this theorem, even without assuming anything about the parent Hamiltonian. These objects and the axioms of the anyon theory are shown to emerge from the structure and the internal self-consistency relations of the information convex sets.

Preprint, 2020, arXiv

Interactive, Effort-Aware Library Version Harmonization

As a combined result of heavy dependence on third-party libraries, flexible mechanisms for declaring dependencies, and a growing number of modules in a project, different modules of a project often depend directly on multiple versions of the same third-party library. Such library version inconsistencies can increase dependency maintenance cost, or even lead to dependency conflicts when modules are inter-dependent. Although automated build tools (e.g., Maven's enforcer plugin) provide partial support to detect library version inconsistencies, they do not provide any support to harmonize inconsistent library versions. We first conduct a survey with 131 Java developers from GitHub to retrieve first-hand information about the root causes, detection methods, reasons for fixing or not fixing, fixing strategies, fixing efforts, and tool expectations on library version inconsistencies. Then, based on the insights from our survey, we propose LibHarmo, an interactive, effort-aware library version harmonization technique, to detect library version inconsistencies, interactively suggest a harmonized version with the least harmonization effort based on library API usage analysis, and refactor build configuration files. LibHarmo is currently developed for Java Maven projects. Our experimental study on 443 highly-starred Java Maven projects from GitHub indicates that i) LibHarmo identifies 621 library version inconsistencies covering 152 (34.3%) of the projects, and ii) on average, harmonization affects 1 and 12 library API calls due to deleted and changed library APIs, respectively, in the harmonized version. Five library version inconsistencies have been confirmed, and one of them has already been harmonized by developers.

Preprint, 2020, arXiv

Latency-Aware Differentiable Neural Architecture Search

Differentiable neural architecture search methods have become popular in recent years, mainly due to their low search costs and flexibility in designing the search space. However, these methods have difficulty optimizing the network for hardware, so the searched network is often hardware-unfriendly. This paper addresses the problem by adding a differentiable latency loss term to the optimization, so that the search process can trade off accuracy against latency through a balancing coefficient. The core of latency prediction is to encode each network architecture and feed it into a multi-layer regressor, whose training data can be easily collected by randomly sampling a number of architectures and evaluating them on the hardware. We evaluate our approach on NVIDIA Tesla-P100 GPUs. With 100K sampled architectures (requiring a few hours), the latency prediction module reaches a relative error below 10%. Equipped with this module, the search method can reduce latency by 20% while preserving accuracy. Our approach can also be transplanted to a wide range of hardware platforms with very little effort, or used to optimize other non-differentiable factors such as power consumption.
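The accuracy/latency trade-off described in this abstract can be sketched as a single scalar objective. This is a minimal illustration under stated assumptions, not the paper's implementation: a linear map stands in for the multi-layer regressor, the architecture is encoded as softmax-normalized operation weights per edge, and the names `alpha`, `w`, `b`, and `lam` (the balancing coefficient) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def predicted_latency(alpha, w, b):
    """Encode the architecture as per-edge operation probabilities and map the
    flattened encoding to a latency estimate with a linear stand-in regressor."""
    encoding = softmax(alpha, axis=-1).ravel()
    return float(encoding @ w + b)


def total_loss(task_loss, alpha, w, b, lam):
    """Task loss plus a latency penalty weighted by the balancing coefficient."""
    return task_loss + lam * predicted_latency(alpha, w, b)


# Hypothetical search space: 2 edges, 3 candidate operations per edge.
alpha = np.zeros((2, 3))                       # uniform architecture weights
w = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 3.0])   # assumed per-operation latency costs
b = 0.5                                        # assumed fixed overhead

loss = total_loss(task_loss=1.0, alpha=alpha, w=w, b=b, lam=0.1)
print(round(loss, 4))
```

Because the latency estimate is a smooth function of `alpha`, its gradient can be added to the gradient of the task loss; raising `lam` pushes the search toward cheaper operations at some cost in accuracy.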

Preprint, 2020, arXiv

Verlinde formula from entanglement

We derive the Verlinde formula from a recently advocated set of axioms about entanglement entropy [B. Shi, K. Kato, I. H. Kim, arXiv:1906.09376 (2019)]. For any state that obeys these axioms, we can define a quantity that can be identified as the topological $S$-matrix of an abstract anyon theory. We show that the $S$-matrix is unitary and that it recovers the fusion multiplicities of the underlying anyon theory through the Verlinde formula. Importantly, we rigorously prove the modularity of the theory, which further implies that the mutual braiding statistics of anyons are nontrivial. The key to the proof is a generalized quantum state merging technique, which generates a topology beyond that of any subsystem of the original physical system.
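For readers unfamiliar with the Verlinde formula, $N_{ab}^{c} = \sum_x S_{ax} S_{bx} S_{cx}^{*} / S_{1x}$, a small numerical check may help. The sketch below uses the Fibonacci anyon $S$-matrix as a standard textbook example; it is not taken from the paper, and the function name `verlinde` is an assumption made here. Index 0 in the code plays the role of the vacuum label 1.

```python
import numpy as np


def verlinde(S):
    """Fusion multiplicities N[a, b, c] = sum_x S[a,x] S[b,x] conj(S[c,x]) / S[0,x]."""
    return np.einsum("ax,bx,cx->abc", S, S, np.conj(S) / S[0]).real


# Fibonacci anyons: labels (1, tau), with golden ratio phi.
phi = (1 + np.sqrt(5)) / 2
S = np.array([[1.0, phi], [phi, -1.0]]) / np.sqrt(1 + phi**2)

# The S-matrix should be unitary.
assert np.allclose(S @ S.conj().T, np.eye(2))

N = np.rint(verlinde(S)).astype(int)
print(N[1, 1])  # multiplicities of tau x tau -> (1, tau)
```

The result reproduces the fusion rule $\tau \times \tau = 1 + \tau$, i.e. both multiplicities equal 1, which illustrates how a unitary $S$-matrix determines the fusion multiplicities.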

Preprint, 2015, arXiv

Basis invariant description of chemical equilibrium with implications for a recent axionic leptogenesis model

We provide a systematic treatment of chemical equilibrium in the presence of a specific type of time dependent background. The type of time dependent background we consider appears, for example, in recently proposed axion/Majoron leptogenesis models [1,2]. In describing the chemical equilibrium we use quantities which are invariant under redefinition of fermion phases (we refer to this redefinition as a change of basis for short), and therefore it is a basis invariant treatment. The change of the anomaly terms due to the change of the path integral measure [3,4] under a basis change is taken into account. We find it is useful to go back and forth between different bases, and there are insights which can be more easily obtained in one basis rather than another. A toy model is provided to illustrate the ideas. For the axion leptogenesis model [1], our result suggests that at $T > 10^{13}$ GeV, when sphaleron processes decouple, and $\Gamma_{B+L} \ll H < \Gamma_L$ (where $H$ is the Hubble parameter at temperature $T$ and $\Gamma_L$ is the $\Delta L = 2$ lepton number violating interaction rate), the amount of $B-L$ created is controlled by the smallness of the sphaleron interaction rate, $\Gamma_{B+L}$. Therefore it is not as efficient as described. In addition, we notice an interesting modification of gauge boson dispersion relations at subleading order.
