Researcher profile

Zhihua Wei

Zhihua Wei contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
11works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

11 published item(s)

preprint2026arXiv

Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement

Large Vision-Language Models (LVLMs) have achieved remarkable performance on diverse vision-language tasks. However, LVLMs still suffer from hallucinations, generating text that contradicts the visual input. Existing research has primarily focused on mitigating object hallucinations, but often overlooks more complex relation hallucinations, particularly action relations involving interactions between objects. In this study, we empirically observe that the primary cause of action-relation hallucinations in LVLMs is the insufficient attention allocated to visual information. Thus, we propose a framework to locate action-relevant image regions and enhance the LVLM's attention to those regions. Specifically, we define the Action-Relation Sensitivity (ARS) score to identify attention heads that are most sensitive to action-relation changes, thereby localizing action-relevant image regions that contain key visual cues. Then, we propose the Relation-aware Visual Enhancement (RVE) method to enhance the LVLM's attention to these action-relevant image regions. Extensive experiments demonstrate that, compared to existing baselines, our method achieves superior performance in mitigating action-relation hallucinations with negligible additional inference cost. Furthermore, it effectively generalizes to spatial-relation hallucinations and object hallucinations.

preprint2026arXiv

Multilingual Safety Alignment via Self-Distillation

Large language models (LLMs) exhibit severe multilingual safety misalignment: they possess strong safeguards in high-resource languages but remain highly vulnerable to jailbreak attacks in low-resource languages. Current safety alignment methods generally rely on high-quality response data for each target language, which is expensive and difficult to generate. In this paper, we propose a cross-lingual safeguard transfer framework named Multilingual Self-Distillation (MSD). This framework transfers an LLM's inherent safety capabilities from high-resource (e.g., English) to low-resource (e.g., Javanese) languages, overcoming the need for response data in any language. Our framework is flexible and can be integrated with different self-distillation strategies. Specifically, we implement two concrete methods -- on-policy MSD and off-policy MSD -- both of which enable effective cross-lingual safety transfer using only multilingual queries. Furthermore, we propose Dual-Perspective Safety Weighting (DPSW), a divergence measure to optimize the distillation objective. By jointly considering the perspectives of both the teacher and the student, DPSW adaptively increases the penalty weights on safety-critical tokens while reducing the weights on non-critical tokens. Extensive experiments on representative LLMs across diverse multilingual jailbreak and utility benchmarks demonstrate that our method consistently achieves superior multilingual safety performance. Notably, it generalizes effectively to more challenging datasets and unseen languages while preserving the model's general capabilities.

preprint2022arXiv

EfficientTDNN: Efficient Architecture Search for Speaker Recognition

Convolutional neural networks (CNNs), such as the time-delay neural network (TDNN), have shown their remarkable capability in learning speaker embedding. However, they meanwhile bring a huge computational cost in storage size, processing, and memory. Discovering the specialized CNN that meets a specific constraint requires a substantial effort of human experts. Compared with hand-designed approaches, neural architecture search (NAS) appears as a practical technique in automating the manual architecture design process and has attracted increasing interest in spoken language processing tasks such as speaker recognition. In this paper, we propose EfficientTDNN, an efficient architecture search framework consisting of a TDNN-based supernet and a TDNN-NAS algorithm. The proposed supernet introduces temporal convolution of different ranges of the receptive field and feature aggregation of various resolutions from different layers to TDNN. On top of it, the TDNN-NAS algorithm quickly searches for the desired TDNN architecture via weight-sharing subnets, which surprisingly reduces computation while handling the vast number of devices with various resources requirements. Experimental results on the VoxCeleb dataset show the proposed EfficientTDNN enables approximate $10^{13}$ architectures concerning depth, kernel, and width. Considering different computation constraints, it achieves a 2.20% equal error rate (EER) with 204M multiply-accumulate operations (MACs), 1.41% EER with 571M MACs as well as 0.94% EER with 1.45G MACs. Comprehensive investigations suggest that the trained supernet generalizes subnets not sampled during training and obtains a favorable trade-off between accuracy and efficiency.

preprint2022arXiv

Heterogeneous Global Graph Neural Networks for Personalized Session-based Recommendation

Predicting the next interaction of a short-term interaction session is a challenging task in session-based recommendation. Almost all existing works rely on item transition patterns, and neglect the impact of user historical sessions while modeling user preference, which often leads to non-personalized recommendation. Additionally, existing personalized session-based recommenders capture user preference only based on the sessions of the current user, but ignore the useful item-transition patterns from other user's historical sessions. To address these issues, we propose a novel Heterogeneous Global Graph Neural Networks (HG-GNN) to exploit the item transitions over all sessions in a subtle manner for better inferring user preference from the current and historical sessions. To effectively exploit the item transitions over all sessions from users, we propose a novel heterogeneous global graph that contains item transitions of sessions, user-item interactions and global co-occurrence items. Moreover, to capture user preference from sessions comprehensively, we propose to learn two levels of user representations from the global graph via two graph augmented preference encoders. Specifically, we design a novel heterogeneous graph neural network (HGNN) on the heterogeneous global graph to learn the long-term user preference and item representations with rich semantics. Based on the HGNN, we propose the Current Preference Encoder and the Historical Preference Encoder to capture the different levels of user preference from the current and historical sessions, respectively. To achieve personalized recommendation, we integrate the representations of the user current preference and historical interests to generate the final user preference representation. Extensive experimental results on three real-world datasets show that our model outperforms other state-of-the-art methods.

preprint2022arXiv

Intention Adaptive Graph Neural Network for Category-aware Session-based Recommendation

Session-based recommendation (SBR) is proposed to recommend items within short sessions given that user profiles are invisible in various scenarios nowadays, such as e-commerce and short video recommendation. There is a common scenario that user specifies a target category of items as a global filter, however previous SBR settings mainly consider the item sequence and overlook the rich target category information. Therefore, we define a new task called Category-aware Session-Based Recommendation (CSBR), focusing on the above scenario, in which the user-specified category can be efficiently utilized by the recommendation system. To address the challenges of the proposed task, we develop a novel method called Intention Adaptive Graph Neural Network (IAGNN), which takes advantage of relationship between items and their categories to achieve an accurate recommendation result. Specifically, we construct a category-aware graph with both item and category nodes to represent the complex transition information in the session. An intention-adaptive graph neural network on the category-aware graph is utilized to capture user intention by transferring the historical interaction information to the user-specified category domain. Extensive experiments on three real-world datasets are conducted to show our IAGNN outperforms the state-of-the-art baselines in the new task.

preprint2022arXiv

LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

Self-supervised speech representation learning has shown promising results in various speech processing tasks. However, the pre-trained models, e.g., HuBERT, are storage-intensive Transformers, limiting their scope of applications under low-resource settings. To this end, we propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically by pruning structured parameters. More precisely, we create a Transformer-based supernet that is nested with thousands of weight-sharing subnets and design a two-stage distillation strategy to leverage the contextualized latent representations from HuBERT. Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures concerning the embedding dimension, attention dimension, head number, feed-forward network ratio, and network depth. LightHuBERT outperforms the original HuBERT on ASR and five SUPERB tasks with the HuBERT size, achieves comparable performance to the teacher model in most tasks with a reduction of 29% parameters, and obtains a $3.5\times$ compression ratio in three SUPERB tasks, e.g., automatic speaker verification, keyword spotting, and intent classification, with a slight accuracy loss. The code and pre-trained models are available at https://github.com/mechanicalsea/lighthubert.

preprint2022arXiv

Multi-View Self-Attention Based Transformer for Speaker Recognition

Initially developed for natural language processing (NLP), Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms are originally designed for modeling textual sequence without considering the characteristics of speech and speaker modeling. Besides, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants with or without the proposed attention mechanism for speaker recognition. Specifically, to balance the capabilities of capturing global dependencies and modeling the locality, we propose a multi-view self-attention mechanism for speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods to learn speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism achieves improvement in the performance of speaker recognition, and the proposed speaker Transformer network attains excellent results compared with state-of-the-art models.

preprint2022arXiv

Multiple Choice Questions based Multi-Interest Policy Learning for Conversational Recommendation

Conversational recommendation system (CRS) is able to obtain fine-grained and dynamic user preferences based on interactive dialogue. Previous CRS assumes that the user has a clear target item. However, for many users who resort to CRS, they might not have a clear idea about what they really like. Specifically, the user may have a clear single preference for some attribute types (e.g. color) of items, while for other attribute types, the user may have multiple preferences or even no clear preferences, which leads to multiple acceptable attribute instances (e.g. black and red) of one attribute type. Therefore, the users could show their preferences over items under multiple combinations of attribute instances rather than a single item with unique combination of all attribute instances. As a result, we first propose a more realistic CRS learning setting, namely Multi-Interest Multi-round Conversational Recommendation, where users may have multiple interests in attribute instance combinations and accept multiple items with partially overlapped combinations of attribute instances. To effectively cope with the new CRS learning setting, in this paper, we propose a novel learning framework namely, Multi-Choice questions based Multi-Interest Policy Learning . In order to obtain user preferences more efficiently, the agent generates multi-choice questions rather than binary yes/no ones on specific attribute instance. Besides, we propose a union set strategy to select candidate items instead of existing intersection set strategy in order to overcome over-filtering items during the conversation. Finally, we design a Multi-Interest Policy Learning module, which utilizes captured multiple interests of the user to decide next action, either asking attribute instances or recommending items. Extensive experimental results on four datasets verify the superiority of our method for the proposed setting.

preprint2022arXiv

SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. We release our code and model at https://github.com/microsoft/SpeechT5.

preprint2022arXiv

Triples-to-Text Generation with Reinforcement Learning Based Graph-augmented Neural Networks

Considering a collection of RDF triples, the RDF-to-text generation task aims to generate a text description. Most previous methods solve this task using a sequence-to-sequence model or using a graph-based model to encode RDF triples and to generate a text sequence. Nevertheless, these approaches fail to clearly model the local and global structural information between and within RDF triples. Moreover, the previous methods also face the non-negligible problem of low faithfulness of the generated text, which seriously affects the overall performance of these models. To solve these problems, we propose a model combining two new graph-augmented structural neural encoders to jointly learn both local and global structural information in the input RDF triples. To further improve text faithfulness, we innovatively introduce a reinforcement learning (RL) reward based on information extraction (IE). We first extract triples from the generated text using a pretrained IE model and regard the correct number of the extracted triples as the additional RL reward. Experimental results on two benchmark datasets demonstrate that our proposed model outperforms the state-of-the-art baselines, and the additional reinforcement learning reward does help to improve the faithfulness of the generated text.

preprint2021arXiv

Temporal aware Multi-Interest Graph Neural Network For Session-based Recommendation

Session-based recommendation (SBR) is a challenging task, which aims at recommending next items based on anonymous interaction sequences. Despite the superior performance of existing methods for SBR, there are still several limitations: (i) Almost all existing works concentrate on single interest extraction and fail to disentangle multiple interests of user, which easily results in suboptimal representations for SBR. (ii) Furthermore, previous methods also ignore the multi-form temporal information, which is significant signal to obtain current intention for SBR. To address the limitations mentioned above, we propose a novel method, called \emph{Temporal aware Multi-Interest Graph Neural Network} (TMI-GNN) to disentangle multi-interest and yield refined intention representations with the injection of two level temporal information. Specifically, by appending multiple interest nodes, we construct a multi-interest graph for current session, and adopt the GNNs to model the item-item relation to capture adjacent item transitions, item-interest relation to disentangle the multi-interests, and interest-item relation to refine the item representation. Meanwhile, we incorporate item-level time interval signals to guide the item information propagation, and interest-level time distribution information to assist the scattering of interest information. Experiments on three benchmark datasets demonstrate that TMI-GNN outperforms other state-of-the-art methods consistently.