Researcher profile

Xin-Shun Xu

Xin-Shun Xu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2026arXiv

Multi-granularity Interactive Attention Framework for Residual Hierarchical Pronunciation Assessment

Automatic pronunciation assessment plays a crucial role in computer-assisted pronunciation training systems. Due to the ability to perform multiple pronunciation tasks simultaneously, multi-aspect multi-granularity pronunciation assessment methods are gradually receiving more attention and achieving better performance than single-level modeling tasks. However, existing methods only consider unidirectional dependencies between adjacent granularity levels, lacking bidirectional interaction among phoneme, word, and utterance levels and thus insufficiently capturing the acoustic structural correlations. To address this issue, we propose a novel residual hierarchical interactive method, HIA for short, that enables bidirectional modeling across granularities. As the core of HIA, the Interactive Attention Module leverages an attention mechanism to achieve dynamic bidirectional interaction, effectively capturing linguistic features at each granularity while integrating correlations between different granularity levels. We also propose a residual hierarchical structure to alleviate the feature forgetting problem when modeling acoustic hierarchies. In addition, we use 1-D convolutional layers to enhance the extraction of local contextual cues at each granularity. Extensive experiments on the speechocean762 dataset show that our model is comprehensively ahead of the existing state-of-the-art methods.

preprint2022arXiv

Online Enhanced Semantic Hashing: Towards Effective and Efficient Retrieval for Streaming Multi-Modal Data

With the vigorous development of multimedia equipment and applications, efficient retrieval of large-scale multi-modal data has become a trendy research topic. Thereinto, hashing has become a prevalent choice due to its retrieval efficiency and low storage cost. Although multi-modal hashing has drawn lots of attention in recent years, there still remain some problems. The first point is that existing methods are mainly designed in batch mode and not able to efficiently handle streaming multi-modal data. The second point is that all existing online multi-modal hashing methods fail to effectively handle unseen new classes which come continuously with streaming data chunks. In this paper, we propose a new model, termed Online enhAnced SemantIc haShing (OASIS). We design novel semantic-enhanced representation for data, which could help handle the new coming classes, and thereby construct the enhanced semantic objective function. An efficient and effective discrete online optimization algorithm is further proposed for OASIS. Extensive experiments show that our method can exceed the state-of-the-art models. For good reproducibility and benefiting the community, our code and data are already available in supplementary material and will be made publicly available.

preprint2022arXiv

Three-Stream Joint Network for Zero-Shot Sketch-Based Image Retrieval

The Zero-Shot Sketch-based Image Retrieval (ZS-SBIR) is a challenging task because of the large domain gap between sketches and natural images as well as the semantic inconsistency between seen and unseen categories. Previous literature bridges seen and unseen categories by semantic embedding, which requires prior knowledge of the exact class names and additional extraction efforts. And most works reduce domain gap by mapping sketches and natural images into a common high-level space using constructed sketch-image pairs, which ignore the unpaired information between images and sketches. To address these issues, in this paper, we propose a novel Three-Stream Joint Training Network (3JOIN) for the ZS-SBIR task. To narrow the domain differences between sketches and images, we extract edge maps for natural images and treat them as a bridge between images and sketches, which have similar content to images and similar style to sketches. For exploiting a sufficient combination of sketches, natural images, and edge maps, a novel three-stream joint training network is proposed. In addition, we use a teacher network to extract the implicit semantics of the samples without the aid of other semantics and transfer the learned knowledge to unseen classes. Extensive experiments conducted on two real-world datasets demonstrate the superiority of our proposed method.

preprint2022arXiv

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Recently, several Vision Transformer (ViT) based methods have been proposed for Fine-Grained Visual Classification (FGVC).These methods significantly surpass existing CNN-based ones, demonstrating the effectiveness of ViT in FGVC tasks.However, there are some limitations when applying ViT directly to FGVC.First, ViT needs to split images into patches and calculate the attention of every pair, which may result in heavy redundant calculation and unsatisfying performance when handling fine-grained images with complex background and small objects.Second, a standard ViT only utilizes the class token in the final layer for classification, which is not enough to extract comprehensive fine-grained information. To address these issues, we propose a novel ViT based fine-grained object discriminator for FGVC tasks, ViT-FOD for short. Specifically, besides a ViT backbone, it further introduces three novel components, i.e, Attention Patch Combination (APC), Critical Regions Filter (CRF), and Complementary Tokens Integration (CTI). Thereinto, APC pieces informative patches from two images to generate a new image so that the redundant calculation can be reduced. CRF emphasizes tokens corresponding to discriminative regions to generate a new class token for subtle feature learning. To extract comprehensive information, CTI integrates complementary information captured by class tokens in different ViT layers. We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.

preprint2014arXiv

Optimized Cartesian $K$-Means

Product quantization-based approaches are effective to encode high-dimensional data points for approximate nearest neighbor search. The space is decomposed into a Cartesian product of low-dimensional subspaces, each of which generates a sub codebook. Data points are encoded as compact binary codes using these sub codebooks, and the distance between two data points can be approximated efficiently from their codes by the precomputed lookup tables. Traditionally, to encode a subvector of a data point in a subspace, only one sub codeword in the corresponding sub codebook is selected, which may impose strict restrictions on the search accuracy. In this paper, we propose a novel approach, named Optimized Cartesian $K$-Means (OCKM), to better encode the data points for more accurate approximate nearest neighbor search. In OCKM, multiple sub codewords are used to encode the subvector of a data point in a subspace. Each sub codeword stems from different sub codebooks in each subspace, which are optimally generated with regards to the minimization of the distortion errors. The high-dimensional data point is then encoded as the concatenation of the indices of multiple sub codewords from all the subspaces. This can provide more flexibility and lower distortion errors than traditional methods. Experimental results on the standard real-life datasets demonstrate the superiority over state-of-the-art approaches for approximate nearest neighbor search.