Source author record

Ye Zhu

Ye Zhu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision, cond-mat.mtrl-sci, eess.AS, Machine Learning, Sound, cond-mat.mes-hall, eess.IV, math.DS, math.OC, Multimedia, nlin.AO, physics.soc-ph

Catalog footprint

What is connected

18 works

12 topics

4 close collaborators

preprint · 2026 · arXiv

LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models

We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth-boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
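
The abstract mentions using word-level alignments to construct training masks for infilling. As a rough illustration only, the sketch below turns word timestamps into a per-frame boolean mask. The function name `infill_mask`, the `(start, end)` span format in seconds, and the fixed `frame_rate` are all assumptions made for this example; the abstract does not describe the actual LEMAS-Edit masking procedure.

```python
import math
from typing import List, Tuple


def infill_mask(word_spans: List[Tuple[float, float]],
                edit_words: List[int],
                frame_rate: float,
                n_frames: int) -> List[bool]:
    """Return a per-frame mask that is True inside the words chosen for editing.

    word_spans: hypothetical (start_seconds, end_seconds) pair for each word.
    edit_words: indices into word_spans of the words to be regenerated.
    frame_rate: token frames per second (an assumed, fixed value).
    """
    mask = [False] * n_frames
    for idx in edit_words:
        start_s, end_s = word_spans[idx]
        # Round outward so the mask covers the whole word, then clip to range.
        first = max(0, math.floor(start_s * frame_rate))
        last = min(n_frames, math.ceil(end_s * frame_rate))
        for frame in range(first, last):
            mask[frame] = True
    return mask
```

A model trained for infilling would then predict the tokens at the masked frames from the unmasked context on both sides; rounding outward is a design choice that favors covering the full word over preserving boundary frames.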

preprint · 2026 · arXiv

MANGO: Natural Multi-speaker 3D Talking Head Generation via 2D-Lifted Enhancement

Current audio-driven 3D head generation methods mainly focus on single-speaker scenarios, lacking natural, bidirectional listen-and-speak interaction. Achieving seamless conversational behavior, where speaking and listening states transition fluidly, remains a key challenge. Existing 3D conversational avatar approaches rely on error-prone pseudo-3D labels that fail to capture fine-grained facial dynamics. To address these limitations, we introduce a novel two-stage framework, MANGO, which leverages pure image-level supervision through alternating training to mitigate the noise introduced by pseudo-3D labels, thereby achieving better alignment with real-world conversational behaviors. Specifically, in the first stage, a diffusion-based transformer with a dual-audio interaction module models natural 3D motion from multi-speaker audio. In the second stage, we use a fast 3D Gaussian Renderer to generate high-fidelity images and provide 2D-level photometric supervision for the 3D motions through alternating training. Additionally, we introduce MANGO-Dialog, a high-quality dataset with over 50 hours of aligned 2D-3D conversational data across 500+ identities. Extensive experiments demonstrate that our method achieves exceptional accuracy and realism in modeling two-person 3D dialogue motion, significantly advancing the fidelity and controllability of audio-driven talking heads.

preprint · 2026 · arXiv

RAD: A Dataset and Benchmark for Real-Life Anomaly Detection with Robotic Observations

Anomaly detection is a core capability for robotic perception and industrial inspection, yet most existing benchmarks are collected under controlled conditions with fixed viewpoints and stable illumination, failing to reflect real deployment scenarios. We introduce RAD (Realistic Anomaly Detection), a robot-captured, multi-view dataset designed to stress pose variation, reflective materials, and viewpoint-dependent defect visibility. RAD covers 13 everyday object categories and four realistic defect types--scratched, missing, stained, and squeezed--captured from over 60 robot viewpoints per object under uncontrolled lighting. We benchmark a wide range of state-of-the-art approaches, including 2D feature-based methods, 3D reconstruction pipelines, and vision-language models (VLMs), under a pose-agnostic setting. Surprisingly, we find that mature 2D feature-embedding methods consistently outperform recent 3D and VLM-based approaches at the image level, while the performance gap narrows for pixel-level localization. Our analysis reveals that reflective surfaces, geometric symmetry, and sparse viewpoint coverage fundamentally limit current geometry-based and zero-shot methods. RAD establishes a challenging and realistic benchmark for robotic anomaly detection, highlighting critical open problems beyond controlled laboratory settings.

preprint · 2022 · arXiv

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

Despite the considerable progress in automatic abdominal multi-organ segmentation from CT/MRI scans in recent years, a comprehensive evaluation of the models' capabilities is hampered by the lack of a large-scale benchmark from diverse clinical scenarios. Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods. To mitigate these limitations, we present AMOS, a large-scale, diverse, clinical dataset for abdominal organ segmentation. AMOS provides 500 CT and 100 MRI scans collected from multi-center, multi-vendor, multi-modality, multi-phase, multi-disease patients, each with voxel-level annotations of 15 abdominal organs, providing challenging examples and a test-bed for studying robust segmentation algorithms under diverse targets and scenarios. We further benchmark several state-of-the-art medical segmentation models to evaluate the status of the existing methods on this new challenging dataset. We have made our datasets, benchmark servers, and baselines publicly available, and hope to inspire future research. Information can be found at https://amos22.grand-challenge.org.

preprint · 2022 · arXiv

Leveraging Cross Feedback of User and Item Embeddings with Attention for Variational Autoencoder based Collaborative Filtering

Matrix factorization (MF) has been widely applied to collaborative filtering in recommendation systems. Its Bayesian variants can derive posterior distributions of user and item embeddings, and are more robust to sparse ratings. However, the Bayesian methods are restricted by their update rules for the posterior parameters due to the conjugacy of the priors and the likelihood. Variational autoencoders (VAE) can address this issue by capturing complex mappings between the posterior parameters and the data. However, current research on VAEs for collaborative filtering only considers the mappings based on the explicit data information while the implicit embedding information is overlooked. In this paper, we first derive evidence lower bounds (ELBO) for Bayesian MF models from two viewpoints: user-oriented and item-oriented. Based on the ELBOs, we propose a VAE-based Bayesian MF framework. It leverages not only the data but also the embedding information to approximate the user-item joint distribution. As suggested by the ELBOs, the approximation is iterative with cross feedback of user and item embeddings into each other's encoders. More specifically, user embeddings sampled at the previous iteration are fed to the item-side encoders to estimate the posterior parameters for the item embeddings at the current iteration, and vice versa. The estimation also attends to the cross-fed embeddings to further exploit useful information. The decoder then reconstructs the data via the matrix factorization over the currently re-sampled user and item embeddings.

preprint · 2022 · arXiv

Modelling host population support for combat adversaries

We consider a model of adversarial dynamics consisting of three populations, labelled Blue, Green and Red, which evolve under a system of first order nonlinear differential equations. Red and Blue populations are adversaries and interact via a set of Lanchester combat laws. Green is divided into three sub-populations: Red supporters, Blue supporters and Neutral. Green support for Red and Blue leads to more combat effectiveness for either side. From Green's perspective, if either Red or Blue exceeds the size that the local population can facilitate or tolerate, then support for that side diminishes; the corresponding Green population reverts to the neutral sub-population, which does not contribute to the combat effectiveness of either side. The mechanism by which supporters decide whether Blue or Red is too big is given by a logistic-type interaction term. The intent of the model is to examine the role of influence in complex adversarial situations typical in counter-insurgency, where victory requires a genuine balance between maintaining combat effectiveness and support from a third party whose backing is not always assured.
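
For readers unfamiliar with Lanchester combat laws, the sketch below integrates the textbook aimed-fire (square-law) form, dB/dt = -r·R and dR/dt = -b·B, with forward Euler. It is only a two-population illustration: the Green sub-populations and the logistic-type support term described in the abstract are not modeled, and the function name, parameter names, and step size are assumptions chosen for the example.

```python
def lanchester_aimed_fire(blue, red, blue_rate, red_rate, dt=0.01, steps=1000):
    """Forward-Euler integration of the aimed-fire Lanchester equations.

    dB/dt = -red_rate * R   (Blue losses are proportional to Red strength)
    dR/dt = -blue_rate * B  (Red losses are proportional to Blue strength)
    Integration stops early once either side is reduced to zero.
    """
    for _ in range(steps):
        d_blue = -red_rate * red
        d_red = -blue_rate * blue
        # Clamp at zero so a force never becomes negative.
        blue = max(0.0, blue + dt * d_blue)
        red = max(0.0, red + dt * d_red)
        if blue == 0.0 or red == 0.0:
            break
    return blue, red
```

With equal effectiveness on both sides, the larger force wins and retains roughly the square root of the difference of the squared initial strengths, which is why this form is called the square law.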

preprint · 2022 · arXiv

Point-Set Kernel Clustering

Measuring similarity between two objects is the core operation that existing clustering algorithms use to group similar objects into clusters. This paper introduces a new similarity measure called point-set kernel, which computes the similarity between an object and a set of objects. The proposed clustering procedure utilizes this new measure to characterize every cluster grown from a seed object. We show that the new clustering procedure is both effective and efficient, which enables it to deal with large-scale datasets. In contrast, existing clustering algorithms are either efficient or effective, but not both. In comparison with the state-of-the-art density-peak clustering and scalable kernel k-means clustering, we show that the proposed algorithm is more effective and runs orders of magnitude faster when applied to datasets of millions of data points on a commonly used computing machine.
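
To make the idea of a point-to-set similarity concrete, the sketch below assumes one natural reading: the average of a point-to-point kernel between the query point and every member of the set. Both the averaging form and the Gaussian kernel are assumptions made for illustration; the abstract does not state which base kernel the paper uses, and `point_set_similarity` is a hypothetical name.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as equal-length sequences."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * bandwidth ** 2))


def point_set_similarity(x, points, bandwidth=1.0):
    """Average kernel value between point x and every member of a non-empty set."""
    return sum(gaussian_kernel(x, y, bandwidth) for y in points) / len(points)
```

Under this reading, a point can be assigned to the cluster whose current members give it the highest set-level similarity, for example by comparing `point_set_similarity(x, cluster_a)` with `point_set_similarity(x, cluster_b)`.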

preprint · 2022 · arXiv

Quantized GAN for Complex Music Generation from Dance Videos

We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos. Our proposed framework takes dance video frames and human body motions as input, and learns to generate music samples that plausibly accompany the corresponding input. Unlike most existing conditional music generation works that generate specific types of mono-instrumental sounds using symbolic audio representations (e.g., MIDI), and that usually rely on pre-defined musical synthesizers, in this work we generate dance music in complex styles (e.g., pop, breaking, etc.) by employing a Vector Quantized (VQ) audio representation, and leverage both its generality and the high abstraction capacity of its symbolic and continuous counterparts. By performing an extensive set of experiments on multiple datasets, and following a comprehensive evaluation protocol, we assess the generative qualities of our proposal against alternatives. The attained quantitative results, which measure the music consistency, beats correspondence, and music diversity, demonstrate the effectiveness of our proposed method. Last but not least, we curate a challenging dance-music dataset of in-the-wild TikTok videos, which we use to further demonstrate the efficacy of our approach in real-world applications -- and which we hope will serve as a starting point for relevant future research.

preprint · 2022 · arXiv

Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition

Action recognition has been a hot topic in computer vision because of its wide application in vision systems. Previous approaches achieve improvement by fusing the modalities of the skeleton sequence and RGB video. However, such methods face a trade-off between accuracy and efficiency because of the high complexity of the RGB video network. To solve this problem, we propose a multi-modality feature fusion network that combines the modalities of the skeleton sequence and a single RGB frame instead of the RGB video, as the key information contained in the combination of skeleton sequence and RGB frame is close to that of the skeleton sequence and RGB video. In this way, the complementary information is retained while the complexity is reduced by a large margin. To better explore the correspondence of the two modalities, a two-stage fusion framework is introduced in the network. In the early fusion stage, we introduce a skeleton attention module that projects the skeleton sequence onto the single RGB frame to help the RGB frame focus on the limb movement regions. In the late fusion stage, we propose a cross-attention module to fuse the skeleton feature and the RGB feature by exploiting their correlation. Experiments on two benchmarks, NTU RGB+D and SYSU, show that the proposed model achieves competitive performance compared with the state-of-the-art methods while reducing the complexity of the network.

preprint · 2022 · arXiv

Toward Clinically Assisted Colorectal Polyp Recognition via Structured Cross-modal Representation Consistency

Colorectal polyp classification is a critical clinical examination. To improve classification accuracy, most computer-aided diagnosis algorithms recognize colorectal polyps by adopting Narrow-Band Imaging (NBI). However, NBI is often underused in real clinical scenarios, since acquiring this specific image requires manually switching the light mode once polyps have been detected using White-Light (WL) images. To avoid this situation, we propose a novel method to directly achieve accurate white-light colonoscopy image classification by enforcing structured cross-modal representation consistency. In practice, a pair of multi-modal images, i.e. NBI and WL, are fed into a shared Transformer to extract hierarchical feature representations. Then a newly designed Spatial Attention Module (SAM) is adopted to calculate the similarities between the class token and patch tokens for a specific modality image. By aligning the class tokens and spatial attention maps of paired NBI and WL images at different levels, the Transformer achieves the ability to keep both global and local representation consistency for the two modalities. Extensive experimental results illustrate that the proposed method outperforms recent studies by a margin, realizing multi-modal prediction with a single Transformer while greatly improving the classification accuracy when only WL images are available.

preprint · 2021 · arXiv

Learning Audio-Visual Correlations from Variational Cross-Modal Generation

People can easily imagine the potential sound while seeing an event. This natural synchronization between audio and visual signals reveals their intrinsic correlations. To this end, we propose to learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner; the learned correlations can then be readily applied in multiple downstream tasks such as audio-visual cross-modal localization and retrieval. We introduce a novel Variational AutoEncoder (VAE) framework that consists of Multiple encoders and a Shared decoder (MS-VAE) with an additional Wasserstein distance constraint to tackle the problem. Extensive experiments demonstrate that the optimized latent representation of the proposed MS-VAE can effectively learn the audio-visual correlations and can be readily applied in multiple audio-visual downstream tasks to achieve competitive performance even without any given label information during training.

preprint · 2020 · arXiv

Adversarial decision strategies in multiple network phased oscillators: the Blue-Green-Red Kuramoto-Sakaguchi model

We consider a model of three interacting sets of decision-making agents, labeled Blue, Green and Red, represented as coupled phased oscillators subject to frustrated synchronisation dynamics. The agents are coupled on three networks of differing topologies, with interactions modulated by different cross-population frustrations, internal and cross-network couplings. The intent of the dynamic model is to examine the degree to which two of the groups of decision-makers, Blue and Red, are able to realise a strategy of being ahead of each other's decision-making cycle while internally seeking synchronisation of this process -- all in the context of further interactions with the third population, Green. To enable this analysis, we perform a significant dimensional reduction approximation and stability analysis. We compare this to a numerical solution for a range of internal and cross-network coupling parameters to investigate various synchronisation regimes and critical thresholds. The comparison reveals good agreement for appropriate parameter ranges. Performing parameter sweeps, we reveal that Blue's pursuit of a strategy of staying too far ahead of Red's decision cycles triggers a second-order effect of the Green population being ahead of Blue's cycles. This behaviour has implications for the dynamics of multiple interacting social groups with both cooperative and competitive processes.
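
As background for the frustrated synchronisation dynamics mentioned above, the sketch below evaluates the textbook all-to-all Kuramoto-Sakaguchi equation, dθ_i/dt = ω_i + (K/N) Σ_j sin(θ_j − θ_i − α), together with the usual order parameter. This is a single-population simplification chosen for illustration: the three networks, cross-population frustrations, and cross-network couplings of the Blue-Green-Red model are not represented, and all function and parameter names are assumptions.

```python
import cmath
import math


def kuramoto_sakaguchi_rhs(phases, omegas, coupling, frustration):
    """Phase velocities for all-to-all Kuramoto-Sakaguchi oscillators.

    phases: current phase of each oscillator, in radians.
    omegas: natural frequency of each oscillator.
    coupling: global coupling strength K.
    frustration: phase lag alpha; alpha = 0 recovers the plain Kuramoto model.
    """
    n = len(phases)
    return [
        omegas[i] + (coupling / n) * sum(
            math.sin(phases[j] - phases[i] - frustration) for j in range(n)
        )
        for i in range(n)
    ]


def order_parameter(phases):
    """Magnitude of the mean phase vector: 1 means full phase synchrony."""
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))
```

A nonzero frustration shifts the collective frequency of a synchronised group away from the mean natural frequency, which is one way a phase lead or lag between groups can arise in models of this kind.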

preprint · 2020 · arXiv

Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

With rising concerns about AI systems that are given direct access to abundant sensitive information, researchers seek to develop more reliable AI with implicit information sources. To this end, in this paper, we introduce a new task called video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames. Specifically, one of the intelligent agents - Q-BOT - is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent who has already seen the entire video, assists Q-BOT to accomplish the goal by providing answers to those questions. We propose a QA-Cooperative Network with a dynamic dialog history update learning mechanism to transfer knowledge from A-BOT to Q-BOT, thus helping Q-BOT to better describe the video. Extensive experiments demonstrate that Q-BOT can effectively learn to describe an unseen video by the proposed model and the cooperative learning method, achieving the promising performance where Q-BOT is given the full ground truth history dialog.

preprint · 2020 · arXiv

Hierarchical HMM for Eye Movement Classification

In this work, we tackle the problem of ternary eye movement classification, which aims to separate fixations, saccades and smooth pursuits from raw eye positional data. Efficient classification of these different types of eye movements helps to better analyze and utilize eye tracking data. Unlike existing methods that detect eye movements using several pre-defined threshold values, we propose a hierarchical Hidden Markov Model (HMM) statistical algorithm for detecting fixations, saccades and smooth pursuits. The proposed algorithm leverages different features from the recorded raw eye tracking data with a hierarchical classification strategy, separating one type of eye movement at a time. Experimental results demonstrate the effectiveness and robustness of the proposed method by achieving competitive or better performance compared to the state-of-the-art methods.

preprint · 2016 · arXiv

Revisiting copy-move forgery detection by considering realistic image with similar but genuine objects

Many images of natural or man-made scenes contain Similar but Genuine Objects (SGO). This poses a challenge to existing Copy-Move Forgery Detection (CMFD) methods, which match key points or blocks based solely on pairwise similarity in the scene. To address this issue, we propose a novel CMFD method using Scaled Harris Feature Descriptors (SHFD) that performs consistently well on forged images with SGO. It involves the following main steps: (i) Pyramid scale space and orientation assignment are used to maintain scaling and rotation invariance; (ii) Combined features are applied for precise texture description; (iii) Similar features of two points are matched and RANSAC is used to remove the false matches. The experimental results indicate that the proposed algorithm is effective in detecting SGO and copy-move forgery, and compares favorably to existing methods. Our method exhibits high robustness even when an image has been subjected to geometric transformation and post-processing.

preprint · 2014 · arXiv

Atomically precise interfaces from non-stoichiometric deposition

Complex oxide heterostructures display some of the most chemically abrupt, atomically precise interfaces, which is advantageous when constructing new interface phases with emergent properties by juxtaposing incompatible ground states. One might assume that atomically precise interfaces result from stoichiometric growth, but here we show that the most precise control is obtained for non-stoichiometric growth where differing surface energies can be compensated by surfactant-like effects. For the precise growth of Sr$_{n+1}$Ti$_n$O$_{3n+1}$ Ruddlesden-Popper (RP) phases, stoichiometric deposition leads to the loss of the first RP rock-salt double layer, but growing with a strontium-rich surface layer restores the bulk stoichiometry and ordering of the subsurface RP structure. Our results dramatically expand the materials that can be prepared in epitaxial heterostructures with precise interface control---from just the $n=\infty$ end members (perovskites) to the entire RP family---enabling the exploration of novel quantum phenomena at a richer variety of oxide interfaces.

preprint · 2012 · arXiv

Determining On-Axis Crystal Thickness with Quantitative Position-Averaged Incoherent Bright-Field Signal in an Aberration-corrected STEM

An accurate determination of specimen thickness is essential for quantitative analytical electron microscopy. Here we demonstrate that a position-averaged incoherent bright-field signal recorded on an absolute scale can be used to determine the thickness of on-axis crystals with a precision of $\pm 1.6$ nm. This method measures both the crystalline and the non-crystalline parts (surface amorphous layers) of the sample. However, it avoids the systematic error resulting from surface plasmon contributions to the inelastic mean free path thickness estimated by electron energy loss spectroscopy.

preprint · 2010 · arXiv

Imaging Grains and Grain Boundaries in Single-Layer Graphene: An Atomic Patchwork Quilt

The properties of polycrystalline materials are often dominated by the size of their grains and by the atomic structure of their grain boundaries. These effects should be especially pronounced in 2D materials, where even a line defect can divide and disrupt a crystal. These issues take on practical significance in graphene, a hexagonal two-dimensional crystal of carbon atoms. Single-atom-thick graphene sheets can now be produced by chemical vapor deposition on up to meter scales, making their polycrystallinity almost unavoidable. Theoretically, graphene grain boundaries are predicted to have distinct electronic, magnetic, chemical, and mechanical properties which strongly depend on their atomic arrangement. Yet, because of the five-order-of-magnitude size difference between grains and the atoms at grain boundaries, few experiments have fully explored the graphene grain structure. Here, we use a combination of old and new transmission electron microscope techniques to bridge these length scales. Using atomic-resolution imaging, we determine the location and identity of every atom at a grain boundary and find that different grains stitch together predominantly via pentagon-heptagon pairs. We then use diffraction-filtered imaging to rapidly map the location, orientation, and shape of several hundred grains and boundaries, where only a handful have been previously reported. The resulting images reveal an unexpectedly small and intricate patchwork of grains connected by tilt boundaries. By correlating grain imaging with scanned probe measurements, we show that these grain boundaries dramatically weaken the mechanical strength of graphene membranes, but do not measurably alter their electrical properties. These techniques open a new window for studies on the structure, properties, and control of grains and grain boundaries in graphene and other 2D materials.
