Source author record

Quan Wang

Quan Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Catalog footprint

What is connected

48 works

26 topics

4 close collaborators

preprint, 2026, arXiv

On the global stability and large time behavior of solutions of the Boussinesq equations

We study the two dimensional viscous Boussinesq equations, which model stratified flows in a circular domain under the influence of a general gravitational potential $f$. First, we show that the Boussinesq equations admit steady-state solutions only in the form of hydrostatic equilibria, $(\mathbf{u},ρ,p) = (0, ρ_s, p_s)$, where the pressure gradient satisfies $\nabla p_s = -ρ_s \nabla f$. Moreover, the relation between $ρ_s$ and $f$ is constrained by $(\partial_y ρ_s, -\partial_x ρ_s) \cdot (\partial_x f, \partial_y f) = 0$, which allows us to write $\nabla ρ_s = h(x,y) \nabla f$ for some scalar function $h(x,y)$. Second, we prove that any hydrostatic equilibrium $(0, ρ_s, p_s)$ is linearly unstable if $h(x_0, y_0) > 0$ at some point $(x, y) = (x_0, y_0)$. This instability coincides with the classical Rayleigh--Taylor instability. Third, by employing a series of regularity estimates, we reveal that although the presence of the Rayleigh--Taylor instability makes perturbations around the unstable equilibrium grow exponentially in time, the system ultimately converges to a state of hydrostatic equilibrium. The analysis is carried out for perturbations about an arbitrary hydrostatic equilibrium, covering both stable and unstable configurations. Finally, we derive a necessary and sufficient condition on the initial density perturbation under which the density converges to a profile of the form $-γf + β$ with constants $γ, β> 0$. This result underscores the system's inherent tendency to settle into a hydrostatic state, even in the presence of Rayleigh--Taylor instability.
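The orthogonality constraint quoted in this abstract follows from taking the two-dimensional curl of the hydrostatic balance. A short sketch, assuming $\rho_s$ and $f$ are smooth enough for the mixed partial derivatives to commute (a regularity assumption not stated in the abstract):

```latex
\begin{align*}
\nabla p_s = -\rho_s \nabla f
&\;\Longrightarrow\;
\partial_x(\partial_y p_s) - \partial_y(\partial_x p_s)
= -\partial_x(\rho_s\,\partial_y f) + \partial_y(\rho_s\,\partial_x f) \\
&\;\Longrightarrow\;
0 = \partial_y \rho_s\,\partial_x f - \partial_x \rho_s\,\partial_y f
= (\partial_y \rho_s,\, -\partial_x \rho_s)\cdot(\partial_x f,\, \partial_y f).
\end{align*}
```

The terms containing $\rho_s\,\partial_x\partial_y f$ cancel, so $\nabla \rho_s$ is parallel to $\nabla f$ wherever $\nabla f \neq 0$, which is what permits writing $\nabla \rho_s = h(x,y)\,\nabla f$.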

preprint, 2026, arXiv

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

Recent large vision-language models (VLMs) remain fundamentally constrained by a persistent dichotomy: understanding and generation are treated as distinct problems, leading to fragmented architectures, cascaded pipelines, and misaligned representation spaces. We argue that this divide is not merely an engineering artifact, but a structural limitation that hinders the emergence of native multimodal intelligence. Hence, we introduce SenseNova-U1, a native unified multimodal paradigm built upon NEO-unify, in which understanding and generation evolve as synergistic views of a single underlying process. We launch two native unified variants, SenseNova-U1-8B-MoT and SenseNova-U1-A3B-MoT, built on dense (8B) and mixture-of-experts (30B-A3B) understanding baselines, respectively. Designed from first principles, they rival top-tier understanding-only VLMs across text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. Meanwhile, they deliver strong semantic consistency and visual fidelity, excelling in conventional or knowledge-intensive any-to-image (X2I) synthesis, complex text-rich infographic generation, and interleaved vision-language generation, with or without think patterns. Beyond performance, we show detailed model design, data preprocessing, pre-/post-training, and inference strategies to support community research. Last but not least, preliminary evidence demonstrates that our models extend beyond perception and generation, performing strongly in vision-language-action (VLA) and world model (WM) scenarios. This points toward a broader roadmap where models do not translate between modalities, but think and act across them in a native manner. Multimodal AI is no longer about connecting separate systems, but about building a unified one and trusting the necessary capabilities to emerge from within.

preprint, 2025, arXiv

Holistic Evaluation of Multimodal LLMs on Spatial Intelligence

Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence (SI). We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and a growing collection of newly curated ones, enabling systematic evaluation of state-of-the-art models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study then reveals that (1) GPT-5 demonstrates unprecedented strength in SI, yet (2) still falls short of human performance significantly across a broad spectrum of SI-tasks. Moreover, we (3) show that SI-tasks expose greater model capability deficiency than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans, yet fail the most advanced multimodal models. EASI is an ongoing community effort: we have open-sourced the EASI codebase that provides a one-stop and reproducible solution with standardized interfaces, integrated protocols and prompts that significantly reduce the friction of configuring and running multiple benchmarks; we have also launched an accompanying EASI leaderboard to provide a continually updated snapshot of model performance across the full SI spectrum, accelerating collective progress toward robust SI.

preprint, 2025, arXiv

Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning

Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
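The two ingredients named in this abstract, a probability margin for mining Decision-Ambiguous Samples and intra-group relative advantages, can be illustrated with a minimal sketch. The function names, the sample dictionary layout, the threshold of 0.1, and the mean/standard-deviation normalization are all assumptions made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev
from typing import Dict, List


def probability_margin(probs: Dict[str, float], truth: str) -> float:
    """Ground-truth probability minus the strongest competing answer's probability."""
    best_other = max(p for answer, p in probs.items() if answer != truth)
    return probs[truth] - best_other


def mine_das(samples: List[dict], threshold: float = 0.1) -> List[dict]:
    """Keep samples whose absolute margin is below a (hypothetical) threshold."""
    return [
        s for s in samples
        if abs(probability_margin(s["probs"], s["truth"])) < threshold
    ]


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one group of sampled answers (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so none is preferred over the others.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Toy reference-policy outputs; the probabilities are made up for illustration.
samples = [
    {"id": "a", "truth": "yes", "probs": {"yes": 0.90, "no": 0.10}},
    {"id": "b", "truth": "yes", "probs": {"yes": 0.52, "no": 0.48}},
    {"id": "c", "truth": "no", "probs": {"yes": 0.53, "no": 0.47}},
    {"id": "d", "truth": "no", "probs": {"yes": 0.90, "no": 0.10}},
]
das_ids = [s["id"] for s in mine_das(samples)]
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Here samples "b" and "c" are kept because the model is nearly undecided, while "a" (confidently right) and "d" (confidently wrong) are filtered out.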

preprint, 2024, arXiv

Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions and to investigate the resulting behavior of LLMs. To address this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also carefully design the candidate task taxonomy with finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further developments. Unlike existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and showing that there is still a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.

preprint, 2024, arXiv

Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering

While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems. In this paper, we demonstrate that a multi-stage clustering strategy that uses different clustering algorithms for input of different lengths can address multi-faceted challenges of on-device speaker diarization applications. Specifically, a fallback clusterer is used to handle short-form inputs; a main clusterer is used to handle medium-length inputs; and a pre-clusterer is used to compress long-form inputs before they are processed by the main clusterer. Both the main clusterer and the pre-clusterer can be configured with an upper bound of the computational complexity to adapt to devices with different resource constraints. This multi-stage clustering strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.
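The length-based routing described in this abstract can be sketched as a small dispatcher. The thresholds, the function names, and the averaging-based compression below are hypothetical stand-ins chosen for illustration; the abstract does not state which clustering algorithms or limits are used.

```python
import math
from typing import List

Embedding = List[float]


def plan_stages(num_embeddings: int, short_max: int = 50, main_max: int = 300) -> List[str]:
    """Pick which clusterers run for an input of the given length.

    short_max and main_max are made-up thresholds; main_max plays the role
    of the configurable upper bound on the main clusterer's input size.
    """
    if num_embeddings < short_max:
        return ["fallback"]
    if num_embeddings <= main_max:
        return ["main"]
    return ["pre", "main"]


def compress(embeddings: List[Embedding], max_size: int) -> List[Embedding]:
    """Stand-in pre-clusterer: average consecutive groups so that at most
    max_size vectors reach the main clusterer."""
    group = math.ceil(len(embeddings) / max_size)
    merged = []
    for start in range(0, len(embeddings), group):
        chunk = embeddings[start:start + group]
        dim = len(chunk[0])
        merged.append([sum(e[i] for e in chunk) / len(chunk) for i in range(dim)])
    return merged


stages_short = plan_stages(10)
stages_medium = plan_stages(100)
stages_long = plan_stages(1000)
compressed = compress([[1.0], [3.0], [5.0], [7.0], [9.0]], max_size=2)
```

Bounding the number of vectors passed to the main clusterer is what caps its cost regardless of how long the recording is, which matches the resource-constrained setting the abstract describes.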

preprint, 2024, arXiv

USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models

We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of supervised and unsupervised data, demonstrating the utility of fine-tuning from a large generic foundation model for a downstream task. We analyze the performance of this multilingual speaker change detection model through a series of ablation studies. We show that the USM-SCD model can achieve more than 75% average speaker change detection F1 score across a test set that consists of data from 96 languages. On American English, the USM-SCD model can achieve an 85.8% speaker change detection F1 score across various public and internal test sets, beating the previous monolingual baseline model by 21% relative. We also show that we only need to fine-tune one-quarter of the trainable model parameters to achieve the best model performance. The USM-SCD model exhibits state-of-the-art ASR quality compared with a strong public ASR baseline, making it suitable to handle both tasks with negligible additional computational cost.

preprint, 2022, arXiv

A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation

Recent work has shown that it is possible to train a single model to perform joint acoustic echo cancellation (AEC), speech enhancement, and voice separation, thereby serving as a unified frontend for robust automatic speech recognition (ASR). The joint model uses contextual information, such as a reference of the playback audio, noise context, and speaker embedding. In this work, we propose a number of novel improvements to such a model. First, we improve the architecture of the Cross-Attention Conformer that is used to ingest noise context into the model. Second, we generalize the model to be able to handle varying lengths of noise context. Third, we propose Signal Dropout, a novel strategy that models missing contextual information. In the absence of one or more signals, the proposed model performs nearly as well as task-specific models trained without these signals; and when such signals are present, our system compares well against systems that require all context signals. Over the baseline, the final model retains a relative word error rate reduction of 25.0% on background speech when speaker embedding is absent, and 61.2% on AEC when device playback is absent.

preprint, 2022, arXiv

Attentive Temporal Pooling for Conformer-based Streaming Language Identification in Long-form Speech

In this paper, we introduce a novel language identification system based on conformer layers. We propose an attentive temporal pooling mechanism to allow the model to carry information in long-form audio via a recurrent form, such that the inference can be performed in a streaming fashion. Additionally, we investigate two domain adaptation approaches to allow adapting an existing language identification model without retraining the model parameters for a new domain. We perform a comparative study of different model topologies under different constraints of model size, and find that conformer-based models significantly outperform LSTM and transformer based models. Our experiments also show that attentive temporal pooling and domain adaptation improve model accuracy.
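To illustrate how attention-weighted pooling can be carried in a recurrent form, the sketch below keeps a running weighted sum and a running normalizer, so each new frame updates the pooled vector without revisiting earlier frames. This is a generic softmax-weighted pooling written for illustration; the scoring function (a single dot product with a hypothetical weight vector) and the class name are assumptions, not the mechanism from the paper.

```python
import math
from typing import List


class StreamingAttentivePool:
    """Softmax-weighted average of frames, updated one frame at a time."""

    def __init__(self, weights: List[float]) -> None:
        self.weights = list(weights)
        self.numerator = [0.0] * len(self.weights)  # running sum of exp(score) * frame
        self.denominator = 0.0                      # running sum of exp(score)

    def update(self, frame: List[float]) -> List[float]:
        score = sum(w * x for w, x in zip(self.weights, frame))
        alpha = math.exp(score)
        self.numerator = [n + alpha * x for n, x in zip(self.numerator, frame)]
        self.denominator += alpha
        return [n / self.denominator for n in self.numerator]


def batch_pool(weights: List[float], frames: List[List[float]]) -> List[float]:
    """Reference: the same pooling computed over the whole sequence at once."""
    scores = [sum(w * x for w, x in zip(weights, f)) for f in frames]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [
        sum(e * f[i] for e, f in zip(exps, frames)) / total
        for i in range(len(weights))
    ]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pool = StreamingAttentivePool([0.5, -0.5])
for frame in frames:
    streamed = pool.update(frame)
reference = batch_pool([0.5, -0.5], frames)
```

After the last frame, the streamed result agrees with the batch computation up to floating-point rounding, while the state kept between frames has a fixed size. A production version would also guard against overflow in `math.exp` for large scores.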

preprint, 2022, arXiv

Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory

Driving 3D characters to dance following a piece of music is highly challenging due to the spatial constraints applied to poses by choreography norms. In addition, the generated dance sequence also needs to maintain temporal coherency with different music genres. To tackle these challenges, we propose a novel music-to-dance framework, Bailando, with two powerful components: 1) a choreographic memory that learns to summarize meaningful dancing units from 3D pose sequence to a quantized codebook, 2) an actor-critic Generative Pre-trained Transformer (GPT) that composes these units to a fluent dance coherent to the music. With the learned choreographic memory, dance generation is realized on the quantized units that meet high choreography standards, such that the generated dancing sequences are confined within the spatial constraints. To achieve synchronized alignment between diverse motion tempos and music beats, we introduce an actor-critic-based reinforcement learning scheme to the GPT with a newly-designed beat-align reward function. Extensive experiments on the standard benchmark demonstrate that our proposed framework achieves state-of-the-art performance both qualitatively and quantitatively. Notably, the learned choreographic memory is shown to discover human-interpretable dancing-style poses in an unsupervised manner.

preprint, 2022, arXiv

Building Chinese Biomedical Language Models via Multi-Level Text Discrimination

Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized the field of NLP, not only in the general domain but also in the biomedical domain. Most prior efforts in building biomedical PLMs have resorted simply to domain adaptation and focused mainly on English. In this work we introduce eHealth, a Chinese biomedical PLM built from scratch with a new pre-training framework. This new framework pre-trains eHealth as a discriminator through both token- and sequence-level discrimination. The former is to detect input tokens corrupted by a generator and recover their original identities from plausible candidates, while the latter is to further distinguish corruptions of the same original sequence from those of others. As such, eHealth can learn language semantics at both token and sequence levels. Extensive experiments on 11 Chinese biomedical language understanding tasks of various forms verify the effectiveness and superiority of our approach. We release the pre-trained model at https://github.com/PaddlePaddle/Research/tree/master/KG/eHealth and will also release the code later.

preprint, 2022, arXiv

Closing the Gap between Single-User and Multi-User VoiceFilter-Lite

VoiceFilter-Lite is a speaker-conditioned voice separation model that plays a crucial role in improving speech recognition and speaker verification by suppressing overlapping speech from non-target speakers. However, one limitation of VoiceFilter-Lite, and other speaker-conditioned speech models in general, is that these models are usually limited to a single target speaker. This is undesirable as most smart home devices now support multiple enrolled users. In order to extend the benefits of personalization to multiple users, we previously developed an attention-based speaker selection mechanism and applied it to VoiceFilter-Lite. However, the original multi-user VoiceFilter-Lite model suffers from significant performance degradation compared with single-user models. In this paper, we devised a series of experiments to improve the multi-user VoiceFilter-Lite model. By incorporating a dual learning rate schedule and by using feature-wise linear modulation (FiLM) to condition the model with the attended speaker embedding, we successfully closed the performance gap between multi-user and single-user VoiceFilter-Lite models on single-speaker evaluations. At the same time, the new model can also be easily extended to support any number of users, and significantly outperforms our previously published model on multi-speaker evaluations.
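Feature-wise linear modulation (FiLM), mentioned in this abstract, scales and shifts each feature channel using parameters computed from a conditioning vector. A minimal sketch in plain Python follows; the linear projections, their toy weights, and the function names are assumptions for illustration and do not reflect the actual VoiceFilter-Lite layers.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weight: Matrix, bias: Vector, vec: Vector) -> Vector:
    """Plain affine map: weight @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def film(features: Vector, embedding: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Vector:
    """Scale and shift each feature using parameters derived from the embedding."""
    gamma = linear(w_gamma, b_gamma, embedding)  # per-feature scale
    beta = linear(w_beta, b_beta, embedding)     # per-feature shift
    return [g * x + b for g, x, b in zip(gamma, features, beta)]


# Toy values: a 2-dimensional speaker embedding modulating 2 features.
speaker_embedding = [1.0, 2.0]
features = [3.0, 4.0]
modulated = film(
    features,
    speaker_embedding,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[0.5, -0.5],
)
```

With these toy weights the scale equals the embedding itself and the shift is a constant, so the features [3.0, 4.0] become [3.5, 7.5]. In the multi-user setting described above, the conditioning vector would be the attended speaker embedding rather than a fixed one.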

preprint, 2022, arXiv

Compact and Robust Deep Learning Architecture for Fluorescence Lifetime Imaging and FPGA Implementation

This paper reports a bespoke adder-based deep learning network for time-domain fluorescence lifetime imaging (FLIM). By leveraging the l1-norm extraction method, we propose a 1-D Fluorescence Lifetime AdderNet (FLAN) without multiplication-based convolutions to reduce the computational complexity. Further, we compress fluorescence decays in the temporal dimension using a log-scale merging technique to discard redundant temporal information, yielding log-scaling FLAN (FLAN+LS). FLAN+LS achieves compression ratios of 0.11 and 0.23 compared with FLAN and a conventional 1-D convolutional neural network (1-D CNN), respectively, while maintaining high accuracy in retrieving lifetimes. We extensively evaluate FLAN and FLAN+LS using synthetic and real data. For synthetic data, we compare our networks with a traditional fitting method and other non-fitting, high-accuracy algorithms. Our networks attain a minor reconstruction error in different photon-count scenarios. For real data, we use fluorescent bead data acquired by a confocal microscope to validate the networks on real fluorophores, and our networks can differentiate beads with different lifetimes. Additionally, we implement the network architecture on a field-programmable gate array (FPGA) with a post-quantization technique to shorten the bit-width, thereby improving computing efficiency. FLAN+LS on hardware achieves the highest computing efficiency compared to 1-D CNN and FLAN. We also discuss the applicability of our network and hardware architecture for other time-resolved biomedical applications using photon-efficient, time-resolved sensors.

preprint, 2022, arXiv

Comparison of Two Methods for Calculating Magnetic Helicity in the Solar Corona

Due to the large magnetic Reynolds number, the magnetic helicity originating from the solar interior can be carried away through the photosphere into the corona. However, the relationship between the accumulated magnetic helicity flux through the photosphere and the magnetic helicity in the corona is still unclear. By selecting 36 newly emerging active regions in the 23rd solar cycle, we apply optical flow methods to derive the accumulated magnetic helicity through the photosphere ($H_m^p$) by using the sequential longitudinal magnetograms, use nonlinear force-free field extrapolation to obtain the 3D coronal magnetic field, and adopt finite volume methods to calculate the instantaneous relative magnetic helicity in the corona ($H_m^c$) by using vector magnetograms. It is found that the local correlation tracking (LCT)-based $H_m^p$ is larger than $H_m^c$ at a resolution of $1"$, and that the Differential Affine Velocity Estimator-based $H_m^p$ is more consistent with $H_m^c$ than the LCT-based $H_m^p$. $H_m^p$ is more consistent with $H_m^c$ when evaluated at $2"$ than at $1"$. Moreover, $H_m^c - H_m^p$ systematically shows consistency with the Hemispheric Helicity Rule (over 55%), no matter which resolution and method are used. These estimations suggest that the consistency of $H_m^c$ and $H_m^p$ is partly dependent on the resolution of the magnetograms and the calculation methods.

preprint, 2022, arXiv

CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

We introduce CVSS, a massively multilingual-to-English speech-to-speech translation (S2ST) corpus, covering sentence-level parallel S2ST pairs from 21 languages into English. CVSS is derived from the Common Voice speech corpus and the CoVoST 2 speech-to-text translation (ST) corpus, by synthesizing the translation text from CoVoST 2 into speech using state-of-the-art TTS systems. Two versions of translation speeches are provided: 1) CVSS-C: All the translation speeches are in a single high-quality canonical voice; 2) CVSS-T: The translation speeches are in voices transferred from the corresponding source speeches. In addition, CVSS provides normalized translation text which matches the pronunciation in the translation speech. On each version of CVSS, we built baseline multilingual direct S2ST models and cascade S2ST models, verifying the effectiveness of the corpus. To build strong cascade S2ST baselines, we trained an ST model on CoVoST 2, which outperforms the previous state-of-the-art trained on the corpus without extra data by 5.8 BLEU. Nevertheless, the performance of the direct S2ST models approaches the strong cascade baselines when trained from scratch, and with only 0.1 or 0.7 BLEU difference on ASR transcribed translation when initialized from matching ST models.

preprint, 2022, arXiv

Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition

Many neural network speaker recognition systems model each speaker using a fixed-dimensional embedding vector. These embeddings are generally compared using either linear or 2nd-order scoring and, until recently, do not handle utterance-specific uncertainty. In this work we propose scoring these representations in a way that can capture uncertainty, enroll/test asymmetry and additional non-linear information. This is achieved by incorporating a 2nd-stage neural network (known as a decision network) as part of an end-to-end training regimen. In particular, we propose the concept of decision residual networks which involves the use of a compact decision network to leverage cosine scores and to model the residual signal that's needed. Additionally, we present a modification to the generalized end-to-end softmax loss function to target the separation of same/different speaker scores. We observed significant performance gains for the two techniques.

preprint, 2022, arXiv

Fast fluorescence lifetime imaging analysis via extreme learning machine

We present a fast and accurate analytical method for fluorescence lifetime imaging microscopy (FLIM) using the extreme learning machine (ELM). We used extensive metrics to evaluate ELM and existing algorithms. First, we compared these algorithms using synthetic datasets. Results indicate that ELM can obtain higher fidelity, even in low-photon conditions. Afterwards, we used ELM to retrieve lifetime components from human prostate cancer cells loaded with gold nanosensors, showing that ELM also outperforms the iterative fitting and non-fitting algorithms. Compared with a computationally efficient neural network, ELM achieves comparable accuracy with less training and inference time. As there is no back-propagation process for ELM during the training phase, the training speed is much higher than that of existing neural network approaches. The proposed strategy is promising for edge computing with online training.

preprint, 2022, arXiv

Keller-Segel model with Logarithmic Interaction and nonlocal reaction term

We investigate the global existence and blow-up of solutions to the Keller-Segel model with nonlocal reaction term $u\left(M_0-\int_{\mathbb{R}^2} u \, dx\right)$ in dimension two. By introducing a transformation in terms of the total mass of the populations to deal with the lack of mass conservation, we show that the qualitative behavior of solutions is decided by a critical value $8π$ for the growth parameter $M_0$ and the initial mass $m_0$. For general solutions, if both $m_0$ and $M_0$ are less than $8π$, solutions exist globally in time by the energy inequality, whereas there are finite time blow-up solutions for $M_0>8π$ (including the case $m_0<8π$) with any initial data, and for $M_0<8π<m_0$ with small initial second moment. We also show infinite time blow-up for the critical case $M_0=8π$. Moreover, in the radial context, we show that if the initial data satisfy $u_0(r)<\frac{m_0}{M_0} \frac{8 λ}{(r^2+λ)^2}$ for some $λ>0$, then all the radially symmetric solutions vanish in $L_{loc}^1(\mathbb{R}^2)$ as $t \to \infty$. If the initial data satisfy $u_0(r)>\frac{m_0}{M_0} \frac{8 λ}{(r^2+λ)^2}$ for some $λ>0$, then there could exist a radially symmetric solution exhibiting mass concentration at the origin as $t \to \infty$.
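The lack of mass conservation mentioned in this abstract can be made explicit. As a sketch, assuming the diffusion and chemotaxis terms are in divergence form and that solutions decay fast enough for those terms to integrate to zero (neither is spelled out in the abstract), integrating the equation over the plane leaves only the reaction term, so the total mass $m(t) = \int u \, dx$ obeys a logistic equation:

```latex
\frac{dm}{dt} = m\,(M_0 - m), \qquad m(0) = m_0
\quad\Longrightarrow\quad
m(t) = \frac{M_0\, m_0}{m_0 + (M_0 - m_0)\, e^{-M_0 t}}
\;\xrightarrow[t \to \infty]{}\; M_0 .
```

Under these assumptions the mass moves monotonically from $m_0$ toward $M_0$ for as long as the solution exists, which is consistent with both quantities being compared against the threshold $8π$.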

preprint, 2022, arXiv

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Personalization of on-device speech recognition (ASR) has seen explosive growth in recent years, largely due to the increasing popularity of personal assistant features on mobile devices and smart home speakers. In this work, we present Personal VAD 2.0, a personalized voice activity detector that detects the voice activity of a target speaker, as part of a streaming on-device ASR system. Although previous proof-of-concept studies have validated the effectiveness of Personal VAD, there are still several critical challenges to address before this model can be used in production: first, the quality must be satisfactory in both enrollment and enrollment-less scenarios; second, it should operate in a streaming fashion; and finally, the model size should be small enough to fit a limited latency and CPU/Memory budget. To meet the multi-faceted requirements, we propose a series of novel designs: 1) advanced speaker embedding modulation methods; 2) a new training paradigm to generalize to enrollment-less conditions; 3) architecture and runtime optimizations for latency and resource restrictions. Extensive experiments on a realistic speech recognition system demonstrated the state-of-the-art performance of our proposed method.

preprint, 2022, arXiv

Speaker Diarization with LSTM

For many years, i-vector based audio embedding techniques were the dominant approach for speaker verification and speaker diarization applications. However, mirroring the rise of deep learning in various domains, neural network based audio embeddings, also known as d-vectors, have consistently demonstrated superior speaker verification performance. In this paper, we build on the success of d-vector based speaker verification systems to develop a new d-vector based approach to speaker diarization. Specifically, we combine LSTM-based d-vector audio embeddings with recent work in non-parametric clustering to obtain a state-of-the-art speaker diarization system. Our system is evaluated on three standard public datasets, suggesting that d-vector based diarization systems offer significant advantages over traditional i-vector based systems. We achieved a 12.0% diarization error rate on NIST SRE 2000 CALLHOME, while our model is trained with out-of-domain data from voice search logs.

preprint, 2022, arXiv

Structure-aware Editable Morphable Model for 3D Facial Detail Animation and Manipulation

Morphable models are essential for the statistical modeling of 3D faces. Previous works on morphable models mostly focus on large-scale facial geometry but ignore facial details. This paper augments morphable models in representing facial details by learning a Structure-aware Editable Morphable Model (SEMM). SEMM introduces a detail structure representation based on the distance field of wrinkle lines, jointly modeled with detail displacements to establish better correspondences and enable intuitive manipulation of wrinkle structure. Besides, SEMM introduces two transformation modules to translate expression blendshape weights and age values into changes in latent space, allowing effective semantic detail editing while maintaining identity. Extensive experiments demonstrate that the proposed model compactly represents facial details, outperforms previous methods in expression animation qualitatively and quantitatively, and achieves effective age editing and wrinkle line editing of facial details. Code and model are available at https://github.com/gerwang/facial-detail-manipulation.

preprint, 2022, arXiv

Turn-to-Diarize: Online Speaker Diarization Constrained by Transformer Transducer Speaker Turn Detection

In this paper, we present a novel speaker diarization system for streaming on-device applications. In this system, we use a transformer transducer to detect the speaker turns, represent each speaker turn by a speaker embedding, then cluster these embeddings with constraints from the detected speaker turns. Compared with conventional clustering-based diarization systems, our system largely reduces the computational cost of clustering due to the sparsity of speaker turns. Unlike other supervised speaker diarization systems which require annotations of time-stamped speaker labels for training, our system only requires including speaker turn tokens during the transcribing process, which largely reduces the human efforts involved in data collection.

preprint, 2021, arXiv

Dynamical transition of hydromagnetic convection in a rotating fluid layer

In this article, we study the stability and dynamic transition of an electrically conducting fluid in the presence of an external uniform horizontal magnetic field and rotation, based on a Boussinesq approximation model. By analyzing the spectrum of the linear part of the model and verifying the validity of the principle of exchange of stability, we take a hybrid approach combining theoretical analysis with numerical computation to study the transition from a simple real eigenvalue, a pair of complex conjugate eigenvalues and a real eigenvalue of multiplicity two, respectively. The center manifold reduction theory is applied to reduce the infinite dimensional system to the corresponding finite dimensional one together with one or several non-dimensional transition numbers that determine the dynamic transition types. Careful numerical computations are performed to determine these transition numbers as well as the related temporal and flow patterns. Our results indicate that both continuous and jump transitions can occur in certain parameter regions.

preprint, 2021, arXiv

Entity Structure Within and Throughout: Modeling Mention Dependencies for Document-Level Relation Extraction

Entities, as the essential elements in relation extraction tasks, exhibit certain structure. In this work, we formulate such structure as distinctive dependencies between mention pairs. We then propose SSAN, which incorporates these structural dependencies within the standard self-attention mechanism and throughout the overall encoding stage. Specifically, we design two alternative transformation modules inside each self-attention building block to produce attentive biases so as to adaptively regularize its attention flow. Our experiments demonstrate the usefulness of the proposed entity structure and the effectiveness of SSAN. It significantly outperforms competitive baselines, achieving new state-of-the-art results on three popular document-level relation extraction datasets. We further provide ablation and visualization to show how the entity structure guides the model for better relation extraction. Our code is publicly available.

preprint2020arXiv

A Comparative Study on Polyp Classification using Convolutional Neural Networks

Colorectal cancer is the third most common cancer diagnosed in both men and women in the United States. Most colorectal cancers start as a growth on the inner lining of the colon or rectum, called a polyp. Not all polyps are cancerous, but some can develop into cancer. Early detection and recognition of the type of polyp is critical to prevent cancer and change outcomes. However, visual classification of polyps is challenging due to the varying illumination conditions of endoscopy, variable texture and appearance, and overlapping morphology between polyps. More importantly, evaluation of polyp patterns by gastroenterologists is subjective, leading to poor agreement among observers. Deep convolutional neural networks have proven very successful in object classification across various object categories. In this work, we compare the performance of the state-of-the-art general object classification models for polyp classification. We trained a total of six CNN models end-to-end using a dataset of 157 video sequences composed of two types of polyps: hyperplastic and adenomatous. Our results demonstrate that the state-of-the-art CNN models can successfully classify polyps with an accuracy comparable to or better than that reported among gastroenterologists. The results of this study can guide future research in polyp classification.

preprint2020arXiv

ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech

Automatic speaker verification (ASV) is one of the most natural and convenient means of biometric person recognition. Unfortunately, just like all other biometric systems, ASV is vulnerable to spoofing, also referred to as "presentation attacks." These vulnerabilities are generally unacceptable and call for spoofing countermeasures or "presentation attack detection" systems. In addition to impersonation, ASV systems are vulnerable to replay, speech synthesis, and voice conversion attacks. The ASVspoof 2019 edition is the first to consider all three spoofing attack types within a single challenge. While they originate from the same source database and same underlying protocol, they are explored in two specific use case scenarios. Spoofing attacks within a logical access (LA) scenario are generated with the latest speech synthesis and voice conversion technologies, including state-of-the-art neural acoustic and waveform model techniques. Replay spoofing attacks within a physical access (PA) scenario are generated through carefully controlled simulations that support much more revealing analysis than possible previously. Also new to the 2019 edition is the use of the tandem detection cost function metric, which reflects the impact of spoofing and countermeasures on the reliability of a fixed ASV system. This paper describes the database design, protocol, spoofing attack implementations, and baseline ASV and countermeasure results. It also describes a human assessment on spoofed data in logical access. It was demonstrated that the spoofing data in the ASVspoof 2019 database have varied degrees of perceived quality and similarity to the target speakers, including spoofed data that cannot be differentiated from bona-fide utterances even by human subjects.

preprint2020arXiv

CoKE: Contextualized Knowledge Graph Embedding

Knowledge graph embedding, which projects symbolic entities and relations into continuous vector spaces, is gaining increasing attention. Previous methods allow a single static embedding for each entity or relation, ignoring their intrinsic contextual nature, i.e., entities and relations may appear in different graph contexts, and accordingly, exhibit different properties. This work presents Contextualized Knowledge Graph Embedding (CoKE), a novel paradigm that takes into account such contextual nature, and learns dynamic, flexible, and fully contextualized entity and relation embeddings. Two types of graph contexts are studied: edges and paths, both formulated as sequences of entities and relations. CoKE takes a sequence as input and uses a Transformer encoder to obtain contextualized representations. These representations are hence naturally adaptive to the input, capturing contextual meanings of entities and relations therein. Evaluation on a wide variety of public benchmarks verifies the superiority of CoKE in link prediction and path query answering. It performs consistently better than, or at least as well as, the current state of the art in almost every case, in particular offering an absolute improvement of 21.0% in H@10 on path query answering. Our code is available at https://github.com/PaddlePaddle/Research/tree/master/KG/CoKE.
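
As a reading aid for the H@10 figure quoted above, here is a minimal sketch of how a Hits@k score is commonly computed from the rank of the correct entity for each query. The function name `hits_at_k` and the sample ranks are illustrative assumptions, not taken from the CoKE code base.

```python
def hits_at_k(ranks, k=10):
    """Fraction of queries whose correct answer is ranked within the top k.

    `ranks` holds the 1-based rank of the correct entity for each query.
    This is the usual definition of Hits@k; the name and signature are
    hypothetical and not drawn from the CoKE repository.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up ranks for five queries: three of them fall inside the top 10.
ranks = [1, 3, 12, 7, 48]
score = hits_at_k(ranks, k=10)
```

An "absolute" improvement in this metric is a difference in percentage points between two such scores, rather than a relative gain.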

preprint2020arXiv

Fast and Accurate: Structure Coherence Component for Face Alignment

In this paper, we propose a fast and accurate coordinate regression method for face alignment. Unlike most existing facial landmark regression methods, which usually employ fully connected layers to convert feature maps into landmark coordinates, we present a structure coherence component to explicitly take the relation among facial landmarks into account. Due to the geometric structure of the human face, structure coherence between different facial parts provides important cues for effectively localizing facial landmarks. However, the dense connection in the fully connected layers overuses such coherence, making the important cues unable to be distinguished from all connections. Instead, our structure coherence component leverages a dynamic sparse graph structure to pass features among the most related landmarks. Furthermore, we propose a novel objective function, named Soft Wing loss, to improve the accuracy. Extensive experiments on three popular benchmarks, including WFLW, COFW and 300W, demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance with fast speed. Our approach is especially robust to challenging cases, resulting in an impressively low failure rate (0% and 2.88%) on the COFW and WFLW datasets.

preprint2020arXiv

Personal VAD: Speaker-Conditioned Voice Activity Detection

In this paper, we propose "personal VAD", a system to detect the voice activity of a target speaker at the frame level. This system is useful for gating the inputs to a streaming on-device speech recognition system, such that it only triggers for the target user, which helps reduce the computational cost and battery consumption, especially in scenarios where a keyword detector is not preferred. We achieve this by training a VAD-like neural network that is conditioned on the target speaker embedding or the speaker verification score. For each frame, personal VAD outputs the probabilities for three classes: non-speech, target speaker speech, and non-target speaker speech. Under our optimal setup, we are able to train a model with only 130K parameters that outperforms a baseline system where individually trained standard VAD and speaker recognition networks are combined to perform the same task.
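
The frame-level gating described above can be illustrated with a small sketch. It assumes the network emits three raw scores (logits) per frame in the class order given in the abstract, and that a frame is forwarded to the recognizer when the target-speaker probability reaches a threshold. The threshold value, function names, and sample logits are illustrative assumptions, not details from the paper.

```python
import math

# Class order follows the abstract; the index layout is an assumption.
CLASSES = ("non-speech", "target speaker speech", "non-target speaker speech")
TARGET = 1


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_frames(frame_logits, threshold=0.5):
    """Return one boolean per frame: True if the frame should be passed on.

    The 0.5 threshold is a hypothetical choice for illustration only.
    """
    return [softmax(logits)[TARGET] >= threshold for logits in frame_logits]


# Made-up logits for three frames: silence, target speaker, another speaker.
frame_logits = [
    [2.0, 0.1, 0.1],
    [0.1, 3.0, 0.2],
    [0.0, 0.3, 2.5],
]
decisions = gate_frames(frame_logits)
```

Only the second frame is forwarded here, which is the behavior that lets the downstream recognizer stay idle for silence and for other talkers.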

preprint2020arXiv

VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition

We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user, as part of a streaming speech recognition system. Delivering such a model presents numerous challenges: it should improve the performance when the input signal consists of overlapped speech, and must not hurt the speech recognition performance under all other acoustic conditions. Besides, this model must be tiny, fast, and perform inference in a streaming fashion, in order to have minimal impact on CPU, memory, battery and latency. We propose novel techniques to meet these multi-faceted requirements, including using a new asymmetric loss, and adopting adaptive runtime suppression strength. We also show that such a model can be quantized as an 8-bit integer model and run in real time.

preprint2014arXiv

(Sr3La2O5)(Zn1-xMnx)2As2: A Bulk Form Diluted Magnetic Semiconductor isostructural to the "32522" Fe-based Superconductors

A new diluted magnetic semiconductor system, (Sr3La2O5)(Zn1-xMnx)2As2, has been synthesized and characterized. 10% Mn substitution for Zn in bulk form (Sr3La2O5)Zn2As2 results in ferromagnetic ordering below the Curie temperature, TC ~ 40 K. (Sr3La2O5)(Zn1-xMnx)2As2 has a layered crystal structure identical to that of 32522-type Fe-based superconductors, and represents the fifth DMS family that has a direct counterpart among the FeAs high temperature superconductor families.

preprint2014arXiv

31P NMR Investigation of the Superconductor LiFeP (Tc = 5 K)

We investigate the static and dynamic spin susceptibility of the 111 type Fe-based superconductor LiFeP with Tc ~ 5 K through the measurement of the Knight shift 31K and the spin-lattice relaxation rate 1/T1 at the 31P site by nuclear magnetic resonance. The constant 31K, the small magnitude of 1/T1T, and the resistivity rho ~ T^2 all point to weak spin correlations in LiFeP. 1/T1T displays a small enhancement toward Tc, indicating that the superconductivity is intimately correlated with the antiferromagnetic spin fluctuations.

preprint2014arXiv

A predicable condition for boundary layer separation of 2-D incompressible fluid flows

In this paper, the solutions of the Navier-Stokes equations with Dirichlet boundary conditions governing 2-D incompressible fluid flows are considered. A condition for boundary layer separation, which is determined by initial values and external forces, is obtained. More importantly, the condition can predict directly when and where boundary layer separation will occur. The main technical tool is the geometric theory of incompressible flows developed by T. Ma and S. Wang.

preprint2014arXiv

Ba(Zn1-2xMnxCox)2As2: A Bulk Form Diluted Magnetic Semiconductor with n-type Carriers

We report the synthesis and characterization of bulk form diluted magnetic semiconductors Ba(Zn1-2xMnxCox)2As2 (0 <= x <= 0.15) with a crystal structure identical to that of 122-type Fe-based superconductors. Mn and Co co-doping into the parent compound BaZn2As2 results in ferromagnetic ordering below TC ~ 80 K. Hall effect measurements indicate that the carriers are n-type with a density of ~10^17/cm3. The common crystal structure and excellent lattice matching between the p-type ferromagnetic (Ba1-yKy)(Zn1-xMnx)2As2, the n-type ferromagnetic Ba(Zn1-2xMnxCox)2As2, the antiferromagnetic BaMn2As2 and the superconducting Ba(Fe1-xCox)2As2 systems make it possible to make various junctions between these systems through the As layer.

preprint2014arXiv

Kernel Principal Component Analysis and its Applications in Face Recognition and Active Shape Models

Principal component analysis (PCA) is a popular tool for linear dimensionality reduction and feature extraction. Kernel PCA is the nonlinear form of PCA, which better exploits the complicated spatial structure of high-dimensional features. In this paper, we first review the basic ideas of PCA and kernel PCA. Then we focus on the reconstruction of pre-images for kernel PCA. We also give an introduction on how PCA is used in active shape models (ASMs), and discuss how kernel PCA can be applied to improve traditional ASMs. Then we show some experimental results to compare the performance of kernel PCA and standard PCA for classification problems. We also implement the kernel PCA-based ASMs, and use it to construct human face models.

preprint2014arXiv

Li1.1(Zn1-xCrx)As: Cr doped I-II-V Diluted Magnetic Semiconductors in Bulk Form

We report the synthesis and characterization of bulk form I-II-V diluted magnetic semiconductors Li1.1(Zn1-xCrx)As (x = 0.03, 0.05, 0.10, 0.15) with a cubic crystal structure identical to that of III-V GaAs and II-VI zinc-blende ZnSe. With p-type carriers created by excess Li, 10% Cr substitution for Zn results in ferromagnetic ordering below TC ~ 218 K. Li(Zn,Cr)As represents another magnetic semiconducting system with the advantage of decoupling carriers and spins, where carriers are created by adding extra Li and spins are introduced by Cr substitution for Zn.

preprint2014arXiv

MuSR Investigation and Suppression of TC by overdoped Li in Diluted Ferromagnetic Semiconductor Li1+y(Zn1-xMnx)P

We use muon spin relaxation (muSR) to investigate the magnetic properties of a bulk form diluted ferromagnetic semiconductor (DFS) Li1.15(Zn0.9Mn0.1)P with T_C ~ 22 K. MuSR results confirm the gradual development of ferromagnetic ordering below T_C with a nearly 100% magnetic ordered volume. Despite its low carrier density, the relation between static internal field and Curie temperature observed for Li(Zn,Mn)P is consistent with the trend found in (Ga,Mn)As and other bulk DFSs, indicating these systems share a common mechanism for the ferromagnetic exchange interaction. Li1+y(Zn1-xMnx)P has the advantage of decoupled carrier and spin doping, where Mn2+ substitution for Zn2+ introduces spins and Li+ off-stoichiometry provides carriers. This advantage enables us to investigate the influence of overdoped Li on the ferromagnetic ordered state. Overdoping Li suppresses both T_C and saturation moments for a certain amount of spins, which indicates that more carriers are detrimental to the ferromagnetic exchange interaction, and that a delicate balance between charge and spin densities is required to achieve the highest T_C.

preprint2013arXiv

(La1-xBax)(Zn1-xMnx)AsO: A Two Dimensional "1111" Diluted Magnetic Semiconductor in Bulk Form

We report the synthesis and characterization of a bulk diluted magnetic semiconductor (La1-xBax)(Zn1-xMnx)AsO (0 <= x <= 0.2) with a layered crystal structure identical to that of the "1111" FeAs superconductors. No ferromagnetic order occurs for (Zn,Mn) substitution in the parent compound LaZnAsO without charge doping. Together with carrier doping via (La,Ba) substitution, a small amount of Mn substituting for Zn results in ferromagnetic order with TC up to ~40 K, although the system remains semiconducting. Muon spin relaxation measurements confirm the development of ferromagnetic order in the entire volume, with the relationship between the internal field and TC consistent with the trend found in (Ga,Mn)As, the "111" Li(Zn,Mn)As, and the "122" (Ba,K)(Zn,Mn)2As2 systems.

preprint2013arXiv

Feature Learning by Multidimensional Scaling and its Applications in Object Recognition

We present the MDS feature learning framework, in which multidimensional scaling (MDS) is applied on high-level pairwise image distances to learn fixed-length vector representations of images. The aspects of the images that are captured by the learned features, which we call MDS features, completely depend on what kind of image distance measurement is employed. With properly selected semantics-sensitive image distances, the MDS features provide rich semantic information about the images that is not captured by other feature extraction techniques. In our work, we introduce the iterated Levenberg-Marquardt algorithm for solving MDS, and study the MDS feature learning with IMage Euclidean Distance (IMED) and Spatial Pyramid Matching (SPM) distance. We present experiments on both synthetic data and real images --- the publicly accessible UIUC car image dataset. The MDS features based on SPM distance achieve exceptional performance for the car recognition task.

preprint2013arXiv

The synthesis and characterization of 1111-type diluted magnetic semiconductors (La1-xSrx)(Zn1-xTMx)AsO (TM = Mn, Fe, Co)

The doping effect of Sr and transition metals Mn, Fe, Co into the direct-gap semiconductor LaZnAsO has been investigated. Our results indicate that the single phase ZrCuSiAs-type tetragonal crystal structure is preserved in (La1-xSrx)(Zn1-xTMx)AsO (TM = Mn, Fe, Co) with the doping level up to x = 0.1. While the system remains semiconducting, doping with Sr and Mn results in ferromagnetic order with TC ~ 30K, and doping with Sr and Fe results in a spin glass like state below ~6K with a saturation moment of ~0.02 muB/Fe, an order of magnitude smaller than the ~0.4 muB/Mn of Sr and Mn doped samples. The same type of magnetic state is observed neither for (Zn,Fe) substitution without carrier doping, nor for Sr and Co doped specimens.

preprint2012arXiv

Charge Multiplicity Asymmetry Correlation Study Searching for Local Parity Violation at RHIC for STAR

It has been suggested that local parity violation in QCD would lead to charge separation of quarks by the Chiral Magnetic Effect (CME) in heavy ion collisions. Charge separation could yield a dynamical charge multiplicity asymmetry with respect to the reaction plane. In this talk, we report results on charge multiplicity asymmetry correlations in $\sqrt{s_{NN}}$ = 200 GeV Au+Au and d+Au collisions by the STAR experiment, as well as from the RHIC beam energy scan. We found that the correlation results could not be explained by CME alone. To gain further insights, we study our results as a function of the measured azimuthal angle range as well as the event-by-event anisotropy parameter $v_2$. The results indicate that the charge separation effect appears to be in-plane rather than out-of-plane. We found that the charge separation effect is proportional to the event-by-event $v_2$ and consistent with zero in events with $v_2 \approx 0$. Our studies suggest that the charge separation effect, within the statistical error, may be a net effect of event anisotropy and correlated particle production. A possible upper limit on the CME imposed by our data will be discussed.

preprint2012arXiv

Fully integrated InGaAs/InP single-photon detector module with gigahertz sine wave gating

InGaAs/InP single-photon avalanche diodes (SPADs) working in the regime of GHz clock rates are crucial components for high-speed quantum key distribution (QKD). We have developed for the first time a compact, stable and user-friendly tabletop InGaAs/InP single-photon detector system operating at a 1.25 GHz gate rate that fully integrates functions for controlling and optimizing SPAD performance. We characterize the key parameters of the detector system and test the long-term stability of the system over 75 hours of continuous operation. The detector system can substantially enhance QKD performance, and our present work paves the way for practical high-speed QKD applications.

preprint2012arXiv

GMM-Based Hidden Markov Random Field for Color Image and 3D Volume Segmentation

In this project, we first study the Gaussian-based hidden Markov random field (HMRF) model and its expectation-maximization (EM) algorithm. Then we generalize it to Gaussian mixture model-based hidden Markov random field. The algorithm is implemented in MATLAB. We also apply this algorithm to color image segmentation problems and 3D volume segmentation problems.

preprint2012arXiv

HMRF-EM-image: Implementation of the Hidden Markov Random Field Model and its Expectation-Maximization Algorithm

In this project, we study the hidden Markov random field (HMRF) model and its expectation-maximization (EM) algorithm. We implement a MATLAB toolbox named HMRF-EM-image for 2D image segmentation using the HMRF-EM framework. This toolbox also implements edge-prior-preserving image segmentation, and can be easily reconfigured for other problems, such as 3D image segmentation.

preprint2010arXiv

Identification of flow-background to subtract in jet-like azimuthal correlation

We derive an analytical form for the flow-background to jet-like azimuthal correlation in a cluster approach. We argue that the elliptic flow parameter to use in the jet-correlation background is that from the two-particle method excluding non-flow correlation unrelated to the reaction plane, but including cross-terms between cluster correlation and cluster flow. We verify our result with Monte Carlo simulations. We discuss implications of our finding in the context of jet-like correlations from STAR and PHENIX.

preprint2010arXiv

Non-flow correlations in a cluster model

We derive analytical forms for nonflow contributions from cluster correlation to the measurement of two-particle elliptic flow (v2{2}). We estimate the nonflow contribution from rho->pi+pi decays and find it is negative but not a major contributor to the nonflow effect in v2{2}. We also estimate the nonflow contribution from the recent STAR measurement of two-particle angular correlations.
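
For readers unfamiliar with the v2{2} notation, the sketch below computes the single-event two-particle average of cos 2(phi_i - phi_j) over all distinct pairs, using the standard Q-vector identity. It assumes the usual definition in which v2{2} squared is this quantity averaged over events, so that any pair correlation unrelated to the reaction plane (nonflow) adds directly to it. The abstract does not spell out this definition, and the function name is a hypothetical choice.

```python
import cmath
import math


def two_particle_cos2(phis):
    """Average of cos(2 * (phi_i - phi_j)) over all distinct pairs in one event.

    Uses the identity |Q|^2 = M + sum over i != j of cos(2 * (phi_i - phi_j)),
    where Q = sum_j exp(2i * phi_j) and M is the number of particles.
    """
    m = len(phis)
    if m < 2:
        raise ValueError("need at least two particles to form a pair")
    q = sum(cmath.exp(2j * phi) for phi in phis)
    return (abs(q) ** 2 - m) / (m * (m - 1))


# Two particles a quarter turn apart give cos(pi) = -1.
anti = two_particle_cos2([0.0, math.pi / 2])

# Particles at identical angles are maximally correlated.
aligned = two_particle_cos2([0.3, 0.3, 0.3])
```

The Q-vector form avoids an explicit double loop over pairs, which matters for high-multiplicity events.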

preprint2009arXiv

Non-flow, and what flow to subtract in jet-correlation

We derive analytical forms for non-flow contributions from cluster correlation to the two-particle elliptic flow (v2{2}) measure. We also derive an analytical form for the jet-correlation flow-background with the same cluster approach. We argue that the elliptic flow v2 parameter to be used in the jet-correlation background is that from the two-particle method excluding non-flow correlations unrelated to the reaction plane, but including cross-terms between cluster correlation and cluster flow. We verify our result with Monte Carlo simulations. We discuss how one may obtain the v2 parameter for the jet-correlation background experimentally.

preprint2008arXiv

First Result of Net-Charge Jet-Correlations from STAR

We present results on the azimuthal correlation of net-charge with high $p_T$ trigger particles. It is found that the net-charge correlation shape is similar to that of total-charge. On the near-side, the net-charge and total-charge $p_T$ spectra have similar shapes and both are harder than the inclusives. On the away-side, the correlated spectra are not much harder than the inclusives, and the net-charge/total-charge ratio increases with $p_T$ and is similar to the inclusive ratio.
