Researcher profile

Long Ma

Long Ma contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
18 works
0 followers
13 topics
4 close collaborators

Published work

18 published items

preprint | 2026 | arXiv

TopoMamba: Topology-Aware Scanning and Fusion for Segmenting Heterogeneous Medical Visual Media

Visual state-space models (SSMs) have shown strong potential for medical image segmentation, yet their effectiveness is often limited by two practical issues: axis-biased scan ordering weakens the modeling of oblique and curved structures, and naive multi-branch fusion tends to amplify redundant responses. We present TopoMamba, a topology-aware scan-and-fuse framework for segmenting heterogeneous medical visual media. The method combines a diagonal/anti-diagonal TopoA-Scan branch with the standard Cross-Scan branch to provide complementary structural priors, and introduces ScanCache, a device-aware caching mechanism that amortizes explicit scan-index construction across recurring resolutions. To fuse heterogeneous scan features efficiently, we further propose a lightweight HSIC Gate that regulates branch interaction using a dependence-aware scalar gating rule. We also instantiate a volumetric TopoMamba-3D for practical 3D clinical segmentation. Experiments on Synapse CT, ISIC 2017 dermoscopy, and CVC-ClinicDB endoscopy show that TopoMamba consistently improves segmentation quality over strong CNN, Transformer, and SSM baselines, with particularly clear gains on thin or curved targets such as the pancreas and gallbladder, while maintaining favorable deployment efficiency under dynamic input resolutions. These results suggest that topology-aware scan ordering and lightweight dependence-aware fusion form an effective and practical design for medical multimedia segmentation. The code will be made publicly available.
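To make the scan-ordering idea concrete, here is a minimal sketch, not the authors' implementation, of building a diagonal visiting order and its mirrored counterpart over an H x W grid, then memoizing the result per resolution. That memoization is loosely the role the abstract assigns to ScanCache. The function name `diagonal_scan_order` and the use of `functools.lru_cache` are illustrative assumptions; the actual TopoA-Scan ordering and cache design may differ.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=None)  # assumption: one cached order per (height, width, anti)
def diagonal_scan_order(height: int, width: int, anti: bool = False) -> Tuple[int, ...]:
    """Return row-major flat indices of an H x W grid, visited diagonal by diagonal.

    With anti=False, cells sharing the same r + c are visited together.
    With anti=True, columns are mirrored, which yields the opposite diagonal direction.
    """
    order = []
    for d in range(height + width - 1):
        # Rows that intersect diagonal d while the column stays inside the grid.
        for r in range(max(0, d - width + 1), min(height, d + 1)):
            c = d - r
            if anti:
                c = width - 1 - c
            order.append(r * width + c)
    return tuple(order)


if __name__ == "__main__":
    print(diagonal_scan_order(2, 3))             # (0, 1, 3, 2, 4, 5)
    print(diagonal_scan_order(2, 3, anti=True))  # (2, 1, 5, 0, 4, 3)
```

Repeated calls with the same resolution return the cached tuple instead of rebuilding the index, which is the amortization effect the abstract describes for recurring input sizes.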

preprint | 2023 | arXiv

From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion

With the rapid progression of deep learning technologies, multi-modality image fusion has become increasingly prevalent in object detection tasks. Despite its popularity, the inherent disparities in how different sources depict scene content make fusion a challenging problem. Current fusion methodologies identify shared characteristics between the two modalities and integrate them within this shared domain using either iterative optimization or deep learning architectures, which often neglect the intricate semantic relationships between modalities, resulting in a superficial understanding of inter-modal connections and, consequently, suboptimal fusion outcomes. To address this, we introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images. This method capitalizes on the complementary characteristics of diverse modalities, bolstering both the accuracy and robustness of object detection. The codebook is utilized to enhance a streamlined and concise depiction of the fused intra- and inter-domain dynamics, fine-tuned for optimal performance in detection tasks. We present a bilevel optimization strategy that establishes a nexus between the joint problem of fusion and detection, optimizing both processes concurrently. Furthermore, we introduce the first dataset of paired infrared and visible images accompanied by text prompts, paving the way for future research. Extensive experiments on several datasets demonstrate that our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.

preprint | 2022 | arXiv

A practical framework for multi-domain speech recognition and an instance sampling method to neural language modeling

Automatic speech recognition (ASR) systems used on smart phones or vehicles are usually required to process speech queries from very different domains. In such situations, a vanilla ASR system usually fails to perform well on every domain. This paper proposes a multi-domain ASR framework for Tencent Map, a navigation app used on smart phones and in-vehicle infotainment systems. The proposed framework consists of three core parts: a basic ASR module to generate n-best lists of a speech query, a text classification module to determine which domain the speech query belongs to, and a reranking module to rescore n-best lists using domain-specific language models. In addition, an instance-sampling-based method for training neural network language models (NNLMs) is proposed to address the data imbalance problem in multi-domain ASR. In experiments, the proposed framework was evaluated on the navigation and music domains, since navigating and playing music are two main features of Tencent Map. Compared to a general ASR system, the proposed framework achieves a relative 13% to 22% character error rate reduction on several test sets collected from Tencent Map and our in-car voice assistant.
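The three-part pipeline in this abstract (n-best generation, domain classification, domain-specific rescoring) can be sketched as follows. This is a toy illustration under stated assumptions, not the Tencent Map system: `classify_domain` is a hypothetical keyword rule standing in for the text classifier, the language models are plain callables returning a log score, and `lm_weight` is an assumed interpolation weight.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[str, float]  # (text, first-pass ASR log score)


def classify_domain(text: str) -> str:
    """Toy stand-in for the text classification module (keyword rule, an assumption)."""
    return "music" if "play" in text or "song" in text else "navigation"


def rerank(nbest: List[Hypothesis],
           domain_lms: Dict[str, Callable[[str], float]],
           lm_weight: float = 0.5) -> str:
    """Return the hypothesis with the best combined ASR + domain LM score."""
    # Classify on the top first-pass hypothesis (a simplifying assumption).
    top_text = max(nbest, key=lambda h: h[1])[0]
    lm = domain_lms[classify_domain(top_text)]
    # Rescore every hypothesis with the LM of the predicted domain.
    return max(nbest, key=lambda h: h[1] + lm_weight * lm(h[0]))[0]


if __name__ == "__main__":
    lms = {
        "music": lambda t: 0.0 if t.startswith("play") else -2.0,
        "navigation": lambda t: 0.0,
    }
    nbest = [("pay the song", -0.9), ("play the song", -1.0)]
    print(rerank(nbest, lms))  # play the song
```

In the example, the first-pass winner "pay the song" is overturned because the music-domain language model penalizes it more than the acoustic score favors it.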

preprint | 2022 | arXiv

CaTT-KWS: A Multi-stage Customized Keyword Spotting Framework based on Cascaded Transducer-Transformer

Customized keyword spotting (KWS) has great potential to be deployed on edge devices to achieve hands-free user experience. However, in real applications, false alarms (FAs) become a serious problem when spotting dozens or even hundreds of keywords, which drastically affects user experience. To solve this problem, in this paper, we leverage the recent advances in transducer and transformer based acoustic models and propose a new multi-stage customized KWS framework named Cascaded Transducer-Transformer KWS (CaTT-KWS), which includes a transducer based keyword detector, a forced alignment module based on a frame-level phone predictor, and a transformer based decoder. Specifically, the streaming transducer module is used to spot keyword candidates in the audio stream. Then forced alignment is implemented using the phone posteriors predicted by the phone predictor to finish the first stage keyword verification and refine the time boundaries of the keyword. Finally, the transformer decoder further verifies the triggered keyword. Our proposed CaTT-KWS framework reduces the FA rate effectively without obviously hurting keyword recognition accuracy. Specifically, it achieves 0.13 FA per hour on a challenging dataset, an over 90% relative reduction in FA compared to the transducer based detection model, while keyword recognition accuracy drops by less than 2%.

preprint | 2022 | arXiv

Censer: Curriculum Semi-supervised Learning for Speech Recognition Based on Self-supervised Pre-training

Recent studies have shown that the benefits provided by self-supervised pre-training and self-training (pseudo-labeling) are complementary. Semi-supervised fine-tuning strategies under the pre-training framework, however, remain insufficiently studied. Besides, modern semi-supervised speech recognition algorithms either treat unlabeled data indiscriminately or filter out noisy samples with a confidence threshold. The dissimilarities among different unlabeled data are often ignored. In this paper, we propose Censer, a semi-supervised speech recognition algorithm based on self-supervised pre-training to maximize the utilization of unlabeled data. The pre-training stage of Censer adopts wav2vec2.0 and the fine-tuning stage employs an improved semi-supervised learning algorithm from slimIPL, which leverages unlabeled data progressively according to their pseudo labels' qualities. We also incorporate a temporal pseudo label pool and an exponential moving average to control the pseudo labels' update frequency and to avoid model divergence. Experimental results on the Libri-Light and LibriSpeech datasets show that our proposed method achieves better performance compared to existing approaches while being more unified.
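Two ingredients named in this abstract, an exponential moving average of model weights and progressive use of unlabeled data ordered by pseudo-label quality, can be illustrated with a short sketch. It is a simplified reading, not the Censer algorithm: the per-utterance quality score, the stage schedule, and both function names are assumptions introduced for illustration.

```python
from typing import List, Tuple


def ema_update(ema: List[float], current: List[float], decay: float = 0.999) -> List[float]:
    """Blend current weights into the running average: ema = decay*ema + (1-decay)*current."""
    return [decay * e + (1.0 - decay) * c for e, c in zip(ema, current)]


def curriculum_subset(pool: List[Tuple[str, float]], stage: int, num_stages: int) -> List[str]:
    """Return utterance ids usable at a given stage, best pseudo-label quality first.

    Each pool item is (utterance_id, quality); the quality score is hypothetical.
    Later stages admit progressively more (and lower-quality) utterances.
    """
    ranked = sorted(pool, key=lambda item: item[1], reverse=True)
    cutoff = len(ranked) * (stage + 1) // num_stages
    return [utt for utt, _ in ranked[:cutoff]]


if __name__ == "__main__":
    pool = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
    print(curriculum_subset(pool, stage=0, num_stages=2))  # ['b', 'd']
    print(curriculum_subset(pool, stage=1, num_stages=2))  # ['b', 'd', 'c', 'a']
```

A slowly moving averaged model is a common way to keep pseudo labels stable between updates, which matches the abstract's stated goal of avoiding model divergence.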

preprint | 2022 | arXiv

Conversational Speech Recognition By Learning Conversation-level Characteristics

Conversational automatic speech recognition (ASR) is the task of recognizing conversational speech that includes multiple speakers. Unlike sentence-level ASR, conversational ASR can naturally take advantage of specific characteristics of conversation, such as role preference and topical coherence. This paper proposes a conversational ASR model which explicitly learns conversation-level characteristics under the prevalent end-to-end neural framework. The highlights of the proposed model are twofold. First, a latent variational module (LVM) is attached to a conformer-based encoder-decoder ASR backbone to learn role preference and topical coherence. Second, a topic model is specifically adopted to bias the outputs of the decoder to words in the predicted topics. Experiments on two Mandarin conversational ASR tasks show that the proposed model achieves a maximum 12% relative character error rate (CER) reduction.

preprint | 2022 | arXiv

Effective Charged Exterior Surfaces for Enhanced Ionic Diffusion through Nanopores under Salt Gradients

High-performance osmotic energy conversion requires both large ionic throughput and high ionic selectivity, which can be significantly promoted by exterior surface charges simultaneously, especially for short nanopores. Here, we investigate the enhancement of ionic diffusion by charged exterior surfaces under various conditions and explore corresponding effective charged areas. From simulations, ionic diffusion is promoted more significantly by exterior surface charges through nanopores with a shorter length, wider diameter, and larger surface charge density, or under higher salt gradients. Effective widths of the charged ring regions near nanopores are inversely proportional to the pore length and linearly dependent on the pore diameter, salt gradient, and surface charge density. Due to the important role of effective charged areas in the propagation of ionic diffusion through single nanopores to cases with porous membranes, our results may provide useful guidance to the design and fabrication of porous membranes for practical high-performance osmotic energy harvesting.

preprint | 2022 | arXiv

Improving CTC-based speech recognition via knowledge transferring from pre-trained language models

Recently, end-to-end automatic speech recognition models based on connectionist temporal classification (CTC) have achieved impressive results, especially when fine-tuned from wav2vec2.0 models. Due to the conditional independence assumption, CTC-based models are always weaker than attention-based encoder-decoder models and require the assistance of external language models (LMs). To solve this issue, we propose two knowledge transferring methods that leverage pre-trained LMs, such as BERT and GPT2, to improve CTC-based models. The first method is based on representation learning, in which the CTC-based models use the representation produced by BERT as an auxiliary learning target. The second method is based on joint classification learning, which combines GPT2 for text modeling with a hybrid CTC/attention architecture. Experiments on the AISHELL-1 corpus yield a character error rate (CER) of 4.2% on the test set. Compared to the vanilla CTC-based models fine-tuned from the wav2vec2.0 models, our knowledge transferring method achieves a 16.1% relative CER reduction without external LMs.

preprint | 2022 | arXiv

Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR

Leveraging context information is an intuitive idea to improve performance on conversational automatic speech recognition (ASR). Previous works usually adopt recognized hypotheses of historical utterances as preceding context, which may bias the current recognized hypothesis due to the inevitable historical recognition errors. To avoid this problem, we propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech. Specifically, it consists of two modal-related encoders, extracting high-level latent features from speech and the corresponding text, and a cross-modal encoder, which aims to learn the correlation between speech and text. We randomly mask some input tokens and input sequences of each modality. Then a token-missing or modal-missing prediction with a modal-level CTC loss on the cross-modal encoder is performed. Thus, the model captures not only the bi-directional context dependencies in a specific modality but also relationships between different modalities. Then, during the training of the conversational ASR system, the extractor is frozen to extract the textual representation of preceding speech, and this representation is used as context fed to the ASR decoder through an attention mechanism. The effectiveness of the proposed approach is validated on several Mandarin conversation corpora, and a character error rate (CER) reduction of up to 16% is achieved on the MagicData dataset.

preprint | 2022 | arXiv

Toward Fast, Flexible, and Robust Low-Light Image Enhancement

Existing low-light image enhancement techniques mostly struggle to balance visual quality and computational efficiency, and they commonly fail in unknown complex scenarios. In this paper, we develop a new Self-Calibrated Illumination (SCI) learning framework for fast, flexible, and robust brightening of images in real-world low-light scenarios. To be specific, we establish a cascaded illumination learning process with weight sharing to handle this task. Considering the computational burden of the cascaded pattern, we construct a self-calibrated module which realizes the convergence between the results of each stage, so that only a single basic block is needed for inference (a gain not exploited in previous works), which drastically reduces computation cost. We then define an unsupervised training loss to improve the model's ability to adapt to general scenes. Further, we comprehensively explore SCI's inherent properties (lacking in existing works), including operation-insensitive adaptability (stable performance under different simple operation settings) and model-irrelevant generality (it can be applied to existing illumination-based works to improve performance). Finally, extensive experiments and ablation studies indicate our superiority in both quality and efficiency. Applications to low-light face detection and nighttime semantic segmentation reveal the latent practical value of SCI. The source code is available at https://github.com/vis-opt-group/SCI.

preprint | 2021 | arXiv

High-Performance Nanofluidic Osmotic Power Generation Enabled by Exterior Surface Charges under the Natural Salt Gradient

High-performance osmotic energy conversion (OEC) requires both high ionic selectivity and permeability in nanopores. Here, through systematic exploration of the influence of individual charged nanopore surfaces on OEC performance, we find that the charged exterior surface on the low-concentration side (surface_L) is essential to achieve high-performance osmotic power generation, which can significantly improve the ionic selectivity and permeability simultaneously. Detailed investigation of ionic transport indicates that electric double layers near charged surfaces provide high-speed passages for counterions. The charged surface_L enhances cation diffusion through enlarging the effective diffusive area, and inhibits anion transport by electrostatic repulsion. Different areas of charged exterior surfaces have been considered to mimic membranes with different porosities in practical applications. Through adjusting the width of the charged ring region on the surface_L, electric power in single nanopores increases from 0.3 to 3.4 pW with a plateau at the width of ~200 nm. The power density increases from 4200 to 4900 W/m$^2$ and then decreases monotonically, reaching the commercial benchmark at a charged width of ~480 nm. Meanwhile, the energy conversion efficiency can be raised from 4% to 26%. Our results provide a useful guide for the design of nanoporous membranes for high-performance osmotic energy harvesting.

preprint | 2021 | arXiv

Parton collisional effect on the conversion of geometry eccentricities into momentum anisotropies in relativistic heavy-ion collisions

We explore parton collisional effects on the conversion of geometry eccentricities into azimuthal anisotropies in Pb+Pb collisions at $\sqrt{s_{NN}}$ = 5.02 TeV using a multi-phase transport model. The initial eccentricity $\varepsilon_{n}$ (n = 2,3) and flow harmonics $v_{n}$ (n = 2,3) are investigated as a function of the number of parton collisions ($N_{coll}$) during the source evolution of the partonic phase. It is found that partonic collisions generate elliptic flow $v_{2}$ and triangular flow $v_{3}$ in Pb+Pb collisions. On the other hand, partonic collisions also result in an evolution of the eccentricity of geometry. The collisional effect on the flow conversion efficiency is therefore studied. We find that partons with larger $N_{coll}$ show a lower flow conversion efficiency, which reflects differential behavior with respect to $N_{coll}$. This provides additional insight into the dynamics of the space-momentum transformation during the QGP evolution from a transport model point of view.

preprint | 2021 | arXiv

Significantly Enhanced Performance of Nanofluidic Osmotic Power Generation by Slipping Surfaces of Nanopores

High-performance osmotic energy conversion (OEC) with a perm-selective porous membrane requires both high ionic selectivity and permeability simultaneously. Here, hydrodynamic slip is considered on surfaces of nanopores to break the tradeoff between ionic selectivity and permeability, because it decreases the viscous friction at solid-liquid interfaces, which can promote ionic diffusion during OEC. Using simulations, we investigate the influence of individual slipping surfaces on the OEC performance, i.e. the slipping inner surface (surface_inner) and exterior surfaces on the low- and high-concentration sides (surface_L and surface_H). Results show that the slipping surface_L is crucial for high-performance OEC. For nanopores with various lengths, the slipping surface_L simultaneously increases both ionic permeability and selectivity of nanopores, which results in both significantly enhanced electric power and energy conversion efficiency. For nanopores longer than 30 nm, however, the slipping surface_inner plays a dominant role in the increase of electric power, which induces a considerable decrease in energy conversion efficiency due to enhanced transport of both cations and anions. Considering the difficulty of hydrodynamic slip modification to the surface_inner of nanopores, surface modification of the surface_L may be a better choice to achieve high-performance OEC. Our results provide feasible guidance to the design of porous membranes for high-performance osmotic energy harvesting.

preprint | 2021 | arXiv

Tiny Transducer: A Highly-efficient Speech Recognition Model on Edge Devices

This paper proposes an extremely lightweight phone-based transducer model with a tiny decoding graph on edge devices. First, a phone synchronous decoding (PSD) algorithm based on blank label skipping is used to speed up the transducer decoding process. Then, to decrease the deletion errors introduced by the high blank score, a blank label deweighting approach is proposed. To reduce parameters and computation, deep feedforward sequential memory network (DFSMN) layers are used in the transducer encoder, and a CNN-based stateless predictor is adopted. SVD technology compresses the model further. A WFST-based decoding graph takes the context-independent (CI) phone posteriors as input and allows us to flexibly bias user-specific information. Finally, with only 0.9M parameters after SVD, our system gives a relative 9.1% to 20.5% improvement compared with a bigger conventional hybrid system on edge devices.
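The blank label skipping behind phone synchronous decoding can be pictured with a small sketch: frames whose blank posterior is very high are dropped before the decoding graph is searched, so the search only visits the remaining frames. This is an illustrative simplification, not the paper's algorithm. The fixed `threshold` rule, the `blank_id` position, and the function name are assumptions, and the blank deweighting step mentioned in the abstract is not modeled here.

```python
from typing import List, Sequence


def frames_to_decode(posteriors: Sequence[Sequence[float]],
                     blank_id: int = 0,
                     threshold: float = 0.95) -> List[int]:
    """Return indices of frames kept for decoding.

    A frame is skipped when its blank posterior reaches the threshold
    (assumed rule); all other frames are passed on to the decoder.
    """
    return [t for t, frame in enumerate(posteriors) if frame[blank_id] < threshold]


if __name__ == "__main__":
    # Each row: [P(blank), P(phone_1)] for one frame (toy values).
    posteriors = [[0.99, 0.01], [0.20, 0.80], [0.97, 0.03], [0.50, 0.50]]
    print(frames_to_decode(posteriors))  # [1, 3]
```

Because transducer outputs are typically dominated by blank frames, skipping them shrinks the number of search steps, which is where the decoding speed-up comes from.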

preprint | 2020 | arXiv

Generation of the Squeezed State with an Arbitrary Complex Amplitude Distribution

The squeezed state is important in quantum metrology and quantum information. The most effective generation tool known is the optical parametric oscillator (OPO). Currently, only the squeezed states of lower-order spatial modes can be generated by an OPO. However, the squeezed states of higher-order complex spatial modes are more useful for applications such as quantum metrology, quantum imaging and quantum information. A major challenge for future applications is efficient generation. Here, we use cascaded phase-only spatial light modulators to modulate the amplitude and phase of the incident fundamental mode squeezed state. This efficiently generates a series of squeezed higher-order Hermite-Gauss modes and a squeezed arbitrary complex amplitude distributed mode. The method may yield new applications in biophotonics, quantum metrology and quantum information processing.

preprint | 2020 | arXiv

Modulation of Ionic Current Rectification in Ultra-Short Conical Nanopores

Nanopores that exhibit ionic current rectification (ICR) behave like diodes, such that they transport ions more efficiently in one direction than the other. Conical nanopores have been shown to rectify ionic current, but only those at least 500 nm in length exhibit significant ICR. Here, through the finite element method, we show how ICR of conical nanopores with length below 200 nm can be tuned by controlling individual charged surfaces, i.e. the inner pore surface (surface_inner) and the exterior pore surfaces on the tip and base sides (surface_tip and surface_base). The charged surface_inner and surface_tip can induce obvious ICR individually, while the effects of the charged surface_base on ICR can be ignored. The fully charged surface_inner alone could render the nanopore counterion-selective and induces significant ion concentration polarization in the tip region, which causes reverse ICR compared to nanopores with all surfaces charged. In addition, the direction and degree of rectification can be further tuned by the depth of the charged surface_inner. When considering the exterior membrane surface only, the charged surface_tip causes intra-pore ionic enrichment and depletion under opposite biases, which results in significant ICR. Its effective region is within ~40 nm beyond the tip orifice. We also found that individual charged parts of the pore system contributed to ICR in an additive way due to the additive effect on the ion concentration regulation along the pore axis. With various combinations of fully/partially charged surface_inner and surface_tip, diverse ICR ratios from ~2 to ~170 can be achieved. Our findings shed light on the mechanism of ionic current rectification in ultra-short conical nanopores, and provide a useful guide to the design and modification of ultra-short conical nanopores in ionic circuits and nanofluidic sensors.

preprint | 2020 | arXiv

Multi-head Monotonic Chunkwise Attention For Online Speech Recognition

The attention mechanism of the Listen, Attend and Spell (LAS) model requires the whole input sequence to calculate the attention context and thus is not suitable for online speech recognition. To deal with this problem, we propose multi-head monotonic chunk-wise attention (MTH-MoChA), an improved version of MoChA. MTH-MoChA splits the input sequence into small chunks and computes multi-head attentions over the chunks. We also explore useful training strategies such as LSTM pooling, minimum word error rate training and SpecAugment to further improve the performance of MTH-MoChA. Experiments on AISHELL-1 data show that the proposed model, along with the training strategies, improves the character error rate (CER) of MoChA from 8.96% to 7.68% on the test set. On another 18,000-hour in-car speech data set, MTH-MoChA obtains 7.28% CER, which is significantly better than a state-of-the-art hybrid system.
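Several abstracts on this page quote relative error-rate reductions, while this one gives absolute CER values. As a worked example, the AISHELL-1 numbers above (8.96% to 7.68%) can be converted to a relative reduction using the standard definition (baseline minus new, divided by baseline). The helper name is illustrative, and the resulting percentage is derived here rather than stated in the abstract.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # CER values quoted in the abstract, in percent.
    print(f"{relative_reduction(8.96, 7.68):.1%}")  # 14.3%
```

So the absolute drop of 1.28 CER points corresponds to roughly a one-seventh relative reduction in character errors.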

preprint | 2020 | arXiv

NMR study of the spin excitations in the frustrated antiferromagnet Yb(BaBO$_3$)$_3$ with a triangular lattice

In this paper, we study the spin excitation properties of the frustrated triangular-lattice antiferromagnet Yb(BaBO$_3$)$_3$ with nuclear magnetic resonance. From the spectral analysis, neither magnetic ordering nor spin freezing is observed with temperature down to $T=0.26$ K, far below its Curie-Weiss temperature $|θ_w|\sim2.3$ K. From the nuclear relaxation measurement, precise temperature-independent spin-lattice relaxation rates are observed at low temperatures under a weak magnetic field, indicating the gapless spin excitations. Further increasing the field intensity, we observe a spin excitation gap with the gap size proportional to the field intensity. These phenomena suggest a very unusual strongly correlated quantum disordered phase, and the implications for the quantum spin liquid state are further discussed.