Source author record

Yanjun Ma

Yanjun Ma appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computer Vision; Distributed, Parallel, and Cluster Computing; Information Theory; math.IT; Artificial Intelligence; cond-mat.mtrl-sci; Machine Learning; Biomolecules; cond-mat.mes-hall; cond-mat.str-el; eess.AS; quant-ph; Sound

Catalog footprint

What is connected

14 works

13 topics

4 close collaborators


preprint, 2026, arXiv

Mitigating Measurement Crosstalk via Pulse Shaping

Quantum error correction protocols require rapid and repeated qubit measurements. While multiplexed readout in superconducting quantum systems improves efficiency, fast probe pulses introduce spectral broadening, leading to signal leakage into neighboring readout resonators. This crosstalk results in qubit dephasing and degraded readout fidelity. Here, we introduce a pulse shaping technique inspired by the derivative removal by adiabatic gate (DRAG) protocol to suppress measurement crosstalk during fast readout. By engineering a spectral notch at neighboring resonator frequencies, the method effectively mitigates spurious signal interference. Our approach integrates seamlessly with existing readout architectures, enabling fast, low-crosstalk multiplexed measurements without additional hardware overhead - a critical advancement for scalable quantum computing.
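The spectral-notch idea can be illustrated with a first-order derivative correction of the kind used in DRAG. This is a minimal sketch, not the paper's pulse: the Gaussian envelope, the 10 ns width, the 30 MHz neighbor detuning, and all function names are assumptions chosen for illustration. Adding a quadrature copy of the envelope's derivative, scaled by 1/delta, multiplies the spectrum by (1 - w/delta), which is zero at the neighbor's detuning.

```python
import numpy as np

def gaussian_envelope(t, sigma):
    """Plain Gaussian probe envelope centred at t = 0."""
    return np.exp(-t**2 / (2.0 * sigma**2))

def notched_envelope(t, sigma, delta):
    """Gaussian plus a quadrature derivative term (DRAG-style, first order).

    With S(w) = integral s(t) exp(-i w t) dt, the derivative g'(t) transforms
    to i*w*G(w), so s = g + 1j*g'/delta has spectrum G(w) * (1 - w/delta),
    which vanishes at w = delta.
    """
    g = gaussian_envelope(t, sigma)
    dg = -t / sigma**2 * g  # analytic derivative of the Gaussian
    return g + 1j * dg / delta

def spectrum_at(signal, t, omega):
    """Riemann-sum estimate of the Fourier transform at angular frequency omega."""
    dt = t[1] - t[0]
    return np.sum(signal * np.exp(-1j * omega * t)) * dt

sigma = 10.0                     # ns, hypothetical pulse width
delta = 2 * np.pi * 0.030        # rad/ns, hypothetical 30 MHz neighbour detuning
t = np.arange(-60.0, 60.0, 0.1)  # ns, wide enough that the Gaussian tails are negligible

plain = abs(spectrum_at(gaussian_envelope(t, sigma), t, delta))
shaped = abs(spectrum_at(notched_envelope(t, sigma, delta), t, delta))
suppression = shaped / plain
print(f"leakage at neighbour: plain={plain:.3f}, shaped={shaped:.2e}")
```

The notch in this sketch is one-sided: the same correction increases the spectral weight at -delta, so a layout with neighbors on both sides would need a different correction than the one shown here.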

preprint, 2022, arXiv

Boosting Distributed Training Performance of the Unpadded BERT Model

Pre-trained models are an important tool in Natural Language Processing (NLP), and BERT is a classic pre-trained model whose structure has been widely adopted by later work. It was even chosen as the reference model for the MLPerf training benchmark. Optimizing the distributed training performance of BERT models plays an important role in accelerating solutions to most NLP tasks. The BERT model often takes padded tensors as its inputs, leading to excessive redundant computation. Removing this redundant computation is therefore essential to improving distributed training performance. This paper designs a new approach to train BERT models with variable-length inputs efficiently. First, we propose a general structure for variable-length BERT models and accelerate the encoder layer via our grouped multi-stream FMHA (Fused Multi-Head Attention) method. Second, we address the unbalanced workload caused by variable-length inputs through data exchange that overlaps heavily with the training process. Finally, we optimize the overall performance of the BERT model with techniques such as kernel fusion and operator optimization. Our experimental results show that our highly optimized BERT model achieves state-of-the-art throughput and ranks first in MLPerf Training v2.0 within the same GPU configuration. The optimizations in this paper can be applied to more BERT-like models in future work.
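The cost of padding, and the workload imbalance that variable-length inputs create, can be made concrete with a small sketch. Everything here is illustrative: the function names, the toy token IDs, and the greedy longest-first balancing are assumptions, not the paper's FMHA kernels or its data-exchange scheme.

```python
from itertools import accumulate

def pack_unpadded(sequences):
    """Concatenate variable-length sequences and record their boundaries.

    Returns (tokens, offsets) where sequence i occupies
    tokens[offsets[i]:offsets[i + 1]].
    """
    tokens = [tok for seq in sequences for tok in seq]
    offsets = [0] + list(accumulate(len(seq) for seq in sequences))
    return tokens, offsets

def padded_token_count(sequences):
    """Tokens processed when every sequence is padded to the longest one."""
    return len(sequences) * max(len(seq) for seq in sequences)

def balance_by_length(lengths, num_workers):
    """Greedy longest-first assignment of sequences to the least-loaded worker.

    A stand-in heuristic for illustration; the paper's exchange strategy
    is not described in the abstract.
    """
    loads = [0] * num_workers
    assignment = [[] for _ in range(num_workers)]
    for idx in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        worker = loads.index(min(loads))
        assignment[worker].append(idx)
        loads[worker] += lengths[idx]
    return assignment, loads

# Toy batch of token-ID sequences with lengths 4, 3 and 7.
batch = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 7, 8, 102]]
tokens, offsets = pack_unpadded(batch)
assignment, loads = balance_by_length([len(seq) for seq in batch], num_workers=2)
print(len(tokens), "real tokens vs", padded_token_count(batch), "padded tokens")
print("offsets:", offsets, "per-worker token loads:", loads)
```

In this toy batch the padded layout processes 21 token slots for 14 real tokens, and the offsets list is what lets a kernel find each sequence inside the packed buffer.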

preprint, 2022, arXiv

HelixFold: An Efficient Implementation of AlphaFold2 using PaddlePaddle

Accurate protein structure prediction can significantly accelerate the development of life science. The accuracy of AlphaFold2, a frontier end-to-end structure prediction system, is already close to that of the experimental determination techniques. Due to the complex model architecture and large memory consumption, it requires lots of computational resources and time to implement the training and inference of AlphaFold2 from scratch. The cost of running the original AlphaFold2 is expensive for most individuals and institutions. Therefore, reducing this cost could accelerate the development of life science. We implement AlphaFold2 using PaddlePaddle, namely HelixFold, to improve training and inference speed and reduce memory consumption. The performance is improved by operator fusion, tensor fusion, and hybrid parallelism computation, while the memory is optimized through Recompute, BFloat16, and memory read/write in-place. Compared with the original AlphaFold2 (implemented with Jax) and OpenFold (implemented with PyTorch), HelixFold needs only 7.5 days to complete the full end-to-end training and only 5.3 days when using hybrid parallelism, while both AlphaFold2 and OpenFold take about 11 days. HelixFold saves 1x training time. We verified that HelixFold's accuracy could be on par with AlphaFold2 on the CASP14 and CAMEO datasets. HelixFold's code is available on GitHub for free download: https://github.com/PaddlePaddle/PaddleHelix/tree/dev/apps/protein_folding/helixfold, and we also provide stable web services on https://paddlehelix.baidu.com/app/drug/protein/forecast.
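The training-time figures quoted above can be turned into ratios with a few lines of arithmetic. This is only one possible reading of "saves 1x training time": the abstract does not define the phrase, and the 11-day baseline is stated as approximate. The helper names are illustrative.

```python
def speedup(baseline_days, new_days):
    """Ratio of baseline wall-clock time to the new wall-clock time."""
    return baseline_days / new_days

def extra_time_ratio(baseline_days, new_days):
    """How much longer the baseline runs, in units of the new runtime."""
    return (baseline_days - new_days) / new_days

BASELINE_DAYS = 11.0          # "about 11 days" for AlphaFold2 and OpenFold
HELIXFOLD_DAYS = 7.5          # full end-to-end training
HELIXFOLD_HYBRID_DAYS = 5.3   # with hybrid parallelism

for label, days in [("HelixFold", HELIXFOLD_DAYS),
                    ("HelixFold + hybrid parallelism", HELIXFOLD_HYBRID_DAYS)]:
    print(f"{label}: {speedup(BASELINE_DAYS, days):.2f}x speedup, "
          f"baseline takes {extra_time_ratio(BASELINE_DAYS, days):.2f}x longer")
```

Under this reading, the hybrid-parallel run is roughly twice as fast as the baseline, which means the baseline needs about one additional HelixFold runtime to finish.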

preprint, 2022, arXiv

Nebula-I: A General Framework for Collaboratively Training Deep Learning Models on Low-Bandwidth Cloud Clusters

The ever-growing model size and scale of compute have attracted increasing interest in training deep learning models over multiple nodes. However, when it comes to training on cloud clusters, especially across remote clusters, huge challenges are faced. In this work, we introduce a general framework, Nebula-I, for collaboratively training deep learning models over remote heterogeneous clusters, the connections between which are low-bandwidth wide area networks (WANs). We took natural language processing (NLP) as an example to show how Nebula-I works in different training phases that include: a) pre-training a multilingual language model using two remote clusters; and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models, which run through the most popular paradigm of recent deep learning. To balance the accuracy and communication efficiency, in Nebula-I, parameter-efficient training strategies, hybrid parallel computing methods and adaptive communication acceleration techniques are jointly applied. Meanwhile, security strategies are employed to guarantee the safety, reliability and privacy in intra-cluster computation and inter-cluster communication. Nebula-I is implemented with the PaddlePaddle deep learning framework, which can support collaborative training over heterogeneous hardware, e.g. GPU and NPU. Experiments demonstrate that the proposed framework can substantially improve training efficiency while preserving satisfactory NLP performance. By using Nebula-I, users can run large-scale training tasks over cloud clusters with minimal development effort, and the utility of existing large pre-trained models could be further promoted. We also introduced new state-of-the-art results on cross-lingual natural language inference tasks, which are generated based upon a novel learning framework and Nebula-I.

preprint, 2022, arXiv

PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit

PaddleSpeech is an open-source all-in-one speech toolkit. It aims at facilitating the development and research of speech processing technologies by providing an easy-to-use command-line interface and a simple code structure. This paper describes the design philosophy and core architecture of PaddleSpeech to support several essential speech-to-text and text-to-speech tasks. PaddleSpeech achieves competitive or state-of-the-art performance on various speech datasets and implements the most popular methods. It also provides recipes and pretrained models to quickly reproduce the experimental results in this paper. PaddleSpeech is publicly available at https://github.com/PaddlePaddle/PaddleSpeech.

preprint, 2022, arXiv

PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model

Real-world applications have high demands for semantic segmentation methods. Although semantic segmentation has made remarkable leaps forward with deep learning, the performance of real-time methods is not satisfactory. In this work, we propose PP-LiteSeg, a novel lightweight model for the real-time semantic segmentation task. Specifically, we present a Flexible and Lightweight Decoder (FLD) to reduce the computation overhead of previous decoders. To strengthen feature representations, we propose a Unified Attention Fusion Module (UAFM), which takes advantage of spatial and channel attention to produce a weight and then fuses the input features with the weight. Moreover, a Simple Pyramid Pooling Module (SPPM) is proposed to aggregate global context with low computation cost. Extensive evaluations demonstrate that PP-LiteSeg achieves a superior trade-off between accuracy and speed compared to other methods. On the Cityscapes test set, PP-LiteSeg achieves 72.0% mIoU/273.6 FPS and 77.5% mIoU/102.6 FPS on NVIDIA GTX 1080Ti. Source code and models are available at PaddleSeg: https://github.com/PaddlePaddle/PaddleSeg.

preprint, 2022, arXiv

PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System

Optical character recognition (OCR) technology has been widely used in various scenes, as shown in Figure 1. Designing a practical OCR system is still a meaningful but challenging task. In previous work, considering the efficiency and accuracy, we proposed a practical ultra lightweight OCR system (PP-OCR), and an optimized version PP-OCRv2. In order to further improve the performance of PP-OCRv2, a more robust OCR system PP-OCRv3 is proposed in this paper. PP-OCRv3 upgrades the text detection model and text recognition model in 9 aspects based on PP-OCRv2. For the text detector, we introduce a PAN module with a large receptive field named LK-PAN, an FPN module with a residual attention mechanism named RSE-FPN, and a DML distillation strategy. For the text recognizer, the base model is changed from CRNN to SVTR, and we introduce the lightweight text recognition network SVTR LCNet, guided training of CTC by attention, the data augmentation strategy TextConAug, a better pre-trained model by self-supervised TextRotNet, UDML, and UIM to accelerate the model and improve the effect. Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than PP-OCRv2 under comparable inference speed. All of the above-mentioned models are open-sourced and the code is available in the GitHub repository PaddleOCR which is powered by PaddlePaddle.

preprint, 2022, arXiv

PP-ShiTu: A Practical Lightweight Image Recognition System

In recent years, image recognition applications have developed rapidly. A large number of studies and techniques have emerged in different fields, such as face recognition, pedestrian and vehicle re-identification, landmark retrieval, and product recognition. In this paper, we propose a practical lightweight image recognition system, named PP-ShiTu, consisting of the following three modules: mainbody detection, feature extraction, and vector search. We introduce popular strategies including metric learning, deep hash, knowledge distillation and model quantization to improve accuracy and inference speed. With the strategies above, PP-ShiTu works well in different scenarios with a set of models trained on a mixed dataset. Experiments on different datasets and benchmarks show that the system is widely effective in different domains of image recognition. All of the above-mentioned models are open-sourced and the code is available in the GitHub repository PaddleClas on PaddlePaddle.

preprint, 2019, arXiv

Realization of Epitaxial Thin Films of the Topological Crystalline Insulator Sr$_3$SnO

Topological materials are derived from the interplay between symmetry and topology. Advances in topological band theories have led to the prediction that the antiperovskite oxide Sr$_3$SnO is a topological crystalline insulator, a new electronic phase of matter where the conductivity in its (001) crystallographic planes is protected by crystallographic point group symmetries. Realization of this material, however, is challenging. Guided by thermodynamic calculations we design and implement a deposition approach to achieve the adsorption-controlled growth of epitaxial Sr$_3$SnO single-crystal films by molecular-beam epitaxy (MBE). In-situ transport and angle-resolved photoemission spectroscopy measurements reveal the metallic and non-trivial topological nature of the as-grown samples. Compared with conventional MBE, the synthesis route used results in superior sample quality and is readily adapted to other topological systems with antiperovskite structures. The successful realization of thin films of topological crystalline insulators opens opportunities to manipulate topological states by tuning symmetries via epitaxial strain and heterostructuring.

preprint, 2015, arXiv

Ultrafast observation of electron hybridization and in-gap states formation in Kondo insulator SmB6

SmB6 is a promising candidate topological Kondo insulator. In this letter, we report ultrafast carrier dynamics of SmB6. Two characteristic temperatures, T1 = 100 K and T2 = 20 K, are observed. T1 corresponds to the opening of the f-d hybridization gap revealed by an abrupt disappearance of terahertz f-band plasmon oscillations. Between T1 and T2, a phonon bottleneck effect dominates the photocarrier relaxation processes. Below T2, we observe the formation of in-gap states, which are strongly affected by optically injected hot electrons and the transient electron temperature change.

preprint, 2012, arXiv

On the Achievability of Interference Alignment for Three-Cell Constant Cellular Interfering Networks

For a three-cell constant cellular interfering network, a new property of alignment is identified, i.e., an interference alignment (IA) solution obtained in a user-cooperation scenario can also be applied in a non-cooperation environment. By using this property, an algorithm is proposed by jointly designing transmit and receive beamforming matrices. Analysis and numerical results show that more degrees of freedom (DoF) can be achieved compared with conventional schemes in most cases.

preprint, 2011, arXiv

Distributed Interference Alignment with Low Overhead

Based on closed-form interference alignment (IA) solutions, a low overhead distributed interference alignment (LOIA) scheme is proposed in this paper for the $K$-user SISO interference channel, and an extension to the multiple-antenna scenario is also considered. Compared with the iterative interference alignment (IIA) algorithm proposed by Gomadam et al., the overhead is greatly reduced. Simulation results show that the IIA algorithm is strictly suboptimal compared with our LOIA algorithm in the overhead-limited scenario.

preprint, 2010, arXiv

Group Based Interference Alignment

In the $K$-user single-input single-output (SISO) frequency-selective fading interference channel, it is shown that the maximal achievable multiplexing gain is almost surely $K/2$ by using interference alignment (IA). However, when the signaling dimensions are limited, allocating all the resources to all users simultaneously is not optimal. So, a group based interference alignment (GIA) scheme is proposed, and it is formulated as an unbounded knapsack problem. Optimal and greedy search algorithms are proposed to obtain group patterns. Analysis and numerical results show that the GIA scheme can obtain a higher multiplexing gain when the resources are limited.
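The unbounded knapsack formulation mentioned above can be sketched with the standard dynamic program. The mapping used here is an assumption for illustration: each candidate group pattern is given a hypothetical integer cost (signaling dimensions consumed) and a value (multiplexing gain contributed), and any pattern may be reused. The numbers are invented and are not taken from the paper.

```python
def unbounded_knapsack(capacity, items):
    """Maximise total value with reusable items under an integer capacity.

    items is a list of (cost, value) pairs with positive integer cost.
    Returns (best_value, counts), where counts[i] is how often item i is used.
    """
    best = [0.0] * (capacity + 1)
    choice = [-1] * (capacity + 1)  # -1 means "leave one unit of capacity unused"
    for c in range(1, capacity + 1):
        best[c] = best[c - 1]
        for i, (cost, value) in enumerate(items):
            if cost <= c and best[c - cost] + value > best[c]:
                best[c] = best[c - cost] + value
                choice[c] = i
    # Walk back through the recorded choices to recover the selection.
    counts = [0] * len(items)
    c = capacity
    while c > 0:
        i = choice[c]
        if i < 0:
            c -= 1
        else:
            counts[i] += 1
            c -= items[i][0]
    return best[capacity], counts

# Hypothetical group patterns: (dimensions consumed, multiplexing gain).
patterns = [(3, 1.0), (4, 1.5), (7, 3.0)]
gain, counts = unbounded_knapsack(10, patterns)
print("total gain:", gain, "pattern counts:", counts)
```

With 10 available dimensions, this toy instance selects one copy of the first pattern and one of the third. The table has capacity + 1 entries and each entry scans every pattern, so the running time grows with the product of the capacity and the number of patterns.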

preprint, 2010, arXiv

Rewritable nanoscale oxide photodetector

Nanophotonic devices seek to generate, guide, and/or detect light using structures whose nanoscale dimensions are closely tied to their functionality. Semiconducting nanowires, grown with tailored optoelectronic properties, have been successfully placed into devices for a variety of applications. However, the integration of photonic nanostructures with electronic circuitry has always been one of the most challenging aspects of device development. Here we report the development of rewritable nanoscale photodetectors created at the interface between LaAlO3 and SrTiO3. Nanowire junctions with characteristic dimensions 2-3 nm are created using a reversible AFM writing technique. These nanoscale devices exhibit a remarkably high gain for their size, in part because of the large electric fields produced in the gap region. The photoconductive response is gate-tunable and spans the visible-to-near-infrared regime. The ability to integrate rewritable nanoscale photodetectors with nanowires and transistors in a single materials platform foreshadows new families of integrated optoelectronic devices and applications.
