Source author record

Ming Yan

Ming Yan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Catalog footprint

What is connected

44 works

24 topics

4 close collaborators

preprint · 2026 · arXiv

Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

With the bloom of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks. However, they fall short in comprehending context involving multiple images. A primary reason for this shortcoming is that the visual features for each image are encoded individually by frozen encoders before being fed into the LLM backbone, lacking awareness of other images and the multimodal instructions. We term this issue prior-LLM modality isolation and propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion prior to feeding the features into LLMs. This paradigm initially "browses" through the inputs for essential insights, and then revisits the inputs to "concentrate" on crucial details, guided by these insights, to achieve a more comprehensive understanding of the multimodal inputs. Additionally, we develop training strategies specifically to enhance the understanding of multi-image inputs. Our method markedly boosts the performance on 7 multi-image scenarios, contributing to increments on average accuracy by 2.13% and 7.60% against strong MLLM baselines with 3B and 11B LLMs, respectively.

preprint · 2026 · arXiv

Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs' short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA pairs with inherent reasoning chains. Building on this, we introduce Long-context Process Advantage Shaping (LongPAS), a simple yet effective method that performs fine-grained credit assignment by evaluating reasoning steps along Validity and Relevance dimensions, which captures critical learning signals from "almost-there" trajectories. Experiments on three long-context reasoning benchmarks show that our approach substantially outperforms RLVR baselines and matches frontier LLMs while using far fewer parameters. Further analysis confirms the effectiveness of our methods in strengthening long-context reasoning while maintaining stable RL training.

preprint · 2026 · arXiv

Real-Time Lane Detection via Efficient Feature Alignment and Covariance Optimization for Low-Power Embedded Systems

Real-time lane detection in embedded systems encounters significant challenges due to subtle and sparse visual signals in RGB images, often constrained by limited computational resources and power consumption. Although deep learning models for lane detection are categorized into segmentation-based, anchor-based, and curve-based methods, there remains a scarcity of universally applicable optimization techniques tailored for low-power embedded environments. To overcome this, we propose an innovative Covariance Distribution Optimization (CDO) module specifically designed for efficient, real-time applications. The CDO module aligns lane feature distributions closely with ground-truth labels, significantly enhancing detection accuracy without increasing computational complexity. Evaluations were conducted on six diverse models across all three method categories, including two optimized for real-time applications and four state-of-the-art (SOTA) models, tested comprehensively on three major datasets: CULane, TuSimple, and LLAMAS. Experimental results demonstrate accuracy improvements ranging from 0.01% to 1.5%. The proposed CDO module is characterized by ease of integration into existing systems without structural modifications and utilizes existing model parameters to facilitate ongoing training, thus offering substantial benefits in performance, power efficiency, and operational flexibility in embedded systems.

preprint · 2026 · arXiv

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/

preprint · 2024 · arXiv

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

Recently, the strong text creation ability of Large Language Models (LLMs) has given rise to many tools for assisting paper reading or even writing. However, the weak diagram analysis abilities of LLMs or Multimodal LLMs greatly limit their application scenarios, especially for scientific academic paper writing. In this work, towards a more versatile copilot for academic paper writing, we mainly focus on strengthening the multi-modal diagram analysis ability of Multimodal LLMs. By parsing LaTeX source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset M-Paper. By aligning diagrams in the paper with related paragraphs, we construct professional diagram analysis samples for training and evaluation. M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or LaTeX code. Besides, to better align the copilot with the user's intention, we introduce the "outline" as the control signal, which could be directly given by the user or revised based on auto-generated ones. Comprehensive experiments with a state-of-the-art Multimodal LLM demonstrate that training on our dataset yields stronger scientific diagram understanding performance, including diagram captioning, diagram analysis, and outline recommendation. The dataset, code, and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/PaperOwl.

preprint · 2023 · arXiv

A Comprehensive Study on Optimizing Systems with Data Processing Units

New hardware, such as SmartNICs, has been released to offload network applications in data centers. Off-path SmartNICs, a type of multi-core SoC SmartNICs, have attracted the attention of many researchers. Unfortunately, off-path SmartNICs have not yet been fully explored. In this paper, we use a BlueField SmartNIC as an example to conduct a systematic study of the advantages and disadvantages of off-path SmartNICs. We make a detailed performance characterization of an off-path SmartNIC, including computing power and network communication overhead, and offer the following advice: 1) Directly utilize the specific accelerators on the SmartNIC to offload applications; 2) Offload latency-insensitive background processing to the SmartNIC to reduce the load on the host; 3) Regard the SmartNIC as a new endpoint in the network to expand the computing power and storage resources of the server host; 4) Avoid directly employing the design method for systems based on on-path SmartNICs. We apply this advice to several use cases and show the performance improvements.

preprint · 2023 · arXiv

LARP: Language-Agent Role Play for Open-World Games

Language agents have shown impressive problem-solving skills within defined settings and brief timelines. Yet, with the ever-evolving complexities of open-world simulations, there's a pressing need for agents that can flexibly adapt to complex environments and consistently maintain a long-term memory to ensure coherent actions. To bridge the gap between language agents and open-world games, we introduce Language Agent for Role-Playing (LARP), which includes a cognitive architecture that encompasses memory processing and a decision-making assistant, an environment interaction module with a feedback-driven learnable action space, and a postprocessing method that promotes the alignment of various personalities. The LARP framework refines interactions between users and agents, predefined with unique backgrounds and personalities, ultimately enhancing the gaming experience in open-world contexts. Furthermore, it highlights the diverse uses of language models in a range of areas such as entertainment, education, and various simulation scenarios. The project page is released at https://miao-ai-lab.github.io/LARP/.

preprint · 2022 · arXiv

A Scheme to fabricate magnetic graphene-like cobalt nitride CoN4 monolayer proposed by first-principles calculations

We propose a scheme to fabricate the cobalt nitride CoN4 monolayer, a magnetic graphene-like two-dimensional material, in which all Co and N atoms are in a plane. Under pressure above 40 GPa, bulk CoN4 is stabilized in a triclinic phase. As the pressure decreases, the triclinic phase of CoN4 transforms into an orthorhombic phase, which is a layered compound with large interlayer spacing. At ambient conditions, the interlayer couplings are so weak that a single CoN4 layer can be exfoliated mechanically.

preprint · 2022 · arXiv

DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning

Although pre-trained language models (PLMs) have achieved state-of-the-art performance on various natural language processing (NLP) tasks, they are shown to be lacking in knowledge when dealing with knowledge-driven tasks. Despite the many efforts made for injecting knowledge into PLMs, this problem remains open. To address the challenge, we propose DictBERT, a novel approach that enhances PLMs with dictionary knowledge, which is easier to acquire than a knowledge graph (KG). During pre-training, we present two novel pre-training tasks to inject dictionary knowledge into PLMs via contrastive learning: dictionary entry prediction and entry description discrimination. In fine-tuning, we use the pre-trained DictBERT as a plugin knowledge base (KB) to retrieve implicit knowledge for identified entries in an input sequence, and infuse the retrieved knowledge into the input to enhance its representation via a novel extra-hop attention mechanism. We evaluate our approach on a variety of knowledge-driven and language understanding tasks, including NER, relation extraction, CommonsenseQA, OpenBookQA and GLUE. Experimental results demonstrate that our model can significantly improve typical PLMs: it gains a substantial improvement of 0.5%, 2.9%, 9.0%, 7.1% and 3.3% on BERT-large respectively, and is also effective on RoBERTa-large.

preprint · 2022 · arXiv

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., its temporal nature. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representation. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole in different time resolutions with a multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.

preprint · 2022 · arXiv

Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus

Semantic code search is the task of retrieving a relevant code snippet given a natural language query. Unlike typical information retrieval tasks, code search requires bridging the semantic gap between programming language and natural language to better describe intrinsic concepts and semantics. Recently, deep neural networks for code search have been a hot research topic. Typical methods for neural code search first represent the code snippet and query text as separate embeddings, and then use vector distance (e.g. dot-product or cosine) to calculate the semantic similarity between them. There exist many different ways of aggregating the variable-length code or query tokens into a learnable embedding, including bi-encoder, cross-encoder, and poly-encoder. The goal of the query encoder and code encoder is to produce embeddings that are close to each other for a related pair of query and desired code snippet, so the choice and design of the encoder is significant. In this paper, we propose a novel deep semantic model which makes use of not only the multi-modal sources, but also feature extractors such as self-attention, the aggregated vectors, and combinations of the intermediate representations. We apply the proposed model to tackle the CodeSearchNet challenge on semantic code search. We align cross-lingual embeddings for multi-modality learning with large batches and hard example mining, and combine different learned representations to enhance representation learning. Our model is trained on the CodeSearchNet corpus and evaluated on the held-out data; the final model achieves 0.384 NDCG and won first place in this benchmark. Models and code are available at https://github.com/overwindows/SemanticCodeSearch.git.
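
The embed-then-compare retrieval step this abstract describes can be illustrated with a minimal sketch. The vectors, snippet names, and the `rank_snippets` helper are hypothetical; a real system would obtain embeddings from trained query and code encoders, and this is not the authors' model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_snippets(query_vec, code_vecs):
    """Return snippet names sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in code_vecs.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical 3-dimensional embeddings, for illustration only.
code_vecs = {
    "read_file": [0.9, 0.1, 0.0],
    "sort_list": [0.1, 0.9, 0.2],
    "http_get": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for an encoded natural language query
ranking = rank_snippets(query_vec, code_vecs)
```

Here `ranking` lists the snippet whose embedding points in the direction closest to the query first. Swapping `cosine` for a plain dot product gives the other distance the abstract mentions.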

preprint · 2022 · arXiv

On the improved conditions for some primal-dual algorithms

The convex minimization of $f(\mathbf{x})+g(\mathbf{x})+h(\mathbf{A}\mathbf{x})$ over $\mathbb{R}^n$ with differentiable $f$ and linear operator $\mathbf{A}: \mathbb{R}^n\rightarrow \mathbb{R}^m$, has been well-studied in the literature. By considering the primal-dual optimality of the problem, many algorithms are proposed from different perspectives such as monotone operator scheme and fixed point theory. In this paper, we start with a base algorithm to reveal the connection between several algorithms such as AFBA, PD3O and Chambolle-Pock. Then, we prove its convergence under a relaxed assumption associated with the linear operator and characterize the general constraint on primal and dual stepsizes. The result improves the upper bound of stepsizes of AFBA and indicates that Chambolle-Pock, as the special case of the base algorithm when $f=0$, can take the stepsize of the dual iteration up to $4/3$ of the previously proven one.
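
For orientation, the Chambolle-Pock special case ($f=0$) mentioned in this abstract is commonly written as the iteration below. The stepsize symbols $\tau$ and $\sigma$ and the superscript indexing are notation assumed here, not taken from the abstract, and the stepsize condition shown is the classical one from the literature rather than the paper's improved bound.

```latex
\begin{aligned}
\mathbf{x}^{k+1} &= \operatorname{prox}_{\tau g}\bigl(\mathbf{x}^{k} - \tau \mathbf{A}^{\top}\mathbf{y}^{k}\bigr),\\
\mathbf{y}^{k+1} &= \operatorname{prox}_{\sigma h^{*}}\bigl(\mathbf{y}^{k} + \sigma \mathbf{A}(2\mathbf{x}^{k+1} - \mathbf{x}^{k})\bigr),
\end{aligned}
\qquad \tau\sigma\,\|\mathbf{A}\|^{2} < 1 .
```

Read against this form, the abstract's claim is that, for a given primal stepsize $\tau$, the admissible dual stepsize $\sigma$ can be up to $4/3$ of the previously proven one.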

preprint · 2022 · arXiv

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding

Visual grounding focuses on establishing fine-grained alignment between vision and natural language, which has essential applications in multimodal reasoning systems. Existing methods use pre-trained query-agnostic visual backbones to extract visual feature maps independently without considering the query information. We argue that the visual features extracted from the visual backbones and the features really needed for multimodal reasoning are inconsistent. One reason is that there are differences between pre-training tasks and visual grounding. Moreover, since the backbones are query-agnostic, it is difficult to completely avoid the inconsistency issue by training the visual backbone end-to-end in the visual grounding framework. In this paper, we propose a Query-modulated Refinement Network (QRNet) to address the inconsistency issue by adjusting intermediate features in the visual backbone with a novel Query-aware Dynamic Attention (QD-ATT) mechanism and query-aware multiscale fusion. The QD-ATT can dynamically compute query-dependent visual attention at the spatial and channel levels of the feature maps produced by the visual backbone. We apply the QRNet to an end-to-end visual grounding framework. Extensive experiments show that the proposed method outperforms state-of-the-art methods on five widely used datasets.

preprint · 2022 · arXiv

Strong large scale magnetic fields in rotating convection-driven dynamos: the important role of magnetic diffusion

Natural dynamos such as planets and stars generate global scale magnetic fields despite the inferred presence of small scale turbulence. Such systems are known as large scale dynamos and are typically driven by convection and influenced by rotation. Previous numerical studies of rotating dynamos generally find that the large scale magnetic field becomes weaker as the flow becomes more turbulent. The underlying physical processes necessary for sustaining so-called large scale dynamos are therefore still debated. Here we use a suite of numerical simulations to show that strong large scale magnetic fields can be generated in rotating convective turbulence provided that two conditions are satisfied: (1) the flow remains rotationally constrained; and (2) magnetic diffusion is important on the small convective length scale. These findings are in agreement with previous asymptotic predictions and suggest that natural dynamos might satisfy these two conditions.

preprint · 2022 · arXiv

WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types

Multimodal Entity Linking (MEL), which aims at linking mentions with multimodal contexts to the referent entities from a knowledge base (e.g., Wikipedia), is an essential task for many multimodal applications. Although much attention has been paid to MEL, the shortcomings of existing MEL datasets, including limited contextual topics and entity types, simplified mention ambiguity, and restricted availability, have caused great obstacles to the research and application of MEL. In this paper, we present WikiDiverse, a high-quality human-annotated MEL dataset with diversified contextual topics and entity types from Wikinews, which uses Wikipedia as the corresponding knowledge base. A well-tailored annotation procedure is adopted to ensure the quality of the dataset. Based on WikiDiverse, a sequence of well-designed MEL models with intra-modality and inter-modality attentions is implemented, which utilize the visual information of images more adequately than existing MEL models do. Extensive experimental analyses are conducted to investigate the contributions of different modalities in terms of MEL, facilitating future research on this task. The dataset and baseline models are available at https://github.com/wangxw5/wikiDiverse.

preprint · 2021 · arXiv

CoRe: An Efficient Coarse-refined Training Framework for BERT

In recent years, BERT has made significant breakthroughs on many natural language processing tasks and attracted great attention. Despite its accuracy gains, the BERT model generally involves a huge number of parameters and needs to be trained on massive datasets, so training such a model is computationally very challenging and time-consuming. Hence, training efficiency is a critical issue. In this paper, we propose a novel coarse-refined training framework named CoRe to speed up the training of BERT. Specifically, we decompose the training process of BERT into two phases. In the first phase, by introducing a fast attention mechanism and decomposing the large parameters in the feed-forward network sub-layer, we construct a relaxed BERT model which has far fewer parameters and much lower model complexity than the original BERT, so the relaxed model can be quickly trained. In the second phase, we transform the trained relaxed BERT model into the original BERT and further retrain the model. Thanks to the desired initialization provided by the relaxed model, the retraining phase requires far fewer training steps, compared with training an original BERT model from scratch with a random initialization. Experimental results show that the proposed CoRe framework can greatly reduce the training time without reducing the performance.

preprint · 2021 · arXiv

New convergence analysis of a primal-dual algorithm with large stepsizes

We consider a primal-dual algorithm for minimizing $f(x)+h\square l(Ax)$ with Fréchet differentiable $f$ and $l^*$. This primal-dual algorithm has two names in the literature: Primal-Dual Fixed-Point algorithm based on the Proximity Operator (PDFP$^2$O) and Proximal Alternating Predictor-Corrector (PAPC). In this paper, we prove its convergence under a weaker condition on the stepsizes than existing ones. With additional assumptions, we show its linear convergence. In addition, we show that this condition (the upper bound of the stepsize) is tight and cannot be weakened. This result also recovers a recently proposed positive-indefinite linearized augmented Lagrangian method. In addition, we apply this result to a decentralized consensus algorithm PG-EXTRA and derive the weakest convergence condition.

preprint · 2021 · arXiv

On linear convergence of two decentralized algorithms

Decentralized algorithms solve multi-agent problems over a connected network, where the information can only be exchanged with the accessible neighbors. Though there exist several decentralized optimization algorithms, there are still gaps in convergence conditions and rates between decentralized and centralized algorithms. In this paper, we fill some gaps by considering two decentralized algorithms: EXTRA and NIDS. They both converge linearly with strongly convex objective functions. We will answer two questions regarding them. What are the optimal upper bounds for their stepsizes? Do decentralized algorithms require more properties on the functions for linear convergence than centralized ones? More specifically, we relax the required conditions for linear convergence of both algorithms. For EXTRA, we show that the stepsize is comparable to that of centralized algorithms. For NIDS, the upper bound of the stepsize is shown to be exactly the same as the centralized ones. In addition, we relax the requirement for the objective functions and the mixing matrices. We provide the linear convergence results for both algorithms under the weakest conditions.
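
As a rough illustration of one of the two algorithms named in this abstract, the sketch below runs the EXTRA recursion as it is commonly stated in the literature on a toy consensus problem. The mixing matrix `W`, the data `b`, the stepsize, and the quadratic local objectives are all hypothetical choices for this example, and the stepsize is not meant to reflect the bounds derived in the paper.

```python
def mat_vec(M, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def extra(W, b, alpha, iters):
    """Run EXTRA on local objectives f_i(x) = 0.5 * (x - b[i])**2, one scalar per agent."""
    n = len(b)
    grad = lambda x: [xi - bi for xi, bi in zip(x, b)]  # stacked local gradients
    # W_tilde = (I + W) / 2, a common choice for the second mixing matrix
    W_tilde = [[(W[i][j] + (1.0 if i == j else 0.0)) / 2.0 for j in range(n)]
               for i in range(n)]
    x_prev = [0.0] * n
    g_prev = grad(x_prev)
    # First step: x^1 = W x^0 - alpha * grad(x^0)
    x = [wx - alpha * g for wx, g in zip(mat_vec(W, x_prev), g_prev)]
    for _ in range(iters):
        g = grad(x)
        Wx = mat_vec(W, x)
        Wt_prev = mat_vec(W_tilde, x_prev)
        # x^{k+2} = (I + W) x^{k+1} - W_tilde x^k - alpha * (grad(x^{k+1}) - grad(x^k))
        x_next = [xi + wxi - wti - alpha * (gi - gpi)
                  for xi, wxi, wti, gi, gpi in zip(x, Wx, Wt_prev, g, g_prev)]
        x_prev, g_prev, x = x, g, x_next
    return x

# Hypothetical 3-agent network with a symmetric, doubly stochastic mixing matrix.
W = [[0.5, 0.25, 0.25],
     [0.25, 0.5, 0.25],
     [0.25, 0.25, 0.5]]
b = [1.0, 2.0, 6.0]
result = extra(W, b, alpha=0.5, iters=100)
```

With these values, every entry of `result` should approach the average of `b`, which minimizes the sum of the local objectives. Each agent only ever combines its own value with its neighbors' values through `W`.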

preprint · 2020 · arXiv

A Multi-Agent Primal-Dual Strategy for Composite Optimization over Distributed Features

This work studies multi-agent sharing optimization problems with the objective function being the sum of smooth local functions plus a convex (possibly non-smooth) function coupling all agents. This scenario arises in many machine learning and engineering applications, such as regression over distributed features and resource allocation. We reformulate this problem into an equivalent saddle-point problem, which is amenable to decentralized solutions. We then propose a proximal primal-dual algorithm and establish its linear convergence to the optimal solution when the local functions are strongly-convex. To our knowledge, this is the first linearly convergent decentralized algorithm for multi-agent sharing problems with a general convex (possibly non-smooth) coupling function.

preprint · 2020 · arXiv

A Novel Regularization Based on the Error Function for Sparse Recovery

Regularization plays an important role in solving ill-posed problems by adding extra information about the desired solution, such as sparsity. Many regularization terms usually involve some vector norm, e.g., $L_1$ and $L_2$ norms. In this paper, we propose a novel regularization framework that uses the error function to approximate the unit step function. It can be considered as a surrogate function for the $L_0$ norm. The asymptotic behavior of the error function with respect to its intrinsic parameter indicates that the proposed regularization can approximate the standard $L_0$ and $L_1$ norms as the parameter approaches $0$ and $\infty$, respectively. Statistically, it is also less biased than the $L_1$ approach. We then incorporate the error function into either a constrained or an unconstrained model when recovering a sparse signal from an under-determined linear system. Computationally, both problems can be solved via an iterative reweighted $L_1$ (IRL1) algorithm with guaranteed convergence. A large number of experimental results demonstrate that the proposed approach outperforms the state-of-the-art methods in various sparse recovery scenarios.
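
The limiting behavior described in this abstract can be checked numerically with a small sketch. The penalty form used here, the sum over entries of $\int_0^{|x_i|} e^{-t^2/\sigma^2}\,dt$, is an assumption about how the error function enters the regularizer, as are the names `erf_penalty` and `sigma`; the abstract does not spell out the formula.

```python
import math

def erf_penalty(x, sigma):
    """Sum over entries of the integral from 0 to |x_i| of exp(-t^2 / sigma^2) dt."""
    scale = sigma * math.sqrt(math.pi) / 2.0  # closed form: scale * erf(|x_i| / sigma)
    return sum(scale * math.erf(abs(xi) / sigma) for xi in x)

x = [0.0, 0.5, -2.0]

# Large sigma: the penalty approaches the L1 norm, |0| + |0.5| + |-2| = 2.5.
near_l1 = erf_penalty(x, sigma=1e3)

# Small sigma: after dividing by sigma * sqrt(pi) / 2, the penalty
# approaches the number of nonzero entries (the L0 "norm"), here 2.
small = 1e-3
near_l0 = erf_penalty(x, sigma=small) / (small * math.sqrt(math.pi) / 2.0)
```

Under this assumed form, the $L_0$ limit holds only after rescaling by the factor $\sigma\sqrt{\pi}/2$, since the unscaled penalty itself shrinks to zero as $\sigma \to 0$.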

preprint · 2020 · arXiv

Accelerated Schemes for the $L_1/L_2$ Minimization

In this paper, we consider the $L_1/L_2$ minimization for sparse recovery and study its relationship with the $L_1 - \alpha L_2$ model. Based on this relationship, we propose three numerical algorithms to minimize this ratio model, two of which work as adaptive schemes and greatly reduce the computation time. Focusing on the two adaptive schemes, we discuss their connection to existing approaches and analyze their convergence. The experimental results demonstrate the proposed approaches are comparable to the state-of-the-art methods in sparse recovery and work particularly well when the ground-truth signal has a high dynamic range. Lastly, we reveal some empirical evidence on the exact $L_1$ recovery under various combinations of sparsity, coherence, and dynamic ranges, which calls for theoretical justification in the future.
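
To make the ratio objective concrete, the sketch below evaluates $\|x\|_1/\|x\|_2$ on a few hand-picked vectors. The helper name `l1_over_l2` and the example vectors are hypothetical; this only evaluates the objective and does not implement any of the paper's algorithms.

```python
import math

def l1_over_l2(x):
    """Ratio of the L1 norm to the L2 norm of a nonzero vector."""
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return l1 / l2

one_sparse = [3.0, 0.0, 0.0, 0.0]    # a single nonzero entry
dense = [1.0, 1.0, 1.0, 1.0]         # all entries of equal magnitude
scaled = [10.0 * v for v in dense]   # same direction, larger magnitude

ratio_sparse = l1_over_l2(one_sparse)
ratio_dense = l1_over_l2(dense)
ratio_scaled = l1_over_l2(scaled)
```

The ratio is smallest (equal to 1) for a vector with one nonzero entry and largest ($\sqrt{n}$) when all $n$ entries have equal magnitude, which is why minimizing it favors sparse solutions. It is also unchanged by rescaling the vector, unlike the $L_1$ norm alone.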

preprint · 2020 · arXiv

Efficient Hyperparameter Optimization in Deep Learning Using a Variable Length Genetic Algorithm

Convolutional Neural Networks (CNNs) have gained great success in many artificial intelligence tasks. However, finding a good set of hyperparameters for a CNN remains a challenging task. It usually requires an expert with deep knowledge, as well as trial and error. Genetic algorithms have been used in hyperparameter optimization. However, traditional genetic algorithms with fixed-length chromosomes may not be a good fit for optimizing deep learning hyperparameters, because deep learning models have a variable number of hyperparameters depending on the model depth. As the depth increases, the number of hyperparameters grows exponentially, and searching becomes exponentially harder. It is important to have an efficient algorithm that can find a good model in reasonable time. In this article, we propose to use a variable length genetic algorithm (GA) to systematically and automatically tune the hyperparameters of a CNN to improve its performance. Experimental results show that our algorithm can find good CNN hyperparameters efficiently. It is clear from our experiments that if more time is spent on optimizing the hyperparameters, better results could be achieved. Theoretically, if we had unlimited time and CPU power, we could find the optimized hyperparameters and achieve the best results in the future.

preprint · 2020 · arXiv

Fast algorithms for robust principal component analysis with an upper bound on the rank

Robust principal component analysis (RPCA) decomposes a data matrix into a low-rank part and a sparse part. There are mainly two types of algorithms for RPCA. The first type of algorithm applies regularization terms on the singular values of a matrix to obtain a low-rank matrix. However, calculating singular values can be very expensive for large matrices. The second type of algorithm represents the low-rank matrix as the product of two small matrices. Algorithms of this type are faster than the first type because no singular value decomposition (SVD) is required. However, the rank of the low-rank matrix is required, and an accurate rank estimation is needed to obtain a reasonable solution. In this paper, we propose algorithms that combine both types. Our proposed algorithms require an upper bound of the rank and SVD on small matrices. First, they are faster than the first type because the cost of SVD on small matrices is negligible. Second, they are more robust than the second type because an upper bound of the rank instead of the exact rank is required. Furthermore, we apply the Gauss-Newton method to increase the speed of our algorithms. Numerical experiments show the better performance of our proposed algorithms.

preprint · 2020 · arXiv

PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation

Self-supervised pre-training, such as BERT, MASS and BART, has emerged as a powerful technique for natural language understanding and generation. Existing pre-training techniques employ autoencoding and/or autoregressive objectives to train Transformer-based models by recovering original word tokens from corrupted text with some masked tokens. The training goals of existing techniques are often inconsistent with the goals of many language generation tasks, such as generative question answering and conversational response generation, for producing new text given context. This work presents PALM with a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus, specifically designed for generating new text conditioned on context. The new scheme alleviates the mismatch introduced by the existing denoising scheme between pre-training and fine-tuning where generation is more than reconstructing original text. An extensive set of experiments show that PALM achieves new state-of-the-art results on a variety of language generation benchmarks covering generative question answering (Rank 1 on the official MARCO leaderboard), abstractive summarization on CNN/DailyMail as well as Gigaword, question generation on SQuAD, and conversational response generation on Cornell Movie Dialogues.

preprint2019arXiv

A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates

This paper proposes a novel proximal-gradient algorithm for a decentralized optimization problem with a composite objective containing smooth and non-smooth terms. Specifically, the smooth and non-smooth terms are dealt with by gradient and proximal updates, respectively. The proposed algorithm is closely related to a previous algorithm, PG-EXTRA (Shi et al., 2015), but has a few advantages. First of all, agents use uncoordinated step-sizes, and the stable upper bounds on step-sizes are independent of network topologies. The step-sizes depend on local objective functions, and they can be as large as those of the gradient descent. Secondly, for the special case without non-smooth terms, linear convergence can be achieved under the strong convexity assumption. The dependence of the convergence rate on the objective functions and the network are separated, and the convergence rate of the new algorithm is as good as one of the two convergence rates that match the typical rates for the general gradient descent and the consensus averaging. We provide numerical experiments to demonstrate the efficacy of the introduced algorithm and validate our theoretical discoveries.

preprint2019arXiv

A Double Residual Compression Algorithm for Efficient Distributed Learning

Large-scale machine learning models are often trained by parallel stochastic gradient descent algorithms. However, the communication cost of gradient aggregation and model synchronization between the master and worker nodes becomes the major obstacle for efficient learning as the number of workers and the dimension of the model increase. In this paper, we propose DORE, a DOuble REsidual compression stochastic gradient descent algorithm, to reduce over $95\%$ of the overall communication such that the obstacle can be immensely mitigated. Our theoretical analyses demonstrate that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions. The experimental results validate that DORE achieves the best communication efficiency while maintaining similar model accuracy and convergence speed in comparison with state-of-the-art baselines.

preprint2018arXiv

Fast Signal Recovery from Saturated Measurements by Linear Loss and Nonconvex Penalties

Sign information is the key to overcoming the inevitable saturation error in compressive sensing systems, which causes information loss and results in bias. For sparse signal recovery from saturation, we propose to use a linear loss to improve on the effectiveness of existing methods that utilize hard constraints/hinge loss for sign consistency. Due to the use of linear loss, an analytical solution in the update process is obtained, and some nonconvex penalties are applicable, e.g., the minimax concave penalty, the $\ell_0$ norm, and the sorted $\ell_1$ norm. Theoretical analysis reveals that the estimation error can still be bounded. Generally, with linear loss and nonconvex penalties, the recovery performance is significantly improved and the computational time is greatly reduced, as verified by the numerical experiments.

preprint2017arXiv

Mixed one-bit compressive sensing with applications to overexposure correction for CT reconstruction

When a measurement falls outside the quantization or measurable range, it becomes saturated and cannot be used in classical reconstruction methods. For example, in C-arm angiography systems, which provide projection radiography, fluoroscopy, digital subtraction angiography, and are widely used for medical diagnoses and interventions, the limited dynamic range of C-arm flat detectors leads to overexposure in some projections during an acquisition, such as imaging relatively thin body parts (e.g., the knee). Aiming at overexposure correction for computed tomography (CT) reconstruction, we propose a mixed one-bit compressive sensing (M1bit-CS) model to acquire information from both regular and saturated measurements. This method is inspired by the recent progress on one-bit compressive sensing, which deals with only sign observations. Its successful applications imply that information carried by saturated measurements is useful to improve recovery quality. For the proposed M1bit-CS model, an alternating direction method of multipliers is developed and an iterative saturation detection scheme is established. Then we evaluate M1bit-CS on one-dimensional signal recovery tasks. In some experiments, the performance of the proposed algorithms on mixed measurements is almost the same as recovery on unsaturated ones with the same amount of measurements. Finally, we apply the proposed method to overexposure correction for CT reconstruction on a phantom and a simulated clinical image. The results are promising, as the typical streaking artifacts and capping artifacts introduced by saturated projection data are effectively reduced, yielding significant error reduction compared with existing algorithms based on extrapolation.

preprint2017arXiv

On the Convergence of Asynchronous Parallel Iteration with Unbounded Delays

Recent years have witnessed the surge of asynchronous parallel (async-parallel) iterative algorithms due to problems involving very large-scale data and a large number of decision variables. Because of asynchrony, the iterates are computed with outdated information, and the age of the outdated information, which we call delay, is the number of times it has been updated since its creation. Almost all recent works prove convergence under the assumption of a finite maximum delay and set their stepsize parameters accordingly. However, the maximum delay is practically unknown. This paper presents convergence analysis of an async-parallel method from a probabilistic viewpoint, and it allows for large unbounded delays. An explicit formula of stepsize that guarantees convergence is given depending on delays' statistics. With $p+1$ identical processors, we empirically measured that delays closely follow the Poisson distribution with parameter $p$, matching our theoretical model, and thus the stepsize can be set accordingly. Simulations on both convex and nonconvex optimization problems demonstrate the validity of our analysis and also show that the existing maximum-delay induced stepsize is too conservative, often slowing down the convergence of the algorithm.

preprint2016arXiv

A Multiphase Image Segmentation Based on Fuzzy Membership Functions and L1-norm Fidelity

In this paper, we propose a variational multiphase image segmentation model based on fuzzy membership functions and L1-norm fidelity. Then we apply the alternating direction method of multipliers to solve an equivalent problem. All the subproblems can be solved efficiently. Specifically, we propose a fast method to calculate the fuzzy median. Experimental results and comparisons show that the L1-norm based method is more robust to outliers such as impulse noise and keeps better contrast than its L2-norm counterpart. Theoretically, we prove the existence of the minimizer and analyze the convergence of the algorithm.

preprint2016arXiv

ARock: an Algorithmic Framework for Asynchronous Parallel Coordinate Updates

Finding a fixed point of a nonexpansive operator, i.e., $x^*=Tx^*$, abstracts many problems in numerical linear algebra, optimization, and other areas of scientific computing. To solve fixed-point problems, we propose ARock, an algorithmic framework in which multiple agents (machines, processors, or cores) update $x$ in an asynchronous parallel fashion. Asynchrony is crucial to parallel computing since it reduces synchronization wait, relaxes communication bottleneck, and thus speeds up computing significantly. At each step of ARock, an agent updates a randomly selected coordinate $x_i$ based on possibly out-of-date information on $x$. The agents share $x$ through either global memory or communication. If writing $x_i$ is atomic, the agents can read and write $x$ without memory locks. Theoretically, we show that if the nonexpansive operator $T$ has a fixed point, then with probability one, ARock generates a sequence that converges to a fixed point of $T$. Our conditions on $T$ and step sizes are weaker than comparable work. Linear convergence is also obtained. We propose special cases of ARock for linear systems, convex optimization, machine learning, as well as distributed and decentralized consensus problems. Numerical experiments of solving sparse logistic regression problems are presented.
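
The coordinate update described in this abstract can be illustrated with a serial toy sketch. It is not the ARock implementation: there are no parallel agents and no stale reads, only the randomized single-coordinate step $x_i \leftarrow x_i - \eta\,(x - Tx)_i$. The operator `apply_T`, its matrix and offset, the step size, and the function names are all hypothetical choices made for illustration.

```python
import random


def apply_T(x):
    """Hypothetical affine contraction T(x) = M x + c (an assumption, not from the paper)."""
    M = [[0.5, 0.1], [0.2, 0.3]]
    c = [1.0, 2.0]
    n = len(x)
    return [sum(M[i][j] * x[j] for j in range(n)) + c[i] for i in range(n)]


def random_coordinate_fixed_point(T, x0, step=0.5, n_steps=2000, seed=0):
    """Serially update one randomly chosen coordinate per step.

    Each step moves x_i toward (T x)_i by a fraction `step` of the
    fixed-point residual (x - T x)_i, leaving the other coordinates alone.
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(n_steps):
        i = rng.randrange(len(x))        # pick a coordinate uniformly at random
        residual_i = x[i] - T(x)[i]      # i-th entry of x - T(x)
        x[i] -= step * residual_i        # relaxed coordinate update
    return x


x = random_coordinate_fixed_point(apply_T, [0.0, 0.0])
print(x, apply_T(x))  # the two vectors should nearly coincide
```

In a real asynchronous setting each agent would compute the residual from a possibly outdated copy of $x$; the sketch reads the current $x$ so that its behavior is easy to reason about.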

preprint2016arXiv

Asynchronous Multi-Task Learning

Many real-world machine learning applications involve several learning tasks which are inter-related. For example, in the healthcare domain, we need to learn a predictive model of a certain disease for many hospitals. The models for each hospital may be different because of the inherent differences in the distributions of the patient populations. However, the models are also closely related because of the nature of the learning tasks modeling the same disease. By simultaneously learning all the tasks, the multi-task learning (MTL) paradigm performs inductive knowledge transfer among tasks to improve the generalization performance. When datasets for the learning tasks are stored at different locations, it may not always be feasible to transfer the data to provide a data-centralized computing environment due to various practical issues such as high data volume and privacy. In this paper, we propose a principled MTL framework for distributed and asynchronous optimization to address the aforementioned challenges. In our framework, the gradient update does not wait for collecting the gradient information from all the tasks. Therefore, the proposed method is very efficient when the communication delay is too high for some task nodes. We show that many regularized MTL formulations can benefit from this framework, including the low-rank MTL for shared subspace learning. Empirical studies on both synthetic and real-world datasets demonstrate the efficiency and effectiveness of the proposed framework.

preprint2016arXiv

Coordinate Friendly Structures, Algorithms and Applications

This paper focuses on coordinate update methods, which are useful for solving problems involving large or high-dimensional datasets. They decompose a problem into simple subproblems, where each updates one, or a small block of, variables while fixing others. These methods can deal with linear and nonlinear mappings, smooth and nonsmooth functions, as well as convex and nonconvex problems. In addition, they are easy to parallelize. The great performance of coordinate update methods depends on solving simple sub-problems. To derive simple subproblems for several new classes of applications, this paper systematically studies coordinate-friendly operators that perform low-cost coordinate updates. Based on the discovered coordinate friendly operators, as well as operator splitting techniques, we obtain new coordinate update algorithms for a variety of problems in machine learning, image processing, as well as sub-areas of optimization. Several problems are treated with coordinate update for the first time in history. The obtained algorithms are scalable to large instances through parallel and even asynchronous computing. We present numerical examples to illustrate how effective these algorithms are.

preprint2016arXiv

Spin-Cherenkov effect in a magnetic nanostrip with interfacial Dzyaloshinskii-Moriya interaction

The spin-Cherenkov effect enables strong excitations of spin waves (SWs) with nonlinear wave dispersions. The Dzyaloshinskii-Moriya interaction (DMI) results in anisotropy and nonreciprocity of SW propagation. In this work, we study the effect of the interfacial DMI on SW Cherenkov excitations in permalloy thin-film strips within the framework of micromagnetism. By performing micromagnetic simulations, it is shown that coherent SWs are excited when the velocity of a moving magnetic source exceeds the propagation velocity of the SWs. Moreover, the threshold velocity of the moving magnetic source with finite DMI can be reduced compared to the case of zero DMI. It thereby provides a promising route towards efficient SW generation and propagation, with potential applications in spintronic and magnonic devices.

preprint2015arXiv

Self Equivalence of the Alternating Direction Method of Multipliers

The alternating direction method of multipliers (ADM or ADMM) breaks a complex optimization problem into much simpler subproblems. The ADM algorithms are typically short and easy to implement yet exhibit (nearly) state-of-the-art performance for large-scale optimization problems. To apply ADM, we first formulate a given problem into the "ADM-ready" form, so the final algorithm depends on the formulation. A problem like $\mbox{minimize}_\mathbf{x} u(\mathbf{x}) + v(\mathbf{C}\mathbf{x})$ has six different "ADM-ready" formulations. They can be in the primal or dual forms, and they differ by how dummy variables are introduced. To each "ADM-ready" formulation, ADM can be applied in two different orders depending on how the primal variables are updated. Finally, we get twelve different ADM algorithms! How do they compare to each other? Which algorithm should one choose? In this paper, we show that many of the different ways of applying ADM are equivalent. Specifically, we show that ADM applied to a primal formulation is equivalent to ADM applied to its Lagrange dual; ADM is equivalent to a primal-dual algorithm applied to the saddle-point formulation of the same problem. These results are surprising since the primal and dual variables in ADM are seemingly treated very differently, and some previous work exhibit preferences in one over the other on specific problems. In addition, when one of the two objective functions is quadratic, possibly subject to an affine constraint, we show that swapping the update order of the two primal variables in ADM gives the same algorithm. These results identify the few truly different ADM algorithms for a problem, which generally have different forms of subproblems from which it is easy to pick one with the most computationally friendly subproblems.

preprint2014arXiv

An octave spanning mid-infrared frequency comb generated in a silicon nanophotonic wire waveguide

We demonstrate an octave-spanning frequency comb with a spectrum covering wavelengths from 1,540 nm up to 3,200 nm. The supercontinuum is generated by pumping a 1-cm long dispersion engineered silicon wire waveguide by 70 fs pulses with an energy of merely 15 pJ. We confirm the phase coherence of the output spectrum by beating the supercontinuum with narrow bandwidth CW lasers. We show that the experimental results are in agreement with numerical simulations.

preprint2014arXiv

Differential femtosecond coherent Stokes and anti-Stokes Raman spectroscopy

We demonstrate a novel technique of coherent Raman spectroscopy with a femtosecond laser. We apply to a molecular sample a sequence of pairs of ultrashort excitation and probe pulses, with a linearly increasing time delay between the two pulses from one pair to the next. We measure, as a function of the delay, the intensity modulation in the signal resulting from the differential detection of the Stokes and anti-Stokes radiations generated at the sample. The Fourier transform of such time-domain signal reveals the spectrum of the excited vibrational Raman transitions. The experimental proof-of-principle demonstrates high resolution, broad spectral span and suppression of the non-resonant background, as well as sensitivity enhancement due to the differential detection.

preprint2014arXiv

Fast Adaptive Algorithm for Robust Evaluation of Quality of Experience

Outlier detection is an integral part of robust evaluation for crowdsourceable Quality of Experience (QoE) and has attracted much attention in recent years. In QoE for multimedia, outliers happen because of different test conditions, human errors, abnormal variations in context, etc. In this paper, we propose a simple yet effective algorithm for outlier detection and robust QoE evaluation named iterative Least Trimmed Squares (iLTS). The algorithm assigns binary weights to samples, i.e., 0 or 1 indicating if a sample is an outlier, then the outlier-trimmed subset least squares solutions give robust ranking scores. An iterative optimization is carried out alternately between updating weights and ranking scores, which converges to a local optimizer in finite steps. In our test setting, iLTS is up to 190 times faster than LASSO-based methods with a comparable performance. Moreover, a varied version of this method shows adaptation in outlier detection, which provides an automatic detection to determine whether a data sample is an outlier without a priori knowledge about the amount of the outliers. The effectiveness and efficiency of iLTS are demonstrated on both simulated examples and real-world applications. A Matlab package is provided to researchers exploiting crowdsourcing paired comparison data for robust ranking.
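
The alternation between binary weights and a least-squares fit can be sketched on a toy problem. The sketch below is an assumption-laden stand-in, not the authors' iLTS: it fits a straight line rather than ranking scores from paired comparisons, it requires the number of samples to keep (`n_keep`) up front, and every function name and data value is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def trimmed_least_squares(xs, ys, n_keep, max_iter=100):
    """Alternate between fitting on kept samples and re-selecting them.

    weights[i] is 1 if sample i is currently kept and 0 if it is trimmed
    as a suspected outlier. Iteration stops when the kept set is stable.
    """
    n = len(xs)
    weights = [1] * n
    a = b = 0.0
    for _ in range(max_iter):
        kept = [i for i in range(n) if weights[i] == 1]
        a, b = fit_line([xs[i] for i in kept], [ys[i] for i in kept])
        sq_res = [(ys[i] - (a * xs[i] + b)) ** 2 for i in range(n)]
        # Keep the n_keep samples with the smallest squared residuals.
        order = sorted(range(n), key=lambda i: sq_res[i])
        keep_set = set(order[:n_keep])
        new_weights = [1 if i in keep_set else 0 for i in range(n)]
        if new_weights == weights:
            break
        weights = new_weights
    return a, b, weights


xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[4], ys[5] = 100, -80          # two hypothetical gross outliers
a, b, weights = trimmed_least_squares(xs, ys, n_keep=8)
print(a, b, weights)
```

Because each pass can only keep or lower the trimmed sum of squares and there are finitely many subsets, a loop of this kind stops after finitely many steps, which mirrors the finite-step convergence the abstract mentions.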

preprint2014arXiv

Few-cycle, Broadband, Mid-infrared Optical Parametric Oscillator Pumped by a 20-fs Ti:sapphire Laser

We report a few-cycle, broadband, singly-resonant optical parametric oscillator (OPO) for the mid-infrared based on MgO-doped periodically-poled LiNbO3 (MgO:PPLN), synchronously pumped by a 20-fs Ti:sapphire laser. By using crystal interaction lengths as short as 250 µm, and careful dispersion management of input pump pulses and the OPO resonator, near-transform-limited, few-cycle idler pulses tunable across the mid-infrared have been generated, with as few as 3.7 optical cycles at 2682 nm. The OPO can be continuously tuned over 2179-3732 nm by cavity delay tuning, providing up to 33 mW of output power at 3723 nm. The idler spectra exhibit stable broadband profiles with bandwidths spanning over 422 nm (FWHM) recorded at 3732 nm. We investigate the effect of crystal length on spectral bandwidth and pulse duration at a fixed wavelength, confirming near-transform-limited idler pulses for all grating interaction lengths. By locking the repetition frequency of the pump laser to a radio-frequency reference, and without active stabilization of the OPO cavity length, an idler power stability better than 1.6% rms over >2.75 hours is obtained when operating at maximum output power, in excellent spatial beam quality with TEM00 mode profile.

preprint2014arXiv

Nonconvex Sorted $\ell_1$ Minimization for Sparse Approximation

The $\ell_1$ norm is the tight convex relaxation for the $\ell_0$ "norm" and has been successfully applied for recovering sparse signals. For problems with fewer samplings, one needs to enhance the sparsity by nonconvex penalties such as $\ell_p$ "norm". As one method for solving $\ell_p$ minimization problems, iteratively reweighted $\ell_1$ minimization updates the weight for each component based on the value of the same component at the previous iteration. It assigns large weights on small components in magnitude and small weights on large components in magnitude. In this paper, we consider a weighted $\ell_1$ penalty with the set of the weights fixed and the weights are assigned based on the sort of all the components in magnitude. The smallest weight is assigned to the largest component in magnitude. This new penalty is called nonconvex sorted $\ell_1$. Then we propose two methods for solving nonconvex sorted $\ell_1$ minimization problems: iteratively reweighted $\ell_1$ minimization and iterative sorted thresholding, and prove that both methods will converge to a local optimum. We also show that both methods are generalizations of iterative support detection and iterative hard thresholding respectively. The numerical experiments demonstrate the better performance of assigning weights by sort compared to $\ell_p$ minimization.
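
The sorted penalty described in this abstract can be written as $\sum_i \lambda_i |x|_{[i]}$, where $|x|_{[1]} \ge |x|_{[2]} \ge \dots$ are the magnitudes in decreasing order and $\lambda_1 \le \lambda_2 \le \dots$ are the fixed weights, so the smallest weight meets the largest magnitude. A minimal sketch of evaluating it follows; the function name and the sample values are hypothetical, and the weight set is assumed to have the same length as the vector.

```python
def sorted_l1_penalty(x, weights):
    """Evaluate the sorted weighted l1 penalty.

    The smallest weight is paired with the largest |x_i|, the second
    smallest weight with the second largest |x_i|, and so on.
    """
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same length")
    magnitudes = sorted((abs(v) for v in x), reverse=True)  # decreasing
    ordered_weights = sorted(weights)                       # increasing
    return sum(w * m for w, m in zip(ordered_weights, magnitudes))


# 0.1 * 3 + 0.5 * 2 + 1.0 * 1 = 2.3 (up to floating-point rounding)
print(sorted_l1_penalty([3.0, -1.0, 2.0], [0.1, 1.0, 0.5]))
```

With all weights equal to one the value reduces to the ordinary $\ell_1$ norm; pairing small weights with large entries is what makes the penalty nonconvex.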

preprint2014arXiv

One condition for solution uniqueness and robustness of both l1-synthesis and l1-analysis minimizations

The $\ell_1$-synthesis model and the $\ell_1$-analysis model recover structured signals from their undersampled measurements. The solution of the former is a sparse sum of dictionary atoms, and that of the latter makes sparse correlations with dictionary atoms. This paper addresses the question: when can we trust these models to recover specific signals? We answer the question with a condition that is both necessary and sufficient to guarantee the recovery to be unique and exact and, in the presence of measurement noise, to be robust. The condition is one-for-all in the sense that it applies to both the $\ell_1$-synthesis and $\ell_1$-analysis models, to both their constrained and unconstrained formulations, and to both the exact recovery and robust recovery cases. Furthermore, a convex infinity-norm program is introduced for numerically verifying the condition. A comprehensive comparison with related existing conditions is included.

preprint2014arXiv

The Continuity of Images by Transmission Imaging Revisited

Transmission imaging, as an important imaging technique widely used in astronomy, medical diagnosis, and biology science, has been shown in [49] to be quite different from reflection imaging used in our everyday life. Understanding the structures of images (the prior information) is important for designing, testing, and choosing image processing methods, and good image processing methods are helpful for further uses of the image data, e.g., increasing the accuracy of the object reconstruction methods in transmission imaging applications. In reflection imaging, the images are usually modeled as discontinuous functions and even piecewise constant functions. In transmission imaging, it was shown very recently in [49] that almost all images are continuous functions. However, the author in [49] considered only the case of parallel beam geometry and used some overly strong assumptions in the proof, which exclude some common cases such as cylindrical objects. In this paper, we consider more general beam geometries and simplify the assumptions by using totally different techniques. In particular, we will prove that almost all images in transmission imaging with both parallel and divergent beam geometries (the two most typical beam geometries) are continuous functions, under much weaker assumptions than those in [49], which admit almost all practical cases. Besides, taking our analysis into account, we compare two image processing methods for Poisson noise (which is the most significant noise in transmission imaging) removal. Numerical experiments will be provided to demonstrate our analysis.

preprint2013arXiv

Restoration of Images Corrupted by Impulse Noise and Mixed Gaussian Impulse Noise using Blind Inpainting

This article studies the problem of image restoration of observed images corrupted by impulse noise and mixed Gaussian impulse noise. Since the pixels damaged by impulse noise contain no information about the true image, how to find this set correctly is a very important problem. We propose two methods based on blind inpainting and $\ell_0$ minimization that can simultaneously find the damaged pixels and restore the image. By iteratively restoring the image and updating the set of damaged pixels, these methods have better performance than other methods, as shown in the experiments. In addition, we provide convergence analysis for these methods: the algorithms converge to coordinatewise minimum points. With some modifications in the algorithms, they converge to local minimum points (or do so with probability one).

preprint2010arXiv

The magnonic limit of domain wall propagation in ferromagnetic nanotubes

We report a study on the field-driven propagation of vortex-like domain walls in ferromagnetic nanotubes. This particular geometry gives rise to a special feature of the static wall configuration, which significantly influences its dynamics. Unlike domain walls in flat strips, the left-right symmetry of domain wall propagation is broken. Furthermore, the domain wall velocity is not limited by the Walker breakdown. Under sufficiently large magnetic fields, the domain wall velocity reaches the velocity of spin waves (about 1000 m/s) and is thereafter connected with a direct emission of spin waves. The moving domain wall maintains its main structure but has characteristic spin-wave tails attached. The spatial profile of this topological soliton is determined by the spin-wave dispersion.

44 published item(s)

Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion

Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping

Real-Time Lane Detection via Efficient Feature Alignment and Covariance Optimization for Low-Power Embedded Systems

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model

A Comprehensive Study on Optimizing Systems with Data Processing Units

LARP: Language-Agent Role Play for Open-World Games

A Scheme to fabricate magnetic graphene-like cobalt nitride CoN4 monolayer proposed by first-principles calculations

DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning

HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus

On the improved conditions for some primal-dual algorithms

Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding

Strong large scale magnetic fields in rotating convection-driven dynamos: the important role of magnetic diffusion

WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types

CoRe: An Efficient Coarse-refined Training Framework for BERT

New convergence analysis of a primal-dual algorithm with large stepsizes

On linear convergence of two decentralized algorithms

A Multi-Agent Primal-Dual Strategy for Composite Optimization over Distributed Features

A Novel Regularization Based on the Error Function for Sparse Recovery

Accelerated Schemes for the $L_1/L_2$ Minimization

Efficient Hyperparameter Optimization in Deep Learning Using a Variable Length Genetic Algorithm

Fast algorithms for robust principal component analysis with an upper bound on the rank

PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation

A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates

A Double Residual Compression Algorithm for Efficient Distributed Learning

Fast Signal Recovery from Saturated Measurements by Linear Loss and Nonconvex Penalties

Mixed one-bit compressive sensing with applications to overexposure correction for CT reconstruction

On the Convergence of Asynchronous Parallel Iteration with Unbounded Delays

A Multiphase Image Segmentation Based on Fuzzy Membership Functions and L1-norm Fidelity

ARock: an Algorithmic Framework for Asynchronous Parallel Coordinate Updates

Asynchronous Multi-Task Learning

Coordinate Friendly Structures, Algorithms and Applications

Spin-Cherenkov effect in a magnetic nanostrip with interfacial Dzyaloshinskii-Moriya interaction

Self Equivalence of the Alternating Direction Method of Multipliers

An octave spanning mid-infrared frequency comb generated in a silicon nanophotonic wire waveguide

Differential femtosecond coherent Stokes and anti-Stokes Raman spectroscopy

Fast Adaptive Algorithm for Robust Evaluation of Quality of Experience

Few-cycle, Broadband, Mid-infrared Optical Parametric Oscillator Pumped by a 20-fs Ti:sapphire Laser

Nonconvex Sorted $\ell_1$ Minimization for Sparse Approximation

One condition for solution uniqueness and robustness of both l1-synthesis and l1-analysis minimizations

The Continuity of Images by Transmission Imaging Revisited

Restoration of Images Corrupted by Impulse Noise and Mixed Gaussian Impulse Noise using Blind Inpainting

The magnonic limit of domain wall propagation in ferromagnetic nanotubes