Source author record

Yongchao Liu

Yongchao Liu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Distributed, Parallel, and Cluster Computing; math.OC; Machine Learning; Genomics; Artificial Intelligence; Computation and Language; Software Engineering

Catalog footprint

What is connected

11 works

7 topics

4 close collaborators

Preprint · 2026 · arXiv

AutoRAGTuner: A Declarative Framework for Automatic Optimization of RAG Pipelines

Retrieval-Augmented Generation (RAG) enhances LLMs, but performance is highly sensitive to complex architecture designs and hyper-parameter configurations, which currently rely on inefficient manual tuning. We present AutoRAGTuner, a declarative, configuration-driven framework that automates the RAG life cycle: construction, execution, evaluation, and optimization. AutoRAGTuner employs a modular architecture to decouple pipeline stages through a component registration mechanism. To unify heterogeneous data, we introduce the Domain-Element Model (DEM), representing objects as atomic elements with bidirectional pointers to support nodes, edges, and hyperedges. Furthermore, AutoRAGTuner integrates an adaptive Bayesian optimization engine for end-to-end hyper-parameter tuning. Experimental results demonstrate AutoRAGTuner's architectural generality: across diverse RAG pipelines, ranging from vanilla to graph-based, the framework consistently outperforms default baselines. Notably, AutoRAGTuner significantly reduces engineering overhead: its declarative configuration language enables up to a 95% reduction in code churn for architectural adjustments. Overall, AutoRAGTuner provides a systematically optimizable foundation for building evolvable and reusable RAG systems.

Preprint · 2023 · arXiv

GraphTheta: A Distributed Graph Neural Network Learning System With Flexible Training Strategy

Graph neural networks (GNNs) have been demonstrated as a powerful tool for analyzing non-Euclidean graph data. However, the lack of efficient distributed graph learning systems severely hinders applications of GNNs, especially when graphs are big and GNNs are relatively deep. Herein, we present GraphTheta, the first distributed and scalable graph learning system built upon vertex-centric distributed graph processing with neural network operators implemented as user-defined functions. This system supports multiple training strategies and enables efficient and scalable big-graph learning on distributed (virtual) machines with low memory. To facilitate graph convolutions, GraphTheta puts forward a new graph learning abstraction named NN-TGAR to bridge the gap between graph processing and graph deep learning. A distributed graph engine is proposed to conduct the stochastic gradient descent optimization with a hybrid-parallel execution, and a new cluster-batched training strategy is supported. We evaluate GraphTheta using several datasets with network sizes ranging from small and modest to large scale. Experimental results show that GraphTheta can scale well to 1,024 workers for training an in-house developed GNN on an industry-scale Alipay dataset of 1.4 billion nodes and 4.1 billion attributed edges, with a cluster of CPU virtual machines (dockers), each with small memory (5$\sim$12GB). Moreover, GraphTheta can outperform DistDGL by up to $2.02\times$, with better scalability, and GraphLearn by up to $30.56\times$. As for model accuracy, GraphTheta learns GNNs that are as good as those learned by existing frameworks. To the best of our knowledge, this work presents the largest edge-attributed GNN learning task in the literature.

Preprint · 2022 · arXiv

Asymptotic Properties of $\mathcal{S}$-$\mathcal{AB}$ Method with Diminishing Stepsize

The popular $\mathcal{AB}$/push-pull method for distributed optimization problems may unify many of the existing decentralized first-order methods based on the gradient tracking technique. More recently, a stochastic gradient variant of the $\mathcal{AB}$/push-pull method ($\mathcal{S}$-$\mathcal{AB}$) has been proposed, which converges linearly to a neighborhood of the global minimizer when the stepsize is constant. This paper is devoted to the asymptotic properties of $\mathcal{S}$-$\mathcal{AB}$ with diminishing stepsize. Specifically, under the condition that each local objective is smooth and the global objective is strongly convex, we first establish the boundedness of the iterates of $\mathcal{S}$-$\mathcal{AB}$ and then show that the iterates converge to the global minimizer at the rate $\mathcal{O}\left(1/\sqrt{k}\right)$. Furthermore, the asymptotic normality of Polyak-Ruppert averaged $\mathcal{S}$-$\mathcal{AB}$ is obtained and applications to statistical inference are discussed. Finally, numerical tests are conducted to demonstrate the theoretical results.
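For orientation, the recursion behind this abstract can be sketched in the form commonly used in the gradient-tracking literature. This is not a formula quoted from the abstract, and all notation is an assumption: $x_k$ stacks the agents' iterates, $y_k$ stacks their gradient trackers, $A$ is a row-stochastic mixing matrix, $B$ is a column-stochastic mixing matrix, $\alpha_k$ is the diminishing stepsize, and $g(x_k, \xi_k)$ stacks the local stochastic gradients.

$$
x_{k+1} = A x_k - \alpha_k y_k, \qquad
y_{k+1} = B y_k + g(x_{k+1}, \xi_{k+1}) - g(x_k, \xi_k)
$$

Polyak-Ruppert averaging then reports the running mean of the iterates rather than the last iterate:

$$
\bar{x}_k = \frac{1}{k} \sum_{t=1}^{k} x_t
$$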

Preprint · 2022 · arXiv

Distributed Stochastic Compositional Optimization Problems over Directed Networks

We study distributed stochastic compositional optimization problems over directed communication networks in which agents privately own a stochastic compositional objective function and collaborate to minimize the sum of all objective functions. We propose a distributed stochastic compositional gradient descent method, where the gradient tracking and the stochastic correction techniques are employed to adapt to the networks' directed structure and increase the accuracy of inner function estimation. When the objective function is smooth, the proposed method achieves the convergence rate $\mathcal{O}\left(k^{-1/2}\right)$ and sample complexity $\mathcal{O}\left(\frac{1}{\epsilon^2}\right)$ for finding an $\epsilon$-stationary point. When the objective function is strongly convex, the convergence rate is improved to $\mathcal{O}\left(k^{-1}\right)$. Moreover, the asymptotic normality of Polyak-Ruppert averaged iterates of the proposed method is also presented. We demonstrate the empirical performance of the proposed method on a model-agnostic meta-learning problem and a logistic regression problem.

Preprint · 2022 · arXiv

Stochastic Approximation Based Confidence Regions for Stochastic Variational Inequalities

The sample average approximation (SAA) and the stochastic approximation (SA) are two popular schemes for solving the stochastic variational inequality problem (SVIP). In the past decades, the consistency of the SAA solutions and SA solutions has been well studied. More recently, asymptotic confidence regions of the true solution to SVIP have been constructed for the case where the SAA scheme is implemented. It is of fundamental interest to develop confidence regions of the true solution to the SVIP when the SA scheme is employed. In this paper, we discuss a framework for constructing asymptotic confidence regions for the true solution of SVIP with a focus on the stochastic dual averaging method. We first establish the asymptotic normality of the SA solutions in both the ergodic and non-ergodic senses. Then online methods for estimating the covariance matrices of the normal distributions are studied. Finally, practical procedures for building the asymptotic confidence regions of solutions to SVIP are presented, together with numerical simulations.

Preprint · 2020 · arXiv

Asymptotic properties of dual averaging algorithm for constrained distributed stochastic optimization

Considering the constrained stochastic optimization problem over a time-varying random network, where the agents are to collectively minimize a sum of objective functions subject to a common constraint set, we investigate asymptotic properties of a distributed algorithm based on dual averaging of gradients. Unlike most existing works on distributed dual averaging algorithms, which mainly concentrate on non-asymptotic properties, we prove not only almost sure convergence and the rate of almost sure convergence, but also asymptotic normality and asymptotic efficiency of the algorithm. Firstly, for the general constrained convex optimization problem distributed over a random network, we prove that almost sure consensus can be achieved and the estimates of the agents converge to the same optimal point. For the case of linearly constrained convex optimization, we show that the mirror map of the averaged dual sequence identifies the active constraints of the optimal solution with probability 1, which helps us to prove the almost sure convergence rate and then establish asymptotic normality of the algorithm. Furthermore, we also verify that the algorithm is asymptotically optimal. To the best of our knowledge, this is the first asymptotic normality result for constrained distributed optimization algorithms. Finally, a numerical example is provided to support the theoretical analysis.

Preprint · 2020 · arXiv

Woodpecker-DL: Accelerating Deep Neural Networks via Hardware-Aware Multifaceted Optimizations

Accelerating deep model training and inference is crucial in practice. Existing deep learning frameworks usually concentrate on optimizing training speed and pay less attention to inference-specific optimizations. In fact, model inference differs from training in terms of computation, e.g. parameters are refreshed at each gradient update step during training, but kept invariant during inference. These special characteristics of model inference open new opportunities for its optimization. In this paper, we propose a hardware-aware optimization framework, namely Woodpecker-DL (WPK), to accelerate inference by taking advantage of multiple joint optimizations from the perspectives of graph optimization, automated searches, domain-specific language (DSL) compiler techniques and system-level exploration. In WPK, we investigated two new automated search approaches, based on a genetic algorithm and reinforcement learning, respectively, to hunt for the best operator code configurations targeting specific hardware. A customized DSL compiler is further attached to these search algorithms to generate efficient code. To create an optimized inference plan, WPK systematically explores high-speed operator implementations from third-party libraries in addition to our automatically generated code and singles out the best implementation per operator for use. Extensive experiments demonstrated that on a Tesla P100 GPU, we can achieve a maximum speedup of 5.40 over cuDNN and 1.63 over TVM on individual convolution operators, and run up to 1.18 times faster than TensorRT for end-to-end model inference.

Preprint · 2016 · arXiv

LightScan: Faster Scan Primitive on CUDA Compatible Manycore Processors

Scan (or prefix sum) is a fundamental and widely used primitive in parallel computing. In this paper, we present LightScan, a faster parallel scan primitive for CUDA-enabled GPUs, which investigates a hybrid model combining intra-block computation and inter-block communication to perform a scan. Our algorithm employs warp shuffle functions to implement fast intra-block computation and takes advantage of globally coherent L2 cache and the associated parallel thread execution (PTX) assembly instructions to realize lightweight inter-block communication. Performance evaluation using a single Tesla K40c GPU shows that LightScan outperforms existing GPU algorithms and implementations, and yields a speedup of up to 2.1, 2.4, 1.5 and 1.2 over the leading CUDPP, Thrust, ModernGPU and CUB implementations running on the same GPU, respectively. Furthermore, LightScan runs up to 8.9 and 257.3 times faster than Intel TBB running on 16 CPU cores and an Intel Xeon Phi 5110P coprocessor, respectively. Source code of LightScan is available at http://cupbb.sourceforge.net.
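To make the two-level idea concrete, here is a minimal sequential sketch in Python of an inclusive scan split into an intra-block step and an inter-block carry. It only illustrates the decomposition; it is not LightScan's CUDA implementation, it uses no warp shuffles or PTX instructions, and the function name and `block_size` parameter are hypothetical.

```python
from itertools import accumulate


def blocked_inclusive_scan(values, block_size=4):
    """Inclusive prefix sum computed block by block (illustrative only)."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    result = []
    carry = 0  # sum of all earlier blocks: stands in for inter-block communication
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        local = list(accumulate(block))  # intra-block scan
        result.extend(carry + x for x in local)
        carry += local[-1]  # publish this block's total to later blocks
    return result


if __name__ == "__main__":
    print(blocked_inclusive_scan([3, 1, 4, 1, 5, 9, 2, 6], block_size=3))
```

On a GPU the blocks would run concurrently and the carry would be exchanged between them; here the loop order plays that role.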

Preprint · 2016 · arXiv

Parallel Pairwise Correlation Computation On Intel Xeon Phi Clusters

Co-expression network is a critical technique for the identification of inter-gene interactions, which usually relies on all-pairs correlation (or similar measure) computation between gene expression profiles across multiple samples. Pearson's correlation coefficient (PCC) is one widely used technique for gene co-expression network construction. However, all-pairs PCC computation is computationally demanding for large numbers of gene expression profiles, thus motivating our acceleration of its execution using high-performance computing. In this paper, we present LightPCC, the first parallel and distributed all-pairs PCC computation on Intel Xeon Phi (Phi) clusters. It achieves high speed by exploring the SIMD-instruction-level and thread-level parallelism within Phis as well as accelerator-level parallelism among multiple Phis. To facilitate balanced workload distribution, we have proposed a general framework for symmetric all-pairs computation by building bijective functions between job identifier and coordinate space for the first time. We have evaluated LightPCC and compared it to two CPU-based counterparts: a sequential C++ implementation in ALGLIB and an implementation based on a parallel general matrix-matrix multiplication routine in Intel Math Kernel Library (MKL) (all use double precision), using a set of gene expression datasets. Performance evaluation revealed that with one 5110P Phi and 16 Phis, LightPCC runs up to $20.6\times$ and $218.2\times$ faster than ALGLIB, and up to $6.8\times$ and $71.4\times$ faster than single-threaded MKL, respectively. In addition, LightPCC demonstrated good parallel scalability in terms of number of Phis. Source code of LightPCC is publicly available at http://lightpcc.sourceforge.net.
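The idea of a bijection between a job identifier and a pair coordinate can be sketched as follows. This is one possible mapping over the strict lower triangle, chosen for illustration; it is an assumption and not necessarily the function used in LightPCC. The names `job_to_pair`, `pair_to_job`, `pearson`, and `all_pairs_pcc` are hypothetical.

```python
from math import isqrt, sqrt


def job_to_pair(k):
    """Map job id k >= 0 to a pair (r, c) with c < r (strict lower triangle)."""
    r = (1 + isqrt(8 * k + 1)) // 2
    c = k - r * (r - 1) // 2
    return r, c


def pair_to_job(r, c):
    """Inverse of job_to_pair for c < r."""
    return r * (r - 1) // 2 + c


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles.

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def all_pairs_pcc(profiles):
    """Compute PCC for every unordered pair by enumerating job ids."""
    n = len(profiles)
    num_jobs = n * (n - 1) // 2
    out = {}
    for k in range(num_jobs):  # any contiguous slice of job ids can go to one worker
        r, c = job_to_pair(k)
        out[(r, c)] = pearson(profiles[r], profiles[c])
    return out
```

Because job ids are contiguous integers, splitting `range(num_jobs)` into equal slices gives each worker the same number of pairs, which is the load-balancing benefit such a mapping is meant to provide.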

Preprint · 2014 · arXiv

SWAPHI: Smith-Waterman Protein Database Search on Xeon Phi Coprocessors

The maximal sensitivity of the Smith-Waterman (SW) algorithm has enabled its wide use in biological sequence database search. Unfortunately, the high sensitivity comes at the expense of quadratic time complexity, which makes the algorithm computationally demanding for big databases. In this paper, we present SWAPHI, the first parallelized algorithm employing Xeon Phi coprocessors to accelerate SW protein database search. SWAPHI is designed based on the scale-and-vectorize approach, i.e. it boosts alignment speed by effectively utilizing both the coarse-grained parallelism from the many co-processing cores (scale) and the fine-grained parallelism from the 512-bit wide single instruction, multiple data (SIMD) vectors within each core (vectorize). By searching against the large UniProtKB/TrEMBL protein database, SWAPHI achieves a performance of up to 58.8 billion cell updates per second (GCUPS) on one coprocessor and up to 228.4 GCUPS on four coprocessors. Furthermore, it demonstrates good parallel scalability on a varying number of coprocessors, and when using four coprocessors it is also superior to both SWIPE on 16 high-end CPU cores and BLAST+ on 8 cores, with a maximum speedup of 1.52 and 1.86, respectively. SWAPHI is written in C++ (with a set of SIMD intrinsics), and is freely available at http://swaphi.sourceforge.net.
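The quadratic cost mentioned above comes from filling one dynamic-programming cell per pair of residues. A minimal scalar sketch of the SW score computation in Python follows. It is a simplification and not SWAPHI's implementation: it uses a flat match/mismatch score and a linear gap penalty, whereas a protein search would typically use a substitution matrix and affine gaps, and it has no SIMD or multi-core parallelism. The function name and default scores are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous DP row
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # One "cell update": the unit counted by the GCUPS metric.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "ACGT"))
```

Each query/database pair costs `len(a) * len(b)` cell updates, so throughput in cell updates per second is a natural way to compare implementations.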

Preprint · 2013 · arXiv

High-speed and accurate color-space short-read alignment with CUSHAW2

Summary: We present an extension of CUSHAW2 for fast and accurate alignments of SOLiD color-space short-reads. Our extension introduces a double-seeding approach to improve mapping sensitivity, by combining maximal exact match seeds and variable-length seeds derived from local alignments. We have compared the performance of CUSHAW2 to SHRiMP2 and BFAST by aligning both simulated and real color-space mate-paired reads to the human genome. The results show that CUSHAW2 achieves comparable or better alignment quality compared to SHRiMP2 and BFAST at an order-of-magnitude faster speed and significantly smaller peak resident memory size. Availability: CUSHAW2 and all simulated datasets are available at http://cushaw2.sourceforge.net. Contact: liuy@uni-mainz.de; bertil.schmidt@uni-mainz.de
