Source author record

Jiwu Shu

Jiwu Shu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Topics: Distributed, Parallel, and Cluster Computing; Machine Learning; Artificial Intelligence; Discrete Mathematics; Information Theory; math.CO; math.IT

Catalog footprint

What is connected

3 works

7 topics

4 close collaborators

Preprint, 2026, arXiv

Efficient Training on Multiple Consumer GPUs with RoundPipe

Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism (PP) combined with CPU offloading mitigates these hardware bottlenecks by reducing communication overhead. However, existing PP schedules suffer from an inherent limitation termed the weight binding issue. Binding uneven model stages (e.g., a large LM head) to GPUs limits the pipeline's throughput to that of the GPU with the heaviest load, leading to severe pipeline bubbles. In this paper, we propose RoundPipe, a novel pipeline schedule that breaks the weight binding constraint on consumer GPU servers. RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round-robin manner, achieving a near-zero-bubble pipeline. To ensure training correctness and system efficiency, RoundPipe integrates a priority-aware transfer scheduling engine, a fine-grained distributed event-based synchronization protocol, and an automated layer partitioning algorithm. Evaluations on an 8$\times$ RTX 4090 server demonstrate that RoundPipe achieves 1.48--2.16$\times$ speedups over state-of-the-art baselines when fine-tuning 1.7B to 32B models. Remarkably, RoundPipe enables LoRA fine-tuning of the Qwen3-235B model with 31K sequence length on a single server. RoundPipe is publicly available as an open-source Python library with comprehensive documentation.
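The weight binding issue described in this abstract can be illustrated with a toy load model. The sketch below is not RoundPipe's scheduler: the abstract does not specify the dispatch rule, so the rotation used here, the function names, and the stage costs are all assumptions made for illustration. It also ignores stage dependencies, weight transfers and synchronization, and only compares how much compute lands on each GPU.

```python
def bound_loads(stage_costs, num_microbatches):
    """Static binding: GPU i runs stage i for every micro-batch."""
    return [cost * num_microbatches for cost in stage_costs]


def round_robin_loads(stage_costs, num_microbatches, num_gpus):
    """Assumed rule: rotate the stage-to-GPU mapping by one GPU per micro-batch."""
    loads = [0] * num_gpus
    for mb in range(num_microbatches):
        for stage, cost in enumerate(stage_costs):
            loads[(mb + stage) % num_gpus] += cost
    return loads


# Hypothetical costs: the last stage (e.g. one holding the LM head) is 3x heavier.
stage_costs = [1, 1, 1, 3]
bound = bound_loads(stage_costs, num_microbatches=8)
pooled = round_robin_loads(stage_costs, num_microbatches=8, num_gpus=4)

# The busiest GPU sets the pace of the whole pipeline.
print("bound:", bound, "bottleneck:", max(bound))
print("round-robin:", pooled, "bottleneck:", max(pooled))
```

In this toy setting the busiest GPU carries 24 cost units under static binding and 12 under rotation, because every GPU takes a turn at the heavy stage instead of one GPU owning it.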

Preprint, 2020, arXiv

Sapphire: Automatic Configuration Recommendation for Distributed Storage Systems

Modern distributed storage systems come with a plethora of configurable parameters that control module behavior and affect system performance. Default settings provided by developers are often suboptimal for specific user cases. Tuning parameters can provide significant performance gains but is a difficult task requiring profound experience and expertise, due to the immense number of configurable parameters, complex inner dependencies and non-linear system behaviors. To overcome these difficulties, we propose an automatic simulation-based approach, Sapphire, to recommend optimal configurations by leveraging machine learning and black-box optimization techniques. We evaluate Sapphire on Ceph. Results show that Sapphire significantly boosts Ceph performance to 2.2x compared to the default configuration.
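As a rough illustration of black-box configuration tuning in general, the sketch below runs a plain random search over a small configuration space. It is not Sapphire's method: the abstract does not say which optimizer or learning model is used. The parameter names are hypothetical (they are not Ceph options), and `toy_throughput` is an invented stand-in for a real benchmark or simulator run.

```python
import random

# Hypothetical tunables and candidate values (not real Ceph parameters).
SEARCH_SPACE = {
    "cache_size_mb": [64, 128, 256, 512],
    "io_threads": [1, 2, 4, 8],
    "batch_size": [16, 32, 64],
}
DEFAULT_CONFIG = {"cache_size_mb": 64, "io_threads": 1, "batch_size": 16}


def toy_throughput(config):
    """Invented non-linear objective standing in for a measured benchmark."""
    penalty = 1 + abs(config["batch_size"] - 32) / 32
    return (config["cache_size_mb"] ** 0.5) * config["io_threads"] / penalty


def random_search(objective, space, default, trials, seed=0):
    """Treat the objective as a black box: sample configs, keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = dict(default), objective(default)
    for _ in range(trials):
        candidate = {name: rng.choice(values) for name, values in space.items()}
        score = objective(candidate)
        if score > best_score:
            best_config, best_score = candidate, score
    return best_config, best_score


best_config, best_score = random_search(
    toy_throughput, SEARCH_SPACE, DEFAULT_CONFIG, trials=50
)
print(best_config, best_score / toy_throughput(DEFAULT_CONFIG))
```

The search starts from the default configuration, so the recommended configuration is never worse than the default under the chosen objective. A model-guided optimizer would replace the uniform sampling step.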

Preprint, 2012, arXiv

C-Codes: Cyclic Lowest-Density MDS Array Codes Constructed Using Starters for RAID 6

The distance-3 cyclic lowest-density MDS array code (called the C-Code) is a good candidate for RAID 6 because of its optimal storage efficiency, optimal update complexity, optimal length, and cyclic symmetry. In this paper, the underlying connections between C-Codes (or quasi-C-Codes) and starters in group theory are revealed. It is shown that each C-Code (or quasi-C-Code) of length $2n$ can be constructed using an even starter (or even multi-starter) in $(Z_{2n},+)$. It is also shown that each C-Code (or quasi-C-Code) has a twin C-Code (or quasi-C-Code). Then, four infinite families (three of which are new) of C-Codes of length $p-1$ are constructed, where $p$ is a prime. Besides the family of length $p-1$, C-Codes for some sporadic even lengths are also presented. Even so, there are still some even lengths (such as 8) for which C-Codes do not exist. To cover this limitation, two infinite families (one of which is new) of quasi-C-Codes of length $2(p-1)$ are constructed for these even lengths.
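The abstract does not define an even starter. The sketch below assumes the usual combinatorial-design definition: an even starter in $(Z_{2n},+)$ is a set of $n-1$ unordered pairs whose entries are distinct nonzero elements (so exactly one nonzero element is left out) and whose differences $\pm(x-y)$ cover every nonzero element except $n$ exactly once. The paper may impose further conditions for a starter to yield a C-Code, so this checks the starter property only. The function name is hypothetical.

```python
def is_even_starter(pairs, order):
    """Check the assumed even-starter conditions in (Z_order, +), order = 2n."""
    if order % 2 != 0:
        return False
    n = order // 2
    if len(pairs) != n - 1:
        return False

    # Entries must be distinct nonzero residues (one nonzero residue is unused).
    elements = [x % order for pair in pairs for x in pair]
    if 0 in elements or len(set(elements)) != 2 * (n - 1):
        return False

    # Differences +/-(x - y) must hit every nonzero residue except n, once each.
    diffs = []
    for x, y in pairs:
        diffs.append((x - y) % order)
        diffs.append((y - x) % order)
    expected = set(range(1, order)) - {n}
    return sorted(diffs) == sorted(expected)


# Small hand-checked example in Z_6 (n = 3): differences are {1, 5} and {2, 4}.
print(is_even_starter([(1, 2), (3, 5)], order=6))
# Both pairs have difference +/-1, so this is not a starter.
print(is_even_starter([(1, 2), (3, 4)], order=6))
```

In the first example the pairs use the elements 1, 2, 3 and 5, leaving 4 unused, and their differences cover 1, 2, 4 and 5 while skipping 3.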