Source author record

Feiyu Zhang

Feiyu Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence; Distributed, Parallel, and Cluster Computing; math.NA; Numerical Analysis

Catalog footprint

What is connected

2 works

4 topics

4 close collaborators

Preprint · 2026 · arXiv

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose SPECTRE (Parallel SPECulative Decoding with a Multi-Tenant REmote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft-target overlap under multi-tenant traffic, and draft-side prompt compression to reduce draft latency. We implement SPECTRE in SGLang and evaluate it across multiple draft-target model pairs, reasoning benchmarks, real-world long-context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large-model serving throughput while causing only minor interference to the native workloads of tail-model services. In large-model deployments, including Qwen3-235B-A22B with TP=8, SPECTRE achieves up to 2.28× speedup over autoregressive decoding and up to an additional 66% relative improvement over the strongest speculative decoding baselines. Talk is cheap; we show you the code: https://github.com/sgl-project/sglang/pull/22272.
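For orientation, the commonly cited throughput model for ordinary (draft first, then verify) speculative decoding can be sketched as below. This is not SPECTRE's threshold analysis, which the abstract does not spell out. The sketch assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per verification step, and that `cost_ratio` is the time of one draft step divided by the time of one target step. The function names are hypothetical.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification step.

    Assumes i.i.d. acceptance with probability alpha and gamma drafted
    tokens; the target always contributes one token of its own.
    """
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def ordinary_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Modelled speedup over plain autoregressive decoding.

    One step costs gamma draft passes plus one target pass, measured in
    units of a single target pass (an assumption of this toy model).
    """
    return expected_tokens(alpha, gamma) / (gamma * cost_ratio + 1.0)


if __name__ == "__main__":
    print(expected_tokens(0.8, 4))
    print(ordinary_speedup(0.8, 4, 0.05))
```

In this model the speedup falls as `cost_ratio` grows, which is one reason a serving system might want to overlap drafting with verification or reduce draft latency.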

Preprint · 2022 · arXiv

On sampling Kaczmarz-Motzkin methods for solving large-scale nonlinear systems

We propose a nonlinear sampling Kaczmarz-Motzkin (NSKM) method for solving large-scale nonlinear equations. Based on the local tangential cone condition and Jensen's inequality, we prove convergence of the method under two different assumptions. For nonlinear equations with convex constraints, we then present two variants of the NSKM method: the projected sampling Kaczmarz-Motzkin (PSKM) method and the accelerated projected sampling Kaczmarz-Motzkin (APSKM) method. Their convergence analysis follows from the nonexpansive property of the projection and the convergence of the NSKM method. Numerical results show that the NSKM method with a suitably sized sample outperforms the nonlinear randomized Kaczmarz (NRK) method in computation time. The APSKM and PSKM methods are practical and promising for constrained nonlinear problems.
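The abstract does not state the update rule, so the following is only a minimal sketch of the general sampling Kaczmarz-Motzkin pattern applied to nonlinear equations: draw a small random sample of equations, pick the one with the largest absolute residual, and take a Kaczmarz-type step along its gradient. The names `nskm_solve` and `make_equation`, the toy system, and all parameter values are hypothetical illustrations, not the authors' implementation.

```python
import math
import random


def nskm_solve(residuals, gradients, x0, sample_size, iters, seed=0):
    """Sketch of a sampling Kaczmarz-Motzkin iteration for f_i(x) = 0."""
    rng = random.Random(seed)
    x = list(x0)
    m = len(residuals)
    for _ in range(iters):
        # Sample a subset of equations uniformly without replacement.
        sample = rng.sample(range(m), sample_size)
        # Motzkin-style choice: largest absolute residual within the sample.
        i = max(sample, key=lambda j: abs(residuals[j](x)))
        r = residuals[i](x)
        g = gradients[i](x)
        g2 = sum(c * c for c in g)
        if g2 == 0.0:
            continue  # no usable direction from this equation
        # Kaczmarz-type step on the linearisation of f_i at x.
        x = [xc - (r / g2) * gc for xc, gc in zip(x, g)]
    return x


# Toy system (assumed for illustration): f_i(x) = g(a_i . x) - g(a_i . x_true)
ROWS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0), (2.0, 1.0), (1.0, 2.0)]
X_TRUE = (1.0, -2.0)


def make_equation(a):
    """Build one residual and its gradient for row a of the toy system."""
    dot = lambda x: sum(ai * xi for ai, xi in zip(a, x))
    target = dot(X_TRUE) + 0.25 * math.sin(dot(X_TRUE))
    f = lambda x: dot(x) + 0.25 * math.sin(dot(x)) - target
    grad = lambda x: [(1.0 + 0.25 * math.cos(dot(x))) * ai for ai in a]
    return f, grad


EQUATIONS = [make_equation(a) for a in ROWS]
RESIDUALS = [f for f, _ in EQUATIONS]
GRADIENTS = [g for _, g in EQUATIONS]

if __name__ == "__main__":
    x = nskm_solve(RESIDUALS, GRADIENTS, [0.0, 0.0], sample_size=3, iters=5000)
    print(x)
```

Setting `sample_size=1` reduces the sketch to a plain randomized nonlinear Kaczmarz step, which gives a simple baseline for comparing sample sizes on a problem of interest.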