Source author record

Hongwei Sun

Hongwei Sun appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

math.DG; Distributed, Parallel, and Cluster Computing; Machine Learning; math.MG; math.ST; Statistics Theory

Catalog footprint

What is connected

8 works

6 topics

4 close collaborators

preprint, 2026, arXiv

Relay Buffer Independent Communication over Pooled HBM for Efficient MoE Inference on Ascend

Mixture-of-Experts (MoE) inference requires large-scale token exchange across devices, making dispatch and combine major bottlenecks in both prefill and decode. Beyond network transfer, routing-driven layout transformation, temporary relay, and output restoration can add substantial overhead. Existing MoE communication paths are often buffer-centric, using explicit inter-process relay and reordering buffers around collective transfer. This report presents a relay-buffer-free communication design for MoE inference acceleration on Ascend systems. The design reorganizes dispatch and combine around direct placement into destination expert windows and direct reading from remote expert windows. Built on globally pooled high-bandwidth memory and symmetric-memory allocation, it removes most intermediate relay and reordering buffers while retaining only lightweight control state, including counts, offsets, and synchronization metadata. We instantiate the design as two schedules for the main phases of MoE inference: a prefill schedule with richer planning state for throughput-oriented execution, and a compact decode schedule for latency-sensitive execution. Experiments on Ascend-based MoE workloads show reduced dispatch and combine latency in both settings. At the serving level, the implementation improves time to first token (TTFT), preserves competitive time per output token (TPOT), and enlarges the feasible scheduling space under practical latency constraints. These results indicate that, on platforms with globally addressable device memory, reducing intermediate buffering and output restoration around expert execution is an effective direction for accelerating MoE inference.
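
The dispatch and combine steps described in this abstract can be illustrated with a small single-process sketch. It is not the authors' implementation and does not use any Ascend API. It assumes top-1 routing, models the pooled memory as one Python list, and all function names (`plan_dispatch`, `dispatch`, `run_experts`, `combine`) are hypothetical. The only state kept besides the data is counts, offsets, and a per-token slot index, so tokens are written straight into their expert window and read straight back without a separate reordering buffer.

```python
def plan_dispatch(expert_ids, num_experts):
    """Compute per-expert token counts and the start offset of each expert window."""
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    offsets = [0] * num_experts
    for e in range(1, num_experts):
        offsets[e] = offsets[e - 1] + counts[e - 1]
    return counts, offsets


def dispatch(tokens, expert_ids, num_experts):
    """Place each token directly into its destination expert window."""
    counts, offsets = plan_dispatch(expert_ids, num_experts)
    pool = [None] * len(tokens)      # stand-in for pooled memory (assumption)
    cursor = offsets[:]              # next free slot in each expert window
    slot_of_token = [0] * len(tokens)
    for i, (tok, e) in enumerate(zip(tokens, expert_ids)):
        pool[cursor[e]] = tok
        slot_of_token[i] = cursor[e]
        cursor[e] += 1
    return pool, counts, offsets, slot_of_token


def run_experts(pool, counts, offsets, expert_fns):
    """Apply each expert to the contiguous window it owns."""
    out = pool[:]
    for e, fn in enumerate(expert_fns):
        for s in range(offsets[e], offsets[e] + counts[e]):
            out[s] = fn(pool[s])
    return out


def combine(expert_out, slot_of_token):
    """Read each token's result directly from its expert window, in token order."""
    return [expert_out[s] for s in slot_of_token]


tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [1, 0, 1, 0]            # top-1 routing decision per token
experts = [lambda x: x * 10, lambda x: x + 100]

pool, counts, offsets, slots = dispatch(tokens, expert_ids, 2)
result = combine(run_experts(pool, counts, offsets, experts), slots)
print(result)  # [101.0, 20.0, 103.0, 40.0]
```

In a real multi-device setting the writes and reads would target remote memory and need the synchronization metadata the abstract mentions; the sketch omits that entirely.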

preprint, 2020, arXiv

Optimal Rates of Distributed Regression with Imperfect Kernels

Distributed machine learning systems have been receiving increasing attention for their efficiency in processing large-scale data. Many distributed frameworks have been proposed for different machine learning tasks. In this paper, we study distributed kernel regression via the divide and conquer approach. This approach has been proved asymptotically minimax optimal if the kernel is perfectly selected so that the true regression function lies in the associated reproducing kernel Hilbert space. However, this is usually, if not always, impractical because kernels that can only be selected via prior knowledge or a tuning process are hardly perfect. Instead it is more common that the kernel is good enough but imperfect in the sense that the true regression function can be well approximated by, but does not lie exactly in, the kernel space. We show that distributed kernel regression can still achieve the capacity independent optimal rate in this case. To this end, we first establish a general framework that allows one to analyze distributed regression with response weighted base algorithms by bounding the error of such algorithms on a single data set, provided that the error bounds have factored in the impact of the unexplained variance of the response variable. Then we perform a leave one out analysis of kernel ridge regression and bias corrected kernel ridge regression, which in combination with the aforementioned framework allows us to derive sharp error bounds and capacity independent optimal rates for the associated distributed kernel regression algorithms. As a byproduct of the thorough analysis, we also prove that kernel ridge regression can achieve rates faster than $N^{-1}$ (where $N$ is the sample size) in the noise free setting which, to the best of our knowledge, are first observed and novel in regression learning.
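
As a rough illustration of the divide and conquer approach named in this abstract, the sketch below fits a plain kernel ridge regressor on each data block and averages the block predictions. It is a minimal sketch under stated assumptions, not the paper's algorithm: it uses a Gaussian kernel with a fixed bandwidth, one-dimensional inputs, no bias correction, and a small hand-written linear solver so it runs without third-party packages. All function names and parameter values are hypothetical.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=0.2):
    """Gaussian kernel on scalars; the bandwidth is an illustrative choice."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_krr(xs, ys, lam=1e-3):
    """Kernel ridge regression: solve (K + n * lam * I) alpha = y."""
    n = len(xs)
    gram = [[gaussian_kernel(xs[i], xs[j]) + (n * lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve_linear_system(gram, ys)
    return lambda x: sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, xs))


def fit_distributed_krr(xs, ys, num_blocks, lam=1e-3):
    """Fit one KRR model per block and average the block predictions."""
    models = [fit_krr(xs[b::num_blocks], ys[b::num_blocks], lam)
              for b in range(num_blocks)]
    return lambda x: sum(m(x) for m in models) / num_blocks


rng = random.Random(0)
xs = [rng.uniform(0.0, 1.0) for _ in range(120)]
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.1) for x in xs]

model = fit_distributed_krr(xs, ys, num_blocks=4)
print(model(0.25))  # the noise-free target at x = 0.25 is sin(pi / 2) = 1
```

Each block solves a system of size N / m instead of N, which is the computational motivation for the approach; the abstract's contribution concerns the statistical rates of such averaged estimators, which this sketch does not verify.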

preprint, 2020, arXiv

Quasi-convex subsets in Alexandrov spaces with lower curvature bound

In this paper, we introduce quasi-convex subsets in Alexandrov spaces with lower curvature bound, which include not only all closed convex subsets without boundary but also all extremal subsets. Moreover, we explore several essential properties of such subsets, including a generalized Liberman theorem. It turns out that the quasi-convex subset is a nice and fundamental concept to illustrate the similarities and differences between Riemannian manifolds and Alexandrov spaces with lower curvature bound.

preprint, 2016, arXiv

An Isometrical ${\Bbb C\Bbb P}^{n}$-Theorem

Let $M^n\ (n\geq3)$ be a complete Riemannian manifold with $\sec_M\geq 1$, and let $M_i^{n_i}$ ($i=1,2$) be two complete totally geodesic submanifolds in $M$. We prove that if $n_1+n_2=n-2$ and if the distance $|M_1M_2|\geq\frac{\pi}{2}$, then $M_i$ is isometric to $\Bbb S^{n_i}/\Bbb Z_h$, ${\Bbb C\Bbb P}^{\frac {n_i}2}$ or ${\Bbb C\Bbb P}^{\frac {n_i}2}/\Bbb Z_2$ with the canonical metric when $n_i>0$, and thus $M$ is isometric to $\Bbb S^n/\Bbb Z_h$, ${\Bbb C\Bbb P}^{\frac n2}$ or ${\Bbb C\Bbb P}^{\frac n2}/\Bbb Z_2$ except possibly when $n=3$ and $M_1$ (or $M_2$) $\stackrel{\rm iso}{\cong}\Bbb S^{1}/\Bbb Z_h$ with $h\geq 2$ or $n=4$ and $M_1$ (or $M_2$) $\stackrel{\rm iso}{\cong}\Bbb{RP}^2$.

preprint, 2015, arXiv

On the Blaschke's Conjecture

Blaschke's conjecture asserts that if $\operatorname{diam}(M)=\operatorname{Inj}(M)=\frac\pi2$ (up to a rescaling) for a complete Riemannian manifold $M$, then $M$ is isometric to $\Bbb S^n(\frac12)$, ${\Bbb R\Bbb P}^{n}$, ${\Bbb C\Bbb P}^{n}$, ${\Bbb H\Bbb P}^{n}$ or ${\Bbb Ca\Bbb P}^{2}$ endowed with the canonical metric. In this paper, we prove that the conjecture is true if we additionally assume that $\sec_M\geq1$.

preprint, 2014, arXiv

On $\frac\pi2$-separated subsets of Alexandrov spaces with curvature $\geq1$

Let $M$ be an $n$-dimensional Alexandrov space with curvature $\geq 1$, and let $\{q_1,\cdots,q_k\}$ be any $\frac\pi2$-separated subset in $M$ (i.e. the distance $|q_iq_j|\geq\frac{\pi}{2}$ for any $i\neq j$). Under the additional conditions "$|q_iq_j|<\pi$" and "the diameter $\operatorname{diam}(M)\leq \frac\pi2$", we respectively give the upper bound of $k$ (which depends only on $n$), and we classify the (topological or geometric) structure of $M$ when $k$ attains the upper bound.

preprint, 2013, arXiv

Rigidity theorems for glued spaces being suspensions, cones and joins in Alexandrov geometry with curvature bounded below

In this paper, we give rigidity theorems for the case when the glued space of two Alexandrov spaces with curvature bounded below is a suspension, cone or join. We also list some basic properties of joins in the appendix.

preprint, 2010, arXiv

On Almost Isometry Theorem in Alexandrov Spaces with Curvature Bounded Below

In this paper we give a new proof for an almost isometry theorem in Alexandrov spaces with curvature bounded below.