Source author record

Xiaoming Huo

Xiaoming Huo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Applications Methodology Computation Computation and Language Distributed, Parallel, and Cluster Computing eess.IV math.ST Statistics Theory

Catalog footprint

What is connected

8works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Federated Language Models Under Bandwidth Budgets: Distillation Rates and Conformal Coverage

Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and none makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack $Δ_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across $K$ nodes shrinking the slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.

preprint2026arXiv

Kernel Selection is Model Selection: A Unified Complexity-Penalized Approach for MMD Two-Sample Tests

The Maximum Mean Discrepancy (MMD) is a cornerstone statistic for nonparametric two-sample testing, but its test power is dictated entirely by the chosen kernel. Because any fixed kernel inherently fails to distinguish certain distributions, the kernel must be dynamically optimized. However, data-driven optimization violates the foundational i.i.d. assumption, forcing a strict trade-off in existing frameworks. Ratio criteria ignore this dependence, inducing overfitting and variance collapse on rich kernel classes. Conversely, aggregation methods bypass the dependence using finite grids, but this strategy cannot scale to continuous search spaces like deep kernels. To break this dichotomy, we establish data-driven kernel selection as a model selection problem. We propose Complexity-Penalized MMD (CP-MMD), a criterion derived by applying the two-sample uniform concentration inequality of preceding works to the post-optimization MMD problem. The resulting penalty bounds the empirical MMD by the complexity of the kernel search space, mathematically absorbing the cost of optimization, so that CP-MMD enables direct, grid-free maximization over continuous parametric classes, including scalar bandwidths, polynomial feature bandwidths, and deep network parameters. By formally accounting for optimization complexity, we prove that CP-MMD maximizes true test power while ensuring unconditional Type-I validity. Consequently, CP-MMD enables grid-free kernel selection across linear, polynomial-feature, and deep regimes, matching or exceeding state-of-the-art test power.

preprint2022arXiv

Implicit Regularization Properties of Variance Reduced Stochastic Mirror Descent

In machine learning and statistical data analysis, we often run into objective function that is a summation: the number of terms in the summation possibly is equal to the sample size, which can be enormous. In such a setting, the stochastic mirror descent (SMD) algorithm is a numerically efficient method -- each iteration involving a very small subset of the data. The variance reduction version of SMD (VRSMD) can further improve SMD by inducing faster convergence. On the other hand, algorithms such as gradient descent and stochastic gradient descent have the implicit regularization property that leads to better performance in terms of the generalization errors. Little is known on whether such a property holds for VRSMD. We prove here that the discrete VRSMD estimator sequence converges to the minimum mirror interpolant in the linear regression. This establishes the implicit regularization property for VRSMD. As an application of the above result, we derive a model estimation accuracy result in the setting when the true model is sparse. We use numerical examples to illustrate the empirical power of VRSMD.

preprint2022arXiv

The Directional Bias Helps Stochastic Gradient Descent to Generalize in Kernel Regression Models

We study the Stochastic Gradient Descent (SGD) algorithm in nonparametric statistics: kernel regression in particular. The directional bias property of SGD, which is known in the linear regression setting, is generalized to the kernel regression. More specifically, we prove that SGD with moderate and annealing step-size converges along the direction of the eigenvector that corresponds to the largest eigenvalue of the Gram matrix. In addition, the Gradient Descent (GD) with a moderate or small step-size converges along the direction that corresponds to the smallest eigenvalue. These facts are referred to as the directional bias properties; they may interpret how an SGD-computed estimator has a potentially smaller generalization error than a GD-computed estimator. The application of our theory is demonstrated by simulation studies and a case study that is based on the FashionMNIST dataset.

preprint2020arXiv

Fast and Asymptotically Powerful Detection for Filamentary Objects in Digital Images

Given an inhomogeneous chain embedded in a noisy image, we consider the conditions under which such an embedded chain is detectable. Many applications, such as detecting moving objects, detecting ship wakes, can be abstracted as the detection on the existence of chains. In this work, we provide the detection algorithm with low order of computation complexity to detect the chain and the optimal theoretical detectability regarding SNR (signal to noise ratio) under the normal distribution model. Specifically, we derive an analytical threshold that specifies what is detectable. We design a longest significant chain detection algorithm, with computation complexity in the order of $O(n\log n)$. We also prove that our proposed algorithm is asymptotically powerful, which means, as the dimension $n \rightarrow \infty$, the probability of false detection vanishes. We further provide some simulated examples and a real data example, which validate our theory.

preprint2019arXiv

Asymptotic convergence rate of the longest run in an inflating Bernoulli net

In image detection, one problem is to test whether the set, though mostly consisting of uniformly scattered points, also contains a small fraction of points sampled from some (a priori unknown) curve, for example, a curve with $C^α$-norm bounded by $β$. One approach is to analyze the data by counting membership in multiscale multianisotropic strips, which involves an algorithm that delves into the length of the path connecting many consecutive "significant" nodes. In this paper, we develop the mathematical formalism of this algorithm and analyze the statistical property of the length of the longest significant run. The rate of convergence is derived. Using percolation theory and random graph theory, we present a novel probabilistic model named pseudo-tree model. Based on the asymptotic results for pseudo-tree model, we further study the length of the longest significant run in an "inflating" Bernoulli net. We find that the probability parameter $p$ of significant node plays an important role: there is a threshold $p_c$, such that in the cases of $p<p_c$ and $p>p_c$, very different asymptotic behaviors of the length of the significant are observed. We apply our results to the detection of an underlying curvilinear feature and argue that we achieve the lowest possible detectable strength in theory.

preprint2015arXiv

A Distributed One-Step Estimator

Distributed statistical inference has recently attracted enormous attention. Many existing work focuses on the averaging estimator. We propose a one-step approach to enhance a simple-averaging based distributed estimator. We derive the corresponding asymptotic properties of the newly proposed estimator. We find that the proposed one-step estimator enjoys the same asymptotic properties as the centralized estimator. The proposed one-step approach merely requires one additional round of communication in relative to the averaging estimator; so the extra communication burden is insignificant. In finite sample cases, numerical examples show that the proposed estimator outperforms the simple averaging estimator with a large margin in terms of the mean squared errors. A potential application of the one-step approach is that one can use multiple machines to speed up large scale statistical inference with little compromise in the quality of estimators. The proposed method becomes more valuable when data can only be available at distributed machines with limited communication bandwidth.

preprint2014arXiv

Fast Computing for Distance Covariance

Distance covariance and distance correlation have been widely adopted in measuring dependence of a pair of random variables or random vectors. If the computation of distance covariance and distance correlation is implemented directly accordingly to its definition then its computational complexity is O($n^2$) which is a disadvantage compared to other faster methods. In this paper we show that the computation of distance covariance and distance correlation of real valued random variables can be implemented by an O(n log n) algorithm and this is comparable to other computationally efficient algorithms. The new formula we derive for an unbiased estimator for squared distance covariance turns out to be a U-statistic. This fact implies some nice asymptotic properties that were derived before via more complex methods. We apply the fast computing algorithm to some synthetic data. Our work will make distance correlation applicable to a much wider class of applications.