Source author record

Xiaorui Sun

Xiaorui Sun appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Data Structures and Algorithms; Machine Learning; math.ST; Statistics Theory; Computational Complexity; Computer Science and Game Theory; Discrete Mathematics; Distributed, Parallel, and Cluster Computing; math.CO; math.GR; math.PR

Catalog footprint

What is connected

11 works

11 topics

4 close collaborators

preprint, 2022, arXiv

Fully Dynamic s-t Edge Connectivity in Subpolynomial Time

We present a deterministic fully dynamic algorithm to answer $c$-edge connectivity queries on pairs of vertices in $n^{o(1)}$ worst case update and query time for any positive integer $c = (\log n)^{o(1)}$ for a graph with $n$ vertices. Previously, only polylogarithmic and $O(\sqrt{n})$ worst case update time fully dynamic algorithms were known for answering $1$, $2$ and $3$-edge connectivity queries respectively [Henzinger and King 1995, Frederickson 1997, Galil and Italiano 1991]. Our result extends the $c$-edge connectivity vertex sparsifier [Chalermsook et al. 2021] to a multi-level sparsification framework. As our main technical contribution, we present a novel update algorithm for the multi-level $c$-edge connectivity vertex sparsifier with subpolynomial update time.
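For readers unfamiliar with the query itself, the following is a minimal static sketch of what a $c$-edge connectivity query asks: whether two vertices remain connected after removing any $c-1$ edges. It is not the dynamic algorithm of the abstract; it simply recomputes the answer from scratch using unit-capacity augmenting paths, and the function name `edge_connectivity_at_least` and the edge-list input format are assumptions made for illustration.

```python
from collections import defaultdict, deque


def edge_connectivity_at_least(edges, s, t, c):
    """Return True if s and t are c-edge-connected in an undirected multigraph.

    Static baseline only (hypothetical helper, not the paper's method):
    finds up to c edge-disjoint s-t paths via BFS augmenting paths.
    """
    # Each undirected edge gives one unit of residual capacity in each direction.
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1
        cap[v][u] += 1

    flow = 0
    while flow < c:
        # Breadth-first search for an s-t path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, residual in cap[u].items():
                if residual > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return False  # fewer than c edge-disjoint paths exist
        # Push one unit of flow back along the path found.
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1
    return True


cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(edge_connectivity_at_least(cycle, 0, 2, 2))
print(edge_connectivity_at_least(cycle, 0, 2, 3))
```

Each query here costs time proportional to $c$ times the number of edges, which is the kind of per-query recomputation a dynamic data structure is meant to avoid.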

preprint, 2022, arXiv

Minor Sparsifiers and the Distributed Laplacian Paradigm

We study distributed algorithms built around minor-based vertex sparsifiers, and give the first algorithm in the CONGEST model for solving linear systems in graph Laplacian matrices to high accuracy. Our Laplacian solver has a round complexity of $O(n^{o(1)}(\sqrt{n}+D))$, and thus almost matches the lower bound of $\widetilde{Ω}(\sqrt{n}+D)$, where $n$ is the number of nodes in the network and $D$ is its diameter. We show that our distributed solver yields new sublinear round algorithms for several cornerstone problems in combinatorial optimization. This is achieved by leveraging the powerful algorithmic framework of Interior Point Methods (IPMs) and the Laplacian paradigm in the context of distributed graph algorithms, which entails numerically solving optimization problems on graphs via a series of Laplacian systems. Problems that benefit from our distributed algorithmic paradigm include exact mincost flow, negative weight shortest paths, maxflow, and bipartite matching on sparse directed graphs. For the maxflow problem, this is the first exact distributed algorithm that applies to directed graphs, while the previous work by [Ghaffari et al. SICOMP'18] considered the approximate setting and works only for undirected graphs. For the mincost flow and the negative weight shortest path problems, our results constitute the first exact distributed algorithms running in a sublinear number of rounds. Given that the hybrid between IPMs and the Laplacian paradigm has proven useful for tackling numerous optimization problems in the centralized setting, we believe that our distributed solver will find future applications.

preprint, 2020, arXiv

Approximating LCS in Linear Time: Beating the $\sqrt{n}$ Barrier

Longest common subsequence (LCS) is one of the most fundamental problems in combinatorial optimization. Apart from its theoretical importance, LCS has numerous applications in bioinformatics, revision control systems, and data comparison programs. Although a simple dynamic program computes LCS in quadratic time, it has recently been proven that the problem admits a conditional lower bound and may not be solvable in truly subquadratic time. In addition, LCS is notoriously hard with respect to approximation algorithms. Apart from a trivial sampling technique that obtains an $n^{x}$-approximate solution in time $O(n^{2-2x})$, nothing else is known for LCS. This is in sharp contrast to its dual problem, edit distance, for which several linear time solutions have been obtained in the past two decades.
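The quadratic-time dynamic program mentioned in the abstract can be sketched as follows. This is the textbook exact algorithm, shown only as a baseline; it is not the approximation algorithm of the paper, and the function name `lcs_length` is an assumption.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Textbook O(len(a) * len(b)) dynamic program, keeping only two rows.
    """
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous prefix of a
    for x in a:
        curr = [0]  # LCS with the empty prefix of b is always 0
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best solution that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last character of a or of b, whichever is better.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


print(lcs_length("ABCBDAB", "BDCABA"))  # prints 4
```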

preprint, 2020, arXiv

Fast Noise Removal for $k$-Means Clustering

This paper considers $k$-means clustering in the presence of noise. It is known that $k$-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called $k$-means clustering with outliers. The goal of $k$-means clustering with outliers is to discard up to a specified number $z$ of points as noise/outliers and then find a $k$-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or inherent loss in the solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any $k$-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from $k$-means with outliers to $k$-means without outliers. Secondly, we show how to construct a coreset of size $O(k \log n)$. When combined with our greedy algorithm, we obtain a scalable, near linear time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.
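To make the objective concrete, here is a minimal sketch of how the $k$-means-with-outliers cost of a fixed set of centers can be evaluated: for given centers, the best points to discard are the $z$ points farthest from their nearest center. This only evaluates the objective; it is not the greedy algorithm or the coreset construction from the abstract, and the function name `kmeans_outlier_cost` and the tuple-based point format are assumptions.

```python
def kmeans_outlier_cost(points, centers, z):
    """Sum of squared distances to the nearest center, ignoring the z worst points.

    Hypothetical helper for illustration: evaluates the objective only.
    """
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    # Cost of each point is its squared distance to the closest center.
    costs = sorted(min(sq_dist(p, c) for c in centers) for p in points)
    # Discard the z most expensive points as outliers.
    kept = costs[: max(len(costs) - z, 0)]
    return sum(kept)


points = [(0, 0), (1, 0), (10, 0), (11, 0), (100, 0)]
centers = [(0.5, 0), (10.5, 0)]
print(kmeans_outlier_cost(points, centers, z=0))  # prints 8011.25
print(kmeans_outlier_cost(points, centers, z=1))  # prints 1.0
```

In this toy instance a single far-away point dominates the cost, which illustrates the sensitivity to noise that the abstract describes.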

preprint, 2020, arXiv

On the Hardness of Massively Parallel Computation

We investigate whether there are inherent limits of parallelization in the (randomized) massively parallel computation (MPC) model by comparing it with the (sequential) RAM model. As our main result, we show the existence of hard functions that are essentially not parallelizable in the MPC model. Based on the widely-used random oracle methodology in cryptography with a cryptographic hash function $h:\{0,1\}^n \rightarrow \{0,1\}^n$ computable in time $t_h$, we show that there exists a function that can be computed in time $O(T\cdot t_h)$ and space $S$ by a RAM algorithm, but any MPC algorithm with local memory size $s < S/c$ for some $c>1$ requires at least $\tilde{Ω}(T)$ rounds to compute the function, even in the average case, for a wide range of parameters $n \leq S \leq T \leq 2^{n^{1/4}}$. Our result is almost optimal in the sense that by taking $T$ to be much larger than $t_h$, e.g., $T$ to be sub-exponential in $t_h$, to compute the function, the round complexity of any MPC algorithm with small local memory size is asymptotically the same (up to a polylogarithmic factor) as the time complexity of the RAM algorithm. Our result is obtained by adapting the so-called compression argument from the data structure lower bounds and cryptography literature to the context of massively parallel computation.

preprint, 2016, arXiv

Structure and automorphisms of primitive coherent configurations

Coherent configurations (CCs) are highly regular colorings of the set of ordered pairs of a "vertex set"; each color represents a "constituent digraph." CCs arise in the study of permutation groups, combinatorial structures such as partially balanced designs, and the analysis of algorithms; their history goes back to Schur in the 1930s. A CC is primitive (PCC) if all its constituent digraphs are connected. We address the problem of classifying PCCs with large automorphism groups. This project was started in Babai's 1981 paper in which he showed that only the trivial PCC admits more than $\exp(\tilde{O}(n^{1/2}))$ automorphisms. (Here, $n$ is the number of vertices and the $\tilde{O}$ hides polylogarithmic factors.) In the present paper we classify all PCCs with more than $\exp(\tilde{O}(n^{1/3}))$ automorphisms, making the first progress on Babai's conjectured classification of all PCCs with more than $\exp(n^ε)$ automorphisms. A corollary to Babai's 1981 result solved a then 100-year-old problem on primitive but not doubly transitive permutation groups, giving an $\exp(\tilde{O}(n^{1/2}))$ bound on their order. In a similar vein, our result implies an $\exp(\tilde{O}(n^{1/3}))$ upper bound on the order of such groups, with known exceptions. This improvement of Babai's result was previously known only through the Classification of Finite Simple Groups (Cameron, 1981), while our proof, like Babai's, is elementary and almost purely combinatorial. Our analysis relies on a new combinatorial structure theory we develop for PCCs. In particular, we demonstrate the presence of "asymptotically uniform clique geometries" on PCCs in a certain range of the parameters.

preprint, 2014, arXiv

Near-Optimal Density Estimation in Near-Linear Time Using Variable-Width Histograms

Let $p$ be an unknown and arbitrary probability distribution over $[0,1)$. We consider the problem of density estimation, in which a learning algorithm is given i.i.d. draws from $p$ and must (with high probability) output a hypothesis distribution that is close to $p$. The main contribution of this paper is a highly efficient density estimation algorithm for learning using a variable-width histogram, i.e., a hypothesis distribution with a piecewise constant probability density function. In more detail, for any $k$ and $ε$, we give an algorithm that makes $\tilde{O}(k/ε^2)$ draws from $p$, runs in $\tilde{O}(k/ε^2)$ time, and outputs a hypothesis distribution $h$ that is piecewise constant with $O(k \log^2(1/ε))$ pieces. With high probability the hypothesis $h$ satisfies $d_{\mathrm{TV}}(p,h) \leq C \cdot \mathrm{opt}_k(p) + ε$, where $d_{\mathrm{TV}}$ denotes the total variation distance (statistical distance), $C$ is a universal constant, and $\mathrm{opt}_k(p)$ is the smallest total variation distance between $p$ and any $k$-piecewise constant distribution. The sample size and running time of our algorithm are optimal up to logarithmic factors. The "approximation factor" $C$ in our result is inherent in the problem, as we prove that no algorithm with sample size bounded in terms of $k$ and $ε$ can achieve $C<2$ regardless of what kind of hypothesis distribution it uses.
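As a small illustration of the quantities involved, the sketch below computes the total variation distance $d_{\mathrm{TV}}$ between two variable-width histograms on $[0,1)$, i.e. half the integral of the absolute difference of their piecewise constant densities. It is not the learning algorithm of the abstract. The representation (a sorted list of breakpoints from 0.0 to 1.0 plus one density value per piece) and the name `tv_distance` are assumptions made for this example.

```python
def tv_distance(breaks_p, heights_p, breaks_q, heights_q):
    """Total variation distance between two piecewise constant densities on [0, 1).

    Assumed representation: breaks is a sorted list starting at 0.0 and ending
    at 1.0, and heights[i] is the density on [breaks[i], breaks[i + 1]).
    """
    # On each interval of the common refinement both densities are constant.
    cuts = sorted(set(breaks_p) | set(breaks_q))
    total = 0.0
    i = j = 0
    for left, right in zip(cuts, cuts[1:]):
        # Advance to the piece of each histogram that contains `left`.
        while breaks_p[i + 1] <= left:
            i += 1
        while breaks_q[j + 1] <= left:
            j += 1
        total += abs(heights_p[i] - heights_q[j]) * (right - left)
    return total / 2


uniform = ([0.0, 1.0], [1.0])
left_half = ([0.0, 0.5, 1.0], [2.0, 0.0])
print(tv_distance(*uniform, *left_half))  # prints 0.5
```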

preprint, 2013, arXiv

Efficient Density Estimation via Piecewise Polynomial Approximation

We give a highly efficient "semi-agnostic" algorithm for learning univariate probability distributions that are well approximated by piecewise polynomial density functions. Let $p$ be an arbitrary distribution over an interval $I$ which is $τ$-close (in total variation distance) to an unknown probability distribution $q$ that is defined by an unknown partition of $I$ into $t$ intervals and $t$ unknown degree-$d$ polynomials specifying $q$ over each of the intervals. We give an algorithm that draws $\tilde{O}(t(d+1)/ε^2)$ samples from $p$, runs in time $\mathrm{poly}(t,d,1/ε)$, and with high probability outputs a piecewise polynomial hypothesis distribution $h$ that is $(O(τ)+ε)$-close (in total variation distance) to $p$. This sample complexity is essentially optimal; we show that even for $τ=0$, any algorithm that learns an unknown $t$-piecewise degree-$d$ probability distribution over $I$ to accuracy $ε$ must use $Ω\left(\frac{t(d+1)}{\mathrm{poly}(1 + \log(d+1))} \cdot \frac{1}{ε^2}\right)$ samples from the distribution, regardless of its running time. Our algorithm combines tools from approximation theory, uniform convergence, linear programming, and dynamic programming. We apply this general algorithm to obtain a wide range of results for many natural problems in density estimation over both continuous and discrete domains. These include state-of-the-art results for learning mixtures of log-concave distributions; mixtures of $t$-modal distributions; mixtures of Monotone Hazard Rate distributions; mixtures of Poisson Binomial Distributions; mixtures of Gaussians; and mixtures of $k$-monotone densities. Our general technique yields computationally efficient algorithms for all these problems, in many cases with provably optimal sample complexities (up to logarithmic factors) in all parameters.

preprint, 2012, arXiv

Learning mixtures of structured distributions over discrete domains

Let $\mathfrak{C}$ be a class of probability distributions over the discrete domain $[n] = \{1,...,n\}.$ We show that if $\mathfrak{C}$ satisfies a rather general condition -- essentially, that each distribution in $\mathfrak{C}$ can be well-approximated by a variable-width histogram with few bins -- then there is a highly efficient (both in terms of running time and sample complexity) algorithm that can learn any mixture of $k$ unknown distributions from $\mathfrak{C}.$ We analyze several natural types of distributions over $[n]$, including log-concave, monotone hazard rate and unimodal distributions, and show that they have the required structural property of being well-approximated by a histogram with few bins. Applying our general algorithm, we obtain near-optimally efficient algorithms for all these mixture learning problems.

preprint, 2011, arXiv

Information Dissemination via Random Walks in d-Dimensional Space

We study a natural information dissemination problem for multiple mobile agents in a bounded Euclidean space. Agents are placed uniformly at random in the $d$-dimensional space $\{-n, ..., n\}^d$ at time zero, and one of the agents holds a piece of information to be disseminated. All the agents then perform independent random walks over the space, and the information is transmitted from one agent to another if the two agents are sufficiently close. We wish to bound the total time before all agents receive the information (with high probability). Our work extends Pettarin et al.'s work (Infectious random walks, arXiv:1007.1604v2, 2011), which solved the problem for $d \leq 2$. We present tight bounds up to polylogarithmic factors for the case $d = 3$. (While our results extend to higher dimensions, for space and readability considerations we provide only the case $d=3$ here.) Our results show the behavior when $d \geq 3$ is qualitatively different from the case $d \leq 2$. In particular, as the ratio between the volume of the space and the number of agents varies, we show an interesting phase transition for three dimensions that does not occur in one or two dimensions.
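The process described above can be explored with a small simulation. The sketch below is illustrative only and fixes several details the abstract leaves open, all of which are assumptions: closeness is measured in Manhattan distance against a `radius` parameter, information travels one hop per time step, each agent moves one unit along a random axis per step, and walks are clamped at the boundary of $\{-n, ..., n\}^d$. The function name `simulate_dissemination` is likewise hypothetical.

```python
import random


def simulate_dissemination(n, d, num_agents, radius, max_steps, seed=0):
    """Return the first time step at which every agent is informed, or None.

    Toy model under assumed rules (see the text above); agent 0 starts informed.
    """
    rng = random.Random(seed)
    # Place agents uniformly at random in {-n, ..., n}^d.
    pos = [[rng.randint(-n, n) for _ in range(d)] for _ in range(num_agents)]
    informed = [i == 0 for i in range(num_agents)]

    for step in range(max_steps + 1):
        # Transmission: an uninformed agent learns from any nearby informed agent.
        sources = [p for p, flag in zip(pos, informed) if flag]
        for i, p in enumerate(pos):
            if not informed[i] and any(
                sum(abs(a - b) for a, b in zip(p, q)) <= radius for q in sources
            ):
                informed[i] = True
        if all(informed):
            return step
        # Movement: each agent takes one independent random-walk step.
        for p in pos:
            axis = rng.randrange(d)
            p[axis] = min(n, max(-n, p[axis] + rng.choice((-1, 1))))
    return None


print(simulate_dissemination(n=4, d=3, num_agents=20, radius=1, max_steps=5000))
```

Varying `num_agents` relative to the volume $(2n+1)^d$ is one way to observe how the dissemination time changes with agent density, the ratio the abstract identifies as driving the phase transition.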

preprint, 2011, arXiv

Optimal Pricing in Social Networks with Incomplete Information

In revenue maximization of selling a digital product in a social network, the utility of an agent is often considered to have two parts: a private valuation, and linearly additive influences from other agents. We study the incomplete information case where agents know a common distribution about others' private valuations, and make decisions simultaneously. The "rational behavior" of agents in this case is captured by the well-known Bayesian Nash equilibrium. Two challenging questions arise: how to compute an equilibrium and how to optimize a pricing strategy accordingly to maximize the revenue assuming agents follow the equilibrium? In this paper, we mainly focus on the natural model where the private valuation of each agent is sampled from a uniform distribution, which turns out to be already challenging. Our main result is a polynomial-time algorithm that can exactly compute the equilibrium and the optimal price, when pairwise influences are non-negative. If negative influences are allowed, computing any equilibrium even approximately is PPAD-hard. Our algorithm can also be used to design an FPTAS for optimizing discriminative price profile.