Source author record

Jeff M. Phillips

Jeff M. Phillips appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Computational Geometry Data Structures and Algorithms Databases Computational Complexity Computer Vision Distributed, Parallel, and Cluster Computing General Literature Numerical Analysis

Catalog footprint

What is connected

29works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation

Squared Wasserstein distance is a frequently used tool to measure discrepancy between probability distributions. This distance is typically computed between empirical measures of size $n$ from two underlying random samples. Unfortunately, even in lower dimensional Euclidean space problems $\left( d \in \{2,3\} \right)$, algorithms for Wasserstein distance computation with approximate or exact precision guarantees scale poorly in the runtime as a function of $n$ and the desired precision. In response, we consider the computational-statistical runtime, where the goal is to estimate from samples the Wasserstein distance between potentially smooth measures up to $ε$-additive error in expectation with respect to the sampling; we allow $O(1)$ computational cost for collecting a sample. Towards this, we develop a Sample-Sketch-Solve paradigm where we introduce a regular cartesian grid sketch of the samples. We show that (especially under $α$-Hölder smooth distributions) this can compress the data without increasing asymptotic error, and also regularizes the structure which enables faster exact algorithms. Ultimately, we approximate $W_2^2(P,Q)$ within $ε$ error in $ε^{-\max(2,\frac{d+1+o(1)}{1+α})}$ time for $0 < α< 1$ Hölder smooth distributions $P,Q$ on $(0,1)^{d}$; an optimal $Θ(ε^{-2})$ for $α> 1/2$ when $d=2$ and nearly optimal as $α\to 1$ when $d = 3$.

preprint2026arXiv

TabKDE: Simple and Scalable Tabular Data Generation with Kernel Density Estimates

Tabular data generation considers a large table with multiple columns -- each column comprised of numerical, categorical, or sometimes ordinal values. The goal is to produce new rows for the table that replicate the distribution of rows from the original data -- without just copying those initial rows. The last 4 years have seen enormous progress on this problem, mostly using computational expensive methods that employ one-hot encoding, VAEs, and diffusion. This paper describes a new approach to the problem of tabular data generation. By employing copula transformations and modeling the distribution as a kernel density estimate we can nearly match the accuracy and leakage-avoidance achievements of the previous methods, but with almost no training time. Our method is very scalable, and can be run on data sets orders of magnitude larger than prior state-of-the-art on a simple laptop. Moreover, because we employ kernel density estimates, we can store the model as a coreset of the original data -- we believe the first for generative modeling -- and as a result, require significantly less space as well. Our code is available here: \url{https://github.com/tabkde/tabkde-main}

preprint2020arXiv

The GaussianSketch for Almost Relative Error Kernel Distance

We introduce two versions of a new sketch for approximately embedding the Gaussian kernel into Euclidean inner product space. These work by truncating infinite expansions of the Gaussian kernel, and carefully invoking the RecursiveTensorSketch [Ahle et al. SODA 2020]. After providing concentration and approximation properties of these sketches, we use them to approximate the kernel distance between points sets. These sketches yield almost $(1+\varepsilon)$-relative error, but with a small additive $α$ term. In the first variants the dependence on $1/α$ is poly-logarithmic, but has higher degree of polynomial dependence on the original dimension $d$. In the second variant, the dependence on $1/α$ is still poly-logarithmic, but the dependence on $d$ is linear.

preprint2016arXiv

$ε$-Kernel Coresets for Stochastic Points

With the dramatic growth in the number of application domains that generate probabilistic, noisy and uncertain data, there has been an increasing interest in designing algorithms for geometric or combinatorial optimization problems over such data. In this paper, we initiate the study of constructing $ε$-kernel coresets for uncertain points. We consider uncertainty in the existential model where each point's location is fixed but only occurs with a certain probability, and the locational model where each point has a probability distribution describing its location. An $ε$-kernel coreset approximates the width of a point set in any direction. We consider approximating the expected width (an \expkernel), as well as the probability distribution on the width (an \probkernel) for any direction. We show that there exists a set of $O(1/ε^{(d-1)/2})$ deterministic points which approximate the expected width under the existential and locational models, and we provide efficient algorithms for constructing such coresets. We show, however, it is not always possible to find a subset of the original uncertain points which provides such an approximation. However, if the existential probability of each point is lower bounded by a constant, an exp-kernel\ (or an fpow-kernel) is still possible. We also construct an quant-kernel coreset in linear time. Finally, combining with known techniques, we show a few applications to approximating the extent of uncertain functions, maintaining extent measures for stochastic moving points and some shape fitting problems under uncertainty.

preprint2016arXiv

Efficient Frequent Directions Algorithm for Sparse Matrices

This paper describes Sparse Frequent Directions, a variant of Frequent Directions for sketching sparse matrices. It resembles the original algorithm in many ways: both receive the rows of an input matrix $A^{n \times d}$ one by one in the streaming setting and compute a small sketch $B \in R^{\ell \times d}$. Both share the same strong (provably optimal) asymptotic guarantees with respect to the space-accuracy tradeoff in the streaming setting. However, unlike Frequent Directions which runs in $O(nd\ell)$ time regardless of the sparsity of the input matrix $A$, Sparse Frequent Directions runs in $\tilde{O} (nnz(A)\ell + n\ell^2)$ time. Our analysis loosens the dependence on computing the Singular Value Decomposition (SVD) as a black box within the Frequent Directions algorithm. Our bounds require recent results on the properties of fast approximate SVD computations. Finally, we empirically demonstrate that these asymptotic improvements are practical and significant on real and synthetic data.

preprint2016arXiv

The Robustness of Estimator Composition

We formalize notions of robustness for composite estimators via the notion of a breakdown point. A composite estimator successively applies two (or more) estimators: on data decomposed into disjoint parts, it applies the first estimator on each part, then the second estimator on the outputs of the first estimator. And so on, if the composition is of more than two estimators. Informally, the breakdown point is the minimum fraction of data points which if significantly modified will also significantly modify the output of the estimator, so it is typically desirable to have a large breakdown point. Our main result shows that, under mild conditions on the individual estimators, the breakdown point of the composite estimator is the product of the breakdown points of the individual estimators. We also demonstrate several scenarios, ranging from regression to statistical testing, where this analysis is easy to apply, useful in understanding worst case robustness, and sheds powerful insights onto the associated data analysis.

preprint2015arXiv

Frequent Directions : Simple and Deterministic Matrix Sketching

We describe a new algorithm called Frequent Directions for deterministic matrix sketching in the row-updates model. The algorithm is presented an arbitrary input matrix $A \in R^{n \times d}$ one row at a time. It performed $O(d \times \ell)$ operations per row and maintains a sketch matrix $B \in R^{\ell \times d}$ such that for any $k < \ell$ $\|A^TA - B^TB \|_2 \leq \|A - A_k\|_F^2 / (\ell-k)$ and $\|A - π_{B_k}(A)\|_F^2 \leq \big(1 + \frac{k}{\ell-k}\big) \|A-A_k\|_F^2 $ . Here, $A_k$ stands for the minimizer of $\|A - A_k\|_F$ over all rank $k$ matrices (similarly $B_k$) and $π_{B_k}(A)$ is the rank $k$ matrix resulting from projecting $A$ on the row span of $B_k$. We show both of these bounds are the best possible for the space allowed. The summary is mergeable, and hence trivially parallelizable. Moreover, Frequent Directions outperforms exemplar implementations of existing streaming algorithms in the space-error tradeoff.

preprint2015arXiv

Geometric Inference on Kernel Density Estimates

We show that geometric inference of a point cloud can be calculated by examining its kernel density estimate with a Gaussian kernel. This allows one to consider kernel density estimates, which are robust to spatial noise, subsampling, and approximate computation in comparison to raw point sets. This is achieved by examining the sublevel sets of the kernel distance, which isomorphically map to superlevel sets of the kernel density estimate. We prove new properties about the kernel distance, demonstrating stability results and allowing it to inherit reconstruction results from recent advances in distance-based topological reconstruction. Moreover, we provide an algorithm to estimate its topology using weighted Vietoris-Rips complexes.

preprint2015arXiv

Improved Practical Matrix Sketching with Guarantees

Matrices have become essential data representations for many large-scale problems in data analytics, and hence matrix sketching is a critical task. Although much research has focused on improving the error/size tradeoff under various sketching paradigms, the many forms of error bounds make these approaches hard to compare in theory and in practice. This paper attempts to categorize and compare most known methods under row-wise streaming updates with provable guarantees, and then to tweak some of these methods to gain practical improvements while retaining guarantees. For instance, we observe that a simple heuristic iSVD, with no guarantees, tends to outperform all known approaches in terms of size/error trade-off. We modify the best performing method with guarantees FrequentDirections under the size/error trade-off to match the performance of iSVD and retain its guarantees. We also demonstrate some adversarial datasets where iSVD performs quite poorly. In comparing techniques in the time/error trade-off, techniques based on hashing or sampling tend to perform better. In this setting we modify the most studied sampling regime to retain error guarantee but obtain dramatic improvements in the time/error trade-off. Finally, we provide easy replication of our studies on APT, a new testbed which makes available not only code and datasets, but also a computing platform with fixed environmental settings.

preprint2015arXiv

Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy

In this paper we prove lower bounds on randomized multiparty communication complexity, both in the \emph{blackboard model} (where each message is written on a blackboard for all players to see) and (mainly) in the \emph{message-passing model}, where messages are sent player-to-player. We introduce a new technique for proving such bounds, called \emph{symmetrization}, which is natural, intuitive, and often easy to use. For example, for the problem where each of $k$ players gets a bit-vector of length $n$, and the goal is to compute the coordinate-wise XOR of these vectors, we prove a tight lower bounds of $Ω(nk)$ in the blackboard model. For the same problem with AND instead of XOR, we prove a lower bounds of roughly $Ω(nk)$ in the message-passing model (assuming $k \le n/3200$) and $Ω(n \log k)$ in the blackboard model. We also prove lower bounds for bit-wise majority, for a graph-connectivity problem, and for other problems; the technique seems applicable to a wide range of other problems as well. All of our lower bounds allow randomized communication protocols with two-sided error. We also use the symmetrization technique to prove several direct-sum-like results for multiparty communication.

preprint2015arXiv

Streaming Kernel Principal Component Analysis

Kernel principal component analysis (KPCA) provides a concise set of basis vectors which capture non-linear structures within large data sets, and is a central tool in data analysis and learning. To allow for non-linear relations, typically a full $n \times n$ kernel matrix is constructed over $n$ data points, but this requires too much space and time for large values of $n$. Techniques such as the Nyström method and random feature maps can help towards this goal, but they do not explicitly maintain the basis vectors in a stream and take more space than desired. We propose a new approach for streaming KPCA which maintains a small set of basis elements in a stream, requiring space only logarithmic in $n$, and also improves the dependence on the error parameter. Our technique combines together random feature maps with recent advances in matrix sketching, it has guaranteed spectral norm error bounds with respect to the original kernel matrix, and it compares favorably in practice to state-of-the-art approaches.

preprint2015arXiv

Subsampling in Smoothed Range Spaces

We consider smoothed versions of geometric range spaces, so an element of the ground set (e.g. a point) can be contained in a range with a non-binary value in $[0,1]$. Similar notions have been considered for kernels; we extend them to more general types of ranges. We then consider approximations of these range spaces through $\varepsilon $-nets and $\varepsilon $-samples (aka $\varepsilon$-approximations). We characterize when size bounds for $\varepsilon $-samples on kernels can be extended to these more general smoothed range spaces. We also describe new generalizations for $\varepsilon $-nets to these range spaces and show when results from binary range spaces can carry over to these smoothed ones.

preprint2014arXiv

Continuous Matrix Approximation on Distributed Data

Tracking and approximating data matrices in streaming fashion is a fundamental challenge. The problem requires more care and attention when data comes from multiple distributed sites, each receiving a stream of data. This paper considers the problem of "tracking approximations to a matrix" in the distributed streaming model. In this model, there are m distributed sites each observing a distinct stream of data (where each element is a row of a distributed matrix) and has a communication channel with a coordinator, and the goal is to track an eps-approximation to the norm of the matrix along any direction. To that end, we present novel algorithms to address the matrix approximation problem. Our algorithms maintain a smaller matrix B, as an approximation to a distributed streaming matrix A, such that for any unit vector x: | ||A x||^2 - ||B x||^2 | <= eps ||A||_F^2. Our algorithms work in streaming fashion and incur small communication, which is critical for distributed computation. Our best method is deterministic and uses only O((m/eps) log(beta N)) communication, where N is the size of stream (at the time of the query) and beta is an upper-bound on the squared norm of any row of the matrix. In addition to proving all algorithmic properties theoretically, extensive experiments with real large datasets demonstrate the efficiency of these protocols.

preprint2013arXiv

Chernoff-Hoeffding Inequality and Applications

When dealing with modern big data sets, a very common theme is reducing the set through a random process. These generally work by making "many simple estimates" of the full data set, and then judging them as a whole. Perhaps magically, these "many simple estimates" can provide a very accurate and small representation of the large data set. The key tool in showing how many of these simple estimates are needed for a fixed accuracy trade-off is the Chernoff-Hoeffding inequality[Che52,Hoe63]. This document provides a simple form of this bound, and two examples of its use.

preprint2013arXiv

Range Counting Coresets for Uncertain Data

We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by just examining this subset. We study three distinct types of queries. RE queries return the expected number of points in a query range. RC queries return the number of points in the range with probability at least a threshold. RQ queries returns the probability that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the threshold is provided as part of the query. And for each type of query we provide coreset constructions with approximation-size tradeoffs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based approaches on axis-aligned range queries.

preprint2013arXiv

Relative Errors for Deterministic Low-Rank Matrix Approximations

We consider processing an n x d matrix A in a stream with row-wise updates according to a recent algorithm called Frequent Directions (Liberty, KDD 2013). This algorithm maintains an l x d matrix Q deterministically, processing each row in O(d l^2) time; the processing time can be decreased to O(d l) with a slight modification in the algorithm and a constant increase in space. We show that if one sets l = k+ k/eps and returns Q_k, a k x d matrix that is the best rank k approximation to Q, then we achieve the following properties: ||A - A_k||_F^2 <= ||A||_F^2 - ||Q_k||_F^2 <= (1+eps) ||A - A_k||_F^2 and where pi_{Q_k}(A) is the projection of A onto the rowspace of Q_k then ||A - pi_{Q_k}(A)||_F^2 <= (1+eps) ||A - A_k||_F^2. We also show that Frequent Directions cannot be adapted to a sparse version in an obvious way that retains the l original rows of the matrix, as opposed to a linear combination or sketch of the rows.

preprint2013arXiv

Rethinking Abstractions for Big Data: Why, Where, How, and What

Big data refers to large and complex data sets that, under existing approaches, exceed the capacity and capability of current compute platforms, systems software, analytical tools and human understanding. Numerous lessons on the scalability of big data can already be found in asymptotic analysis of algorithms and from the high-performance computing (HPC) and applications communities. However, scale is only one aspect of current big data trends; fundamentally, current and emerging problems in big data are a result of unprecedented complexity--in the structure of the data and how to analyze it, in dealing with unreliability and redundancy, in addressing the human factors of comprehending complex data sets, in formulating meaningful analyses, and in managing the dense, power-hungry data centers that house big data. The computer science solution to complexity is finding the right abstractions, those that hide as much triviality as possible while revealing the essence of the problem that is being addressed. The "big data challenge" has disrupted computer science by stressing to the very limits the familiar abstractions which define the relevant subfields in data analysis, data management and the underlying parallel systems. As a result, not enough of these challenges are revealed by isolating abstractions in a traditional software stack or standard algorithmic and analytical techniques, and attempts to address complexity either oversimplify or require low-level management of details. The authors believe that the abstractions for big data need to be rethought, and this reorganization needs to evolve and be sustained through continued cross-disciplinary collaboration.

preprint2012arXiv

Efficient Protocols for Distributed Classification and Optimization

In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daume III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for $\eps$ training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses $O(d^2 \log{1/\eps})$ words of communication to classify distributed data in arbitrary dimension $d$, $\eps$-optimally. This readily extends to classification over $k$ nodes with $O(kd^2 \log{1/\eps})$ words of communication. Our proposed protocol is simple to implement and is considerably more efficient than baselines compared, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.

preprint2012arXiv

epsilon-Samples of Kernels

We study the worst case error of kernel density estimates via subset approximation. A kernel density estimate of a distribution is the convolution of that distribution with a fixed kernel (e.g. Gaussian kernel). Given a subset (i.e. a point set) of the input distribution, we can compare the kernel density estimates of the input distribution with that of the subset and bound the worst case error. If the maximum error is eps, then this subset can be thought of as an eps-sample (aka an eps-approximation) of the range space defined with the input distribution as the ground set and the fixed kernel representing the family of ranges. Interestingly, in this case the ranges are not binary, but have a continuous range (for simplicity we focus on kernels with range of [0,1]); these allow for smoother notions of range spaces. It turns out, the use of this smoother family of range spaces has an added benefit of greatly decreasing the size required for eps-samples. For instance, in the plane the size is O((1/eps^{4/3}) log^{2/3}(1/eps)) for disks (based on VC-dimension arguments) but is only O((1/eps) sqrt{log (1/eps)}) for Gaussian kernels and for kernels with bounded slope that only affect a bounded domain. These bounds are accomplished by studying the discrepancy of these "kernel" range spaces, and here the improvement in bounds are even more pronounced. In the plane, we show the discrepancy is O(sqrt{log n}) for these kernels, whereas for balls there is a lower bound of Omega(n^{1/4}).

preprint2012arXiv

Geometric Computations on Indecisive and Uncertain Points

We study computing geometric problems on uncertain points. An uncertain point is a point that does not have a fixed location, but rather is described by a probability distribution. When these probability distributions are restricted to a finite number of locations, the points are called indecisive points. In particular, we focus on geometric shape-fitting problems and on building compact distributions to describe how the solutions to these problems vary with respect to the uncertainty in the points. Our main results are: (1) a simple and efficient randomized approximation algorithm for calculating the distribution of any statistic on uncertain data sets; (2) a polynomial, deterministic and exact algorithm for computing the distribution of answers for any LP-type problem on an indecisive point set; and (3) the development of shape inclusion probability (SIP) functions which captures the ambient distribution of shapes fit to uncertain or indecisive point sets and are admissible to the two algorithmic constructions.

preprint2012arXiv

Protocols for Learning Classifiers on Distributed Data

We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

preprint2012arXiv

Ranking Large Temporal Data

Ranking temporal data has not been studied until recently, even though ranking is an important operator (being promoted as a firstclass citizen) in database systems. However, only the instant top-k queries on temporal data were studied in, where objects with the k highest scores at a query time instance t are to be retrieved. The instant top-k definition clearly comes with limitations (sensitive to outliers, difficult to choose a meaningful query time t). A more flexible and general ranking operation is to rank objects based on the aggregation of their scores in a query interval, which we dub the aggregate top-k query on temporal data. For example, return the top-10 weather stations having the highest average temperature from 10/01/2010 to 10/07/2010; find the top-20 stocks having the largest total transaction volumes from 02/05/2011 to 02/07/2011. This work presents a comprehensive study to this problem by designing both exact and approximate methods (with approximation quality guarantees). We also provide theoretical analysis on the construction cost, the index size, the update and the query costs of each approach. Extensive experiments on large real datasets clearly demonstrate the efficiency, the effectiveness, and the scalability of our methods compared to the baseline methods.

preprint2011arXiv

A Gentle Introduction to the Kernel Distance

This document reviews the definition of the kernel distance, providing a gentle introduction tailored to a reader with background in theoretical computer science, but limited exposure to technology more common to machine learning, functional analysis and geometric measure theory. The key aspect of the kernel distance developed here is its interpretation as an L_2 distance between probability measures or various shapes (e.g. point sets, curves, surfaces) embedded in a vector space (specifically an RKHS). This structure enables several elegant and efficient solutions to data analysis problems. We conclude with a glimpse into the mathematical underpinnings of this measure, highlighting its recent independent evolution in two separate fields.

preprint2011arXiv

Comparing Distributions and Shapes using the Kernel Distance

Starting with a similarity function between objects, it is possible to define a distance metric on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis, measure theory and geometric measure theory, and have a rich structure that includes an isometric embedding into a (possibly infinite dimensional) Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis. In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: (i) We present fast approximation algorithms for computing the kernel distance between two point sets P and Q that runs in near-linear time in the size of (P cup Q) (note that an explicit calculation would take quadratic time). (ii) We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time O(n + poly(1/epsilon, log n)). (iii) We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of points sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets. These take advantage of the reproducing kernel Hilbert space and a new relation linking binary range spaces to continuous range spaces with bounded fat-shattering dimension.

preprint2011arXiv

Generating a Diverse Set of High-Quality Clusterings

We provide a new framework for generating multiple good quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components, generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.

preprint2011arXiv

Spatially-Aware Comparison and Consensus for Clusterings

This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques.

preprint2010arXiv

A Unified Algorithmic Framework for Multi-Dimensional Scaling

In this paper, we propose a unified algorithmic framework for solving many known variants of \mds. Our algorithm is a simple iterative scheme with guaranteed convergence, and is \emph{modular}; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods, in comparable time. We expect that this framework will be useful for a number of \mds variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower dimensional sphere, preserving geodesic distances. As a compliment to this result, we also extend the Johnson-Lindenstrauss Lemma to this spherical setting, where projecting to a random $O((1/\eps^2) \log n)$-dimensional sphere causes $\eps$-distortion.

preprint2010arXiv

Stability of epsilon-Kernels

Given a set P of n points in |R^d, an eps-kernel K subset P approximates the directional width of P in every direction within a relative (1-eps) factor. In this paper we study the stability of eps-kernels under dynamic insertion and deletion of points to P and by changing the approximation factor eps. In the first case, we say an algorithm for dynamically maintaining a eps-kernel is stable if at most O(1) points change in K as one point is inserted or deleted from P. We describe an algorithm to maintain an eps-kernel of size O(1/eps^{(d-1)/2}) in O(1/eps^{(d-1)/2} + log n) time per update. Not only does our algorithm maintain a stable eps-kernel, its update time is faster than any known algorithm that maintains an eps-kernel of size O(1/eps^{(d-1)/2}). Next, we show that if there is an eps-kernel of P of size k, which may be dramatically less than O(1/eps^{(d-1)/2}), then there is an (eps/2)-kernel of P of size O(min {1/eps^{(d-1)/2}, k^{floor(d/2)} log^{d-2} (1/eps)}). Moreover, there exists a point set P in |R^d and a parameter eps > 0 such that if every eps-kernel of P has size at least k, then any (eps/2)-kernel of P has size Omega(k^{floor(d/2)}).

preprint2005arXiv

The Hunting of the Bump: On Maximizing Statistical Discrepancy

Anomaly detection has important applications in biosurveilance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely, finding clusters in the measured data may not actually represent true anomalies. These clusters may likely be the clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different measured data is to baseline data within a region. An anomalous region is thus defined to be one with high discrepancy. In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the same set of shapes. We derive general discrepancy functions for data generated from a one- parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in $O(\smash[tb]{\frac{1}ε n^2 \log^2 n})$ that computes the maximum discrepancy rectangle to within additive error $ε$, for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time $\smash[t]{O(n^4)}$.

Jeff M. Phillips

What is connected

Connect this record

See the researcher in context

Building this map preview

29 published item(s)

Optimizing Computational-Statistical Runtime for Wasserstein Distance Estimation

TabKDE: Simple and Scalable Tabular Data Generation with Kernel Density Estimates

The GaussianSketch for Almost Relative Error Kernel Distance

$ε$-Kernel Coresets for Stochastic Points

Efficient Frequent Directions Algorithm for Sparse Matrices

The Robustness of Estimator Composition

Frequent Directions : Simple and Deterministic Matrix Sketching

Geometric Inference on Kernel Density Estimates

Improved Practical Matrix Sketching with Guarantees

Lower Bounds for Number-in-Hand Multiparty Communication Complexity, Made Easy

Streaming Kernel Principal Component Analysis

Subsampling in Smoothed Range Spaces

Continuous Matrix Approximation on Distributed Data

Chernoff-Hoeffding Inequality and Applications

Range Counting Coresets for Uncertain Data

Relative Errors for Deterministic Low-Rank Matrix Approximations

Rethinking Abstractions for Big Data: Why, Where, How, and What

Efficient Protocols for Distributed Classification and Optimization

epsilon-Samples of Kernels

Geometric Computations on Indecisive and Uncertain Points

Protocols for Learning Classifiers on Distributed Data

Ranking Large Temporal Data

A Gentle Introduction to the Kernel Distance

Comparing Distributions and Shapes using the Kernel Distance

Generating a Diverse Set of High-Quality Clusterings

Spatially-Aware Comparison and Consensus for Clusterings

A Unified Algorithmic Framework for Multi-Dimensional Scaling

Stability of epsilon-Kernels

The Hunting of the Bump: On Maximizing Statistical Discrepancy