Source author record

Andrew McGregor

Andrew McGregor appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Data Structures and Algorithms, Machine Learning, Information Theory, math.IT, Computational Complexity, Computational Geometry, Cryptography and Security, Databases, Emerging Technologies, Performance, Programming Languages, quant-ph

Catalog footprint

What is connected

16 works

12 topics

4 close collaborators

preprint, 2026, arXiv

Optimizing compilation of error correction codes for 2xN quantum dot arrays and its NP-hardness

The ability to physically move qubits within a register allows the design of hardware-specific error-correction codes, which can achieve fault-tolerance while respecting other constraints. In particular, recent advancements have demonstrated the shuttling of electron and hole spin qubits through a quantum dot array with high fidelity. It is therefore timely to explore error correction architectures consisting merely of two parallel quantum dot arrays, an experimentally validated architecture compatible with classical wiring and control constraints. Upon such an architecture, we develop a suite of heuristic methods for compiling any Calderbank-Shor-Steane (CSS) error-correcting code's syndrome-extraction circuit to run with a reduced number of shuttling operations. We demonstrate how column-regular qLDPC codes can be compiled in a provably minimal number of shuttles that is exactly equal to the column weight of the code when Shor-style syndrome extraction is used. We provide tables stating the number of required shuttles for many contemporary codes of interest. In addition, we provide a proof of the NP-hardness of minimizing the number of shuttle operations for general codes, even when using Shor syndrome extraction. We also discuss how one could get around this by placing blanks in the ancilla array to achieve minimal shuttles with Shor syndrome extraction on any CSS code, at the cost of longer ancilla arrays.

preprint, 2022, arXiv

Estimation of Entropy in Constant Space with Improved Sample Complexity

Recent work of Acharya et al. (NeurIPS 2019) showed how to estimate the entropy of a distribution $\mathcal D$ over an alphabet of size $k$ up to $\pmε$ additive error by streaming over $(k/ε^3) \cdot \text{polylog}(1/ε)$ i.i.d. samples and using only $O(1)$ words of memory. In this work, we give a new constant memory scheme that reduces the sample complexity to $(k/ε^2)\cdot \text{polylog}(1/ε)$. We conjecture that this is optimal up to $\text{polylog}(1/ε)$ factors.
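To make the constant-memory setting concrete, the following is a naive sketch and not the scheme from this paper. It relies on the identity $H(\mathcal D) = \mathbb{E}_{x\sim\mathcal D}[\log(1/p_x)]$: draw one sample $x$, estimate $p_x$ from the next `window` samples, and average $\log(1/\hat p_x)$ over many rounds. The function names, the add-one smoothing, and the parameters `rounds` and `window` are assumptions made for illustration, and no sample-complexity guarantee is claimed for them.

```python
import math
import random
from itertools import islice


def uniform_stream(k, seed=0):
    """Hypothetical i.i.d. sample stream: uniform over {0, ..., k-1}."""
    rng = random.Random(seed)
    while True:
        yield rng.randrange(k)


def constant_memory_entropy(samples, rounds, window):
    """Estimate entropy (in bits) from an iterator of i.i.d. samples.

    Only a handful of scalars are kept at any time: the current symbol,
    a hit counter, and a running sum.
    """
    total = 0.0
    for _ in range(rounds):
        x = next(samples)
        # Count how often x reappears among the next `window` samples.
        hits = sum(1 for y in islice(samples, window) if y == x)
        # Add-one smoothing (an assumption) avoids log(0) when hits == 0.
        p_hat = (hits + 1) / (window + 1)
        total += math.log2(1.0 / p_hat)
    return total / rounds
```

On a uniform distribution over four symbols the true entropy is 2 bits, so `constant_memory_entropy(uniform_stream(4), rounds=2000, window=200)` should land close to 2, with a small downward bias coming from the smoothing.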

preprint, 2022, arXiv

Improved Approximation and Scalability for Fair Max-Min Diversification

Given an $n$-point metric space $(\mathcal{X},d)$ where each point belongs to one of $m=O(1)$ different categories or groups and a set of integers $k_1, \ldots, k_m$, the fair Max-Min diversification problem is to select $k_i$ points belonging to category $i\in [m]$, such that the minimum pairwise distance between selected points is maximized. The problem was introduced by Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample large data sets in various applications so that the derived sample achieves a balance over diversity, i.e., the minimum distance between a pair of selected points, and fairness, i.e., ensuring enough points of each category are included. We prove the following results: 1. We first consider general metric spaces. We present a randomized polynomial time algorithm that returns a factor $2$-approximation to the diversity but only satisfies the fairness constraints in expectation. Building upon this result, we present a $6$-approximation that is guaranteed to satisfy the fairness constraints up to a factor $1-ε$ for any constant $ε$. We also present a linear time algorithm returning an $m+1$ approximation with exact fairness. The best previous result was a $3m-1$ approximation. 2. We then focus on Euclidean metrics. We first show that the problem can be solved exactly in one dimension. For constant dimensions, categories and any constant $ε>0$, we present a $1+ε$ approximation algorithm that runs in $O(nk) + 2^{O(k)}$ time where $k=k_1+\ldots+k_m$. We can improve the running time to $O(nk)+ poly(k)$ at the expense of only picking $(1-ε) k_i$ points from category $i\in [m]$. Finally, we present algorithms suitable to processing massive data sets including single-pass data stream algorithms and composable coresets for the distributed processing.
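The problem definition above can be pinned down with a brute-force reference solver. This is only a sketch for tiny Euclidean instances: it enumerates every selection and so takes exponential time, unlike the approximation algorithms in the abstract. The names `diversity` and `brute_force_fair_max_min` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist


def diversity(points):
    """Minimum pairwise Euclidean distance among the selected points."""
    return min(dist(p, q) for p, q in combinations(points, 2))


def brute_force_fair_max_min(points, labels, quotas):
    """Exhaustively pick exactly quotas[c] points of each category c.

    Returns (best diversity, tuple of chosen indices). Assumes at least
    two points are selected in total. Exponential time: a reference
    solver for checking small examples only.
    """
    k = sum(quotas.values())
    best, best_sel = -1.0, None
    for idx in combinations(range(len(points)), k):
        counts = Counter(labels[i] for i in idx)
        if any(counts.get(c, 0) != q for c, q in quotas.items()):
            continue  # fairness constraint violated
        d = diversity([points[i] for i in idx])
        if d > best:
            best, best_sel = d, idx
    return best, best_sel


# Five points on a line, two categories; pick 2 red and 1 blue.
points = [(0,), (1,), (4,), (5,), (9,)]
labels = ["r", "r", "b", "b", "r"]
print(brute_force_fair_max_min(points, labels, {"r": 2, "b": 1}))
```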

preprint, 2022, arXiv

Non-Adaptive Edge Counting and Sampling via Bipartite Independent Set Queries

We study the problem of estimating the number of edges in an $n$-vertex graph, accessed via the Bipartite Independent Set query model introduced by Beame et al. (ITCS '18). In this model, each query returns a Boolean, indicating the existence of at least one edge between two specified sets of nodes. We present a non-adaptive algorithm that returns a $(1\pm ε)$ relative error approximation to the number of edges, with query complexity $\tilde O(ε^{-5}\log^{5} n )$, where $\tilde O(\cdot)$ hides $\textrm{poly}(\log \log n)$ dependencies. This is the first non-adaptive algorithm in this setting achieving $\textrm{poly}(1/ε,\log n)$ query complexity. Prior work requires $Ω(\log^2 n)$ rounds of adaptivity. We avoid this by taking a fundamentally different approach, inspired by work on single-pass streaming algorithms. Moreover, for constant $ε$, our query complexity significantly improves on the best known adaptive algorithm due to Bhattacharya et al. (STACS '22), which requires $O(ε^{-2} \log^{11} n)$ queries. Building on our edge estimation result, we give the first non-adaptive algorithm for outputting a nearly uniformly sampled edge with query complexity $\tilde O(ε^{-6} \log^{6} n)$, improving on the works of Dell et al. (SODA '20) and Bhattacharya et al. (STACS '22), which require $Ω(\log^3 n)$ rounds of adaptivity. Finally, as a consequence of our edge sampling algorithm, we obtain a $\tilde O(n\log^{8} n)$ query algorithm for connectivity, using two rounds of adaptivity. This improves on a three-round algorithm of Assadi et al. (ESA '21) and is tight; there is no non-adaptive algorithm for connectivity making $o(n^2)$ queries.
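The query model can be illustrated with a small oracle. The sketch below simulates a Bipartite Independent Set query against a known edge list and uses it for a naive exact count that issues one query per vertex pair. That count is non-adaptive but uses $\binom{n}{2}$ queries, far more than the polylogarithmic bounds in the abstract; it only shows what a query returns. Both function names are hypothetical.

```python
from itertools import combinations


def make_bis_oracle(edges):
    """Build a BIS oracle for an undirected graph given as an edge list."""
    edge_set = {frozenset(e) for e in edges}

    def bis(A, B):
        """True iff at least one edge has one endpoint in A and one in B."""
        A, B = set(A), set(B)
        if A & B:
            raise ValueError("query sets must be disjoint")
        return any(frozenset((a, b)) in edge_set for a in A for b in B)

    return bis


def count_edges_naively(n, bis):
    """Exact edge count using one singleton-vs-singleton query per pair."""
    return sum(1 for u, v in combinations(range(n), 2) if bis({u}, {v}))


# Path 0-1-2-3.
bis = make_bis_oracle([(0, 1), (1, 2), (2, 3)])
print(bis({0}, {1}), bis({0}, {2, 3}), count_edges_naively(4, bis))
```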

preprint, 2021, arXiv

Maximum Coverage in the Data Stream Model: Parameterized and Generalized

We present algorithms for the Max-Cover and Max-Unique-Cover problems in the data stream model. The input to both problems consists of $m$ subsets of a universe of size $n$ and a value $k\in [m]$. In Max-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by at least one set is maximized. In Max-Unique-Cover, the problem is to find a collection of at most $k$ sets such that the number of elements covered by exactly one set is maximized. Our goal is to design single-pass algorithms that use space that is sublinear in the input size. Our main algorithmic results are as follows. If the sets have size at most $d$, there exist single-pass algorithms using $\tilde{O}(d^{d+1} k^d)$ space that solve both problems exactly. This is optimal up to polylogarithmic factors for constant $d$. If each element appears in at most $r$ sets, we present single-pass algorithms using $\tilde{O}(k^2 r/ε^3)$ space that return a $1+ε$ approximation in the case of Max-Cover. We also present a single-pass algorithm using slightly more memory, i.e., $\tilde{O}(k^3 r/ε^{4})$ space, that $1+ε$ approximates Max-Unique-Cover. In contrast to the above results, when $d$ and $r$ are arbitrary, any constant-pass $1+ε$ approximation algorithm for either problem requires $Ω(ε^{-2}m)$ space but a single-pass $O(ε^{-2}mk)$ space algorithm exists. In fact, any constant-pass algorithm with an approximation better than $e/(e-1)$ and $e^{1-1/k}$ for Max-Cover and Max-Unique-Cover respectively requires $Ω(m/k^2)$ space when $d$ and $r$ are unrestricted. En route, we also obtain an algorithm for a parameterized version of the streaming Set-Cover problem.
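The two objectives differ only in how an element is counted, which a few lines make explicit. This sketch evaluates both objectives for a given collection of sets; it does not solve either optimization problem, and the function names are hypothetical.

```python
from collections import Counter


def coverage(chosen_sets):
    """Max-Cover objective: elements covered by at least one chosen set."""
    return len(set().union(*chosen_sets))


def unique_coverage(chosen_sets):
    """Max-Unique-Cover objective: elements covered by exactly one chosen set."""
    counts = Counter(x for s in chosen_sets for x in set(s))
    return sum(1 for c in counts.values() if c == 1)


chosen = [{1, 2, 3}, {3, 4}, {4, 5}]
print(coverage(chosen), unique_coverage(chosen))
```

In this example elements 3 and 4 each lie in two chosen sets, so they count toward the Max-Cover objective but not toward the Max-Unique-Cover objective.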

preprint, 2020, arXiv

Algebraic and Analytic Approaches for Parameter Learning in Mixture Models

We present two different approaches for parameter learning in several mixture models in one dimension. Our first approach uses complex-analytic methods and applies to Gaussian mixtures with shared variance, binomial mixtures with shared success probability, and Poisson mixtures, among others. An example result is that $\exp(O(N^{1/3}))$ samples suffice to exactly learn a mixture of $k<N$ Poisson distributions, each with integral rate parameters bounded by $N$. Our second approach uses algebraic and combinatorial tools and applies to binomial mixtures with shared trial parameter $N$ and differing success parameters, as well as to mixtures of geometric distributions. Again, as an example, for binomial mixtures with $k$ components and success parameters discretized to resolution $ε$, $O(k^2(N/ε)^{8/\sqrt{ε}})$ samples suffice to exactly recover the parameters. For some of these distributions, our results represent the first guarantees for parameter estimation.

preprint, 2020, arXiv

Efficient Intervention Design for Causal Discovery with Latents

We consider recovering a causal graph in the presence of latent variables, where we seek to minimize the cost of interventions used in the recovery process. We consider two intervention cost models: (1) a linear cost model where the cost of an intervention on a subset of variables has a linear form, and (2) an identity cost model where the cost of an intervention is the same, regardless of what variables it is on, i.e., the goal is just to minimize the number of interventions. Under the linear cost model, we give an algorithm to identify the ancestral relations of the underlying causal graph, achieving within a $2$-factor of the optimal intervention cost. This approximation factor can be improved to $1+ε$ for any $ε> 0$ under some mild restrictions. Under the identity cost model, we bound the number of interventions needed to recover the entire causal graph, including the latent variables, using a parameterization of the causal graph through a special type of colliders. In particular, we introduce the notion of $p$-colliders, which are colliders between pairs of nodes arising from a specific type of conditioning in the causal graph, and provide an upper bound on the number of interventions as a function of the maximum number of $p$-colliders between any two nodes in the causal graph.

preprint, 2019, arXiv

Mesh: Compacting Memory Management for C/C++ Applications

Programs written in C/C++ can suffer from serious memory fragmentation, leading to low utilization of memory, degraded performance, and application failure due to memory exhaustion. This paper introduces Mesh, a plug-in replacement for malloc that, for the first time, eliminates fragmentation in unmodified C/C++ applications. Mesh combines novel randomized algorithms with widely-supported virtual memory operations to provably reduce fragmentation, breaking the classical Robson bounds with high probability. Mesh generally matches the runtime performance of state-of-the-art memory allocators while reducing memory consumption; in particular, it reduces the memory consumption of Firefox by 16% and Redis by 39%.
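The abstract does not spell out the compaction mechanism. As a rough illustration only, assume each page (span) of equal-sized objects carries an allocation bitmap, and that two pages can be combined onto one physical page when no offset is occupied in both. The toy model below represents bitmaps as Python integers; `can_mesh` and `mesh` are hypothetical names, not functions from the Mesh allocator, and the real system works with virtual memory remapping rather than integers.

```python
def can_mesh(bitmap_a, bitmap_b):
    """Two spans are combinable if no object offset is live in both."""
    return bitmap_a & bitmap_b == 0


def mesh(bitmap_a, bitmap_b):
    """Return the allocation bitmap of the combined span.

    In this toy model bit i set means "the object slot at offset i is live".
    """
    if not can_mesh(bitmap_a, bitmap_b):
        raise ValueError("spans have live objects at the same offset")
    return bitmap_a | bitmap_b


# Two half-empty spans with complementary occupancy fit on one page.
print(bin(mesh(0b1010, 0b0101)))
```

Randomizing where objects land within a span, as the abstract's "randomized algorithms" suggests, makes such complementary pairs likely to exist; the exact probability analysis is in the paper and is not reproduced here.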

preprint, 2015, arXiv

Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution

The degree distribution is one of the most fundamental graph properties of interest for real-world graphs. It has been widely observed in numerous domains that graphs typically have a tailed or scale-free degree distribution. While the average degree is usually quite small, the variance is quite high and there are vertices with degrees at all scales. We focus on the problem of approximating the degree distribution of a large streaming graph, with small storage. We design an algorithm headtail, whose main novelty is a new estimator of infrequent degrees using truncated geometric random variables. We give a mathematical analysis of headtail and show that it has excellent behavior in practice. We can process streams with millions of edges with storage less than 1% and get extremely accurate approximations for all scales in the degree distribution. We also introduce a new notion of Relative Hausdorff distance between tailed histograms. Existing notions of distances between distributions are not suitable, since they ignore infrequent degrees in the tail. The Relative Hausdorff distance measures deviations at all scales, and is a more suitable distance for comparing degree distributions. By tracking this new measure, we are able to give strong empirical evidence of the convergence of headtail.

preprint, 2015, arXiv

Densest Subgraph in Dynamic Graph Streams

In this paper, we consider the problem of approximating the densest subgraph in the dynamic graph stream model. In this model of computation, the input graph is defined by an arbitrary sequence of edge insertions and deletions and the goal is to analyze properties of the resulting graph given memory that is sub-linear in the size of the stream. We present a single-pass algorithm that returns a $(1+ε)$ approximation of the maximum density with high probability; the algorithm uses $O(ε^{-2} n \cdot \text{polylog}(n))$ space, processes each stream update in $\text{polylog}(n)$ time, and uses $\text{poly}(n)$ post-processing time where $n$ is the number of nodes. The space used by our algorithm matches the lower bound of Bahmani et al. (PVLDB 2012) up to a poly-logarithmic factor for constant $ε$. The best existing results for this problem were established recently by Bhattacharya et al. (STOC 2015). They presented a $(2+ε)$ approximation algorithm using similar space and another algorithm that both processed each update and maintained a $(4+ε)$ approximation of the current maximum density in $\text{polylog}(n)$ time per-update.
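For reference, the density of a vertex set $S$ is $|E(S)|/|S|$. The sketch below is the classical offline greedy peeling heuristic (repeatedly delete a minimum-degree vertex and remember the densest intermediate graph), which is known to give a 2-approximation. It is a swapped-in offline baseline that stores the whole graph, not the streaming algorithm described in the abstract. The function name is hypothetical.

```python
def densest_subgraph_peeling(n, edges):
    """Greedy peeling on an undirected simple graph with vertices 0..n-1.

    Returns (density, vertex set) of the densest graph seen while peeling.
    """
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    m = sum(len(nb) for nb in adj.values()) // 2
    alive = set(range(n))
    best_density, best_set = 0.0, set()
    while alive:
        density = m / len(alive)
        if density > best_density:
            best_density, best_set = density, set(alive)
        # Remove a vertex of minimum remaining degree.
        v = min(alive, key=lambda x: len(adj[x]))
        for u in adj[v]:
            adj[u].discard(v)
        m -= len(adj[v])
        adj[v].clear()
        alive.remove(v)
    return best_density, best_set


# A 4-clique on {0, 1, 2, 3} with a pendant vertex 4 attached to 0.
clique = [(a, b) for a in range(4) for b in range(a + 1, 4)]
print(densest_subgraph_peeling(5, clique + [(0, 4)]))
```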

preprint, 2015, arXiv

Kernelization via Sampling with Applications to Dynamic Graph Streams

In this paper we present a simple but powerful subgraph sampling primitive that is applicable in a variety of computational models including dynamic graph streams (where the input graph is defined by a sequence of edge/hyperedge insertions and deletions) and distributed systems such as MapReduce. In the case of dynamic graph streams, we use this primitive to prove the following results. Matching: First, there exists an $\tilde{O}(k^2)$ space algorithm that returns an exact maximum matching on the assumption the cardinality is at most $k$. The best previous algorithm used $\tilde{O}(kn)$ space where $n$ is the number of vertices in the graph and we prove our result is optimal up to logarithmic factors. Our algorithm has $\tilde{O}(1)$ update time. Second, there exists an $\tilde{O}(n^2/α^3)$ space algorithm that returns an $α$-approximation for matchings of arbitrary size. (Assadi et al. (2015) showed that this was optimal and independently and concurrently established the same upper bound.) We generalize both results for weighted matching. Third, there exists an $\tilde{O}(n^{4/5})$ space algorithm that returns a constant approximation in graphs with bounded arboricity. Vertex Cover and Hitting Set: There exists an $\tilde{O}(k^d)$ space algorithm that solves the minimum hitting set problem where $d$ is the cardinality of the input sets and $k$ is an upper bound on the size of the minimum hitting set. We prove this is optimal up to logarithmic factors. Our algorithm has $\tilde{O}(1)$ update time. The case $d=2$ corresponds to minimum vertex cover. Finally, we consider a larger family of parameterized problems (including $b$-matching, disjoint paths, vertex coloring among others) for which our subgraph sampling primitive yields fast, small-space dynamic graph stream algorithms. We then show lower bounds for natural problems outside this family.

preprint, 2015, arXiv

Run Generation Revisited: What Goes Up May or May Not Come Down

In this paper, we revisit the classic problem of run generation. Run generation is the first phase of external-memory sorting, where the objective is to scan through the data, reorder elements using a small buffer of size M, and output runs (contiguously sorted chunks of elements) that are as long as possible. We develop algorithms for minimizing the total number of runs (or equivalently, maximizing the average run length) when the runs are allowed to be sorted or reverse sorted. We study the problem in the online setting, both with and without resource augmentation, and in the offline setting. (1) We analyze alternating-up-down replacement selection (runs alternate between sorted and reverse sorted), which was studied by Knuth as far back as 1963. We show that this simple policy is asymptotically optimal. Specifically, we show that alternating-up-down replacement selection is 2-competitive and no deterministic online algorithm can perform better. (2) We give online algorithms having smaller competitive ratios with resource augmentation. Specifically, we exhibit a deterministic algorithm that, when given a buffer of size 4M, is able to match or beat any optimal algorithm having a buffer of size M. Furthermore, we present a randomized online algorithm which is 7/4-competitive when given a buffer twice that of the optimal. (3) We demonstrate that performance can also be improved with a small amount of foresight. We give an algorithm, which is 3/2-competitive, with foreknowledge of the next 3M elements of the input stream. For the extreme case where all future elements are known, we design a PTAS for computing the optimal strategy a run generation algorithm must follow. (4) Finally, we present algorithms tailored for nearly sorted inputs which are guaranteed to have optimal solutions with sufficiently long runs.
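As background for the run generation problem, here is a sketch of classic replacement selection with a buffer of size M that produces ascending runs only. It is the textbook up-only policy, not the alternating-up-down variant or any of the new algorithms from the abstract. The function name and the sentinel object are illustrative choices.

```python
import heapq

_END = object()  # sentinel marking an exhausted input stream


def replacement_selection(stream, M):
    """Split `stream` into ascending runs using a buffer of M elements."""
    if M < 1:
        raise ValueError("buffer size M must be at least 1")
    it = iter(stream)
    heap = []
    for x in it:  # fill the buffer
        heap.append(x)
        if len(heap) == M:
            break
    heapq.heapify(heap)

    runs, current, frozen = [], [], []
    while heap:
        smallest = heapq.heappop(heap)
        current.append(smallest)
        nxt = next(it, _END)
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)  # can still join the current run
            else:
                frozen.append(nxt)  # must wait for the next run
        if not heap:  # current run is finished
            runs.append(current)
            current = []
            heap, frozen = frozen, []
            heapq.heapify(heap)
    return runs


print(replacement_selection([5, 1, 4, 2, 6, 3], M=2))
# prints [[1, 4, 5, 6], [2, 3]]
```

With a buffer of two elements the first run already has length four, which shows how a buffer of size M can yield runs longer than M.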

preprint, 2015, arXiv

Sketching, Embedding, and Dimensionality Reduction for Information Spaces

Information distances like the Hellinger distance and the Jensen-Shannon divergence have deep roots in information theory and machine learning. They are used extensively in data analysis, especially when the objects being compared are high dimensional empirical probability distributions built from data. However, we lack common tools needed to actually use information distances in applications efficiently and at scale with any kind of provable guarantees. We can't sketch these distances easily, or embed them in better behaved spaces, or even reduce the dimensionality of the space while maintaining the probability structure of the data. In this paper, we build these tools for information distances, both for the Hellinger distance and the Jensen-Shannon divergence and for related measures such as the $χ^2$ divergence. We first show that they can be sketched efficiently (i.e., up to multiplicative error in sublinear space) in the aggregate streaming model. This result is exponentially stronger than known upper bounds for sketching these distances in the strict turnstile streaming model. Second, we show a finite dimensionality embedding result for the Jensen-Shannon and $χ^2$ divergences that preserves pairwise distances. Finally, we prove a dimensionality reduction result for the Hellinger, Jensen-Shannon, and $χ^2$ divergences that preserves the information geometry of the distributions (specifically, by retaining the simplex structure of the space). While our second result above already implies that these divergences can be explicitly embedded in Euclidean space, retaining the simplex structure is important because it allows us to continue doing inference in the reduced space. In essence, we preserve not just the distance structure but the underlying geometry of the space.
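For readers unfamiliar with the two named measures, the sketch below computes them exactly on small discrete distributions. It assumes the common conventions $H(p,q) = \tfrac{1}{\sqrt 2}\sqrt{\sum_i (\sqrt{p_i}-\sqrt{q_i})^2}$ and $\mathrm{JS}(p,q) = \tfrac12 \mathrm{KL}(p\,\|\,m) + \tfrac12 \mathrm{KL}(q\,\|\,m)$ with $m = (p+q)/2$ and logarithms base 2; the paper may normalize differently. This is a direct computation, not the sketching or embedding techniques from the abstract.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range [0, 1])."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)


def _kl_bits(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits (range [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl_bits(p, m) + 0.5 * _kl_bits(q, m)


# Distributions with disjoint support are at maximal distance under both.
print(hellinger([1.0, 0.0], [0.0, 1.0]), jensen_shannon([1.0, 0.0], [0.0, 1.0]))
```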

preprint, 2012, arXiv

Approximate Principal Direction Trees

We introduce a new spatial data structure for high dimensional data called the approximate principal direction tree (APD tree) that adapts to the intrinsic dimension of the data. Our algorithm ensures vector-quantization accuracy similar to that of computationally-expensive PCA trees with similar time-complexity to that of lower-accuracy RP trees. APD trees use a small number of power-method iterations to find splitting planes for recursively partitioning the data. As such they provide a natural trade-off between the running-time and accuracy achieved by RP and PCA trees. Our theoretical results establish a) strong performance guarantees regardless of the convergence rate of the power-method and b) that $O(\log d)$ iterations suffice to establish the guarantee of PCA trees when the intrinsic dimension is $d$. We demonstrate this trade-off and the efficacy of our data structure on both the CPU and GPU.
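The power-method step mentioned above can be sketched in a few lines. The code approximates the top principal direction of a point set by repeatedly applying $X^\top X$ to a random vector, where $X$ is the centered data, without forming the covariance matrix. The function name, the random Gaussian start, and the default of five iterations are assumptions for illustration; how the paper chooses the split point along this direction is not shown.

```python
import math
import random


def principal_direction(points, iterations=5, seed=0):
    """Approximate the top principal direction with a few power iterations."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]

    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        # w = X^T (X v): project every point on v, then recombine.
        proj = [sum(x[j] * v[j] for j in range(d)) for x in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break  # degenerate data or unlucky start; keep the current v
        v = [c / norm for c in w]
    return v


# Points on the line y = x: the direction should be close to (1, 1)/sqrt(2),
# up to sign.
print(principal_direction([(0, 0), (1, 1), (2, 2), (3, 3)]))
```

Fewer iterations cost less time but give a rougher direction, which is the running-time versus accuracy trade-off the abstract describes between RP trees and PCA trees.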

preprint, 2010, arXiv

Information Cost Tradeoffs for Augmented Index and Streaming Language Recognition

This paper makes three main contributions to the theory of communication complexity and stream computation. First, we present new bounds on the information complexity of AUGMENTED-INDEX. In contrast to analogous results for INDEX by Jain, Radhakrishnan and Sen [J. ACM, 2009], we have to overcome the significant technical challenge that protocols for AUGMENTED-INDEX may violate the "rectangle property" due to the inherent input sharing. Second, we use these bounds to resolve an open problem of Magniez, Mathieu and Nayak [STOC, 2010] that asked about the multi-pass complexity of recognizing Dyck languages. This results in a natural separation between the standard multi-pass model and the multi-pass model that permits reverse passes. Third, we present the first passive memory checkers that verify the interaction transcripts of priority queues, stacks, and double-ended queues. We obtain tight upper and lower bounds for these problems, thereby addressing an important sub-class of the memory checking framework of Blum et al. [Algorithmica, 1994].
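To fix what "recognizing Dyck languages" means, the sketch below is the standard single-pass stack recognizer for well-nested strings over two bracket types. Its stack can grow linearly with the input, so it is only the naive baseline; the space and pass trade-offs studied in the paper concern how much better a streaming algorithm can do. The function name and the choice of two bracket types are assumptions.

```python
_CLOSERS = {")": "(", "]": "["}
_OPENERS = set(_CLOSERS.values())


def is_dyck(word):
    """Return True iff `word` is well nested over the brackets () and []."""
    stack = []
    for ch in word:
        if ch in _OPENERS:
            stack.append(ch)
        elif ch in _CLOSERS:
            # A closer must match the most recent unmatched opener.
            if not stack or stack.pop() != _CLOSERS[ch]:
                return False
        else:
            raise ValueError(f"unexpected symbol: {ch!r}")
    return not stack


print(is_dyck("([])[]"), is_dyck("([)]"), is_dyck("("))
```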

preprint, 2010, arXiv

Optimizing Histogram Queries under Differential Privacy

Differential privacy is a robust privacy standard that has been successfully applied to a range of data analysis tasks. Despite much recent work, optimal strategies for answering a collection of correlated queries are not known. We study the problem of devising a set of strategy queries, to be submitted and answered privately, that will support the answers to a given workload of queries. We propose a general framework in which query strategies are formed from linear combinations of counting queries, and we describe an optimal method for deriving new query answers from the answers to the strategy queries. Using this framework we characterize the error of strategies geometrically, and we propose solutions to the problem of finding optimal strategies.
