Source author record

Grigory Yaroslavtsev

Grigory Yaroslavtsev appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Discrete Mathematics Distributed, Parallel, and Cluster Computing Machine Learning Artificial Intelligence Computational Complexity Cryptography and Security cs.CY Databases Social and Information Networks

Catalog footprint

What is connected

12works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Objective-Based Hierarchical Clustering of Deep Embedding Vectors

We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).

preprint2016arXiv

Linear Sketching over $\mathbb F_2$

We initiate a systematic study of linear sketching over $\mathbb F_2$. For a given Boolean function $f \colon \{0,1\}^n \to \{0,1\}$ a randomized $\mathbb F_2$-sketch is a distribution $\mathcal M$ over $d \times n$ matrices with elements over $\mathbb F_2$ such that $\mathcal Mx$ suffices for computing $f(x)$ with high probability. We study a connection between $\mathbb F_2$-sketching and a two-player one-way communication game for the corresponding XOR-function. Our results show that this communication game characterizes $\mathbb F_2$-sketching under the uniform distribution (up to dependence on error). Implications of this result include: 1) a composition theorem for $\mathbb F_2$-sketching complexity of a recursive majority function, 2) a tight relationship between $\mathbb F_2$-sketching complexity and Fourier sparsity, 3) lower bounds for a certain subclass of symmetric functions. We also fully resolve a conjecture of Montanaro and Osborne regarding one-way communication complexity of linear threshold functions by designing an $\mathbb F_2$-sketch of optimal size. Furthermore, we show that (non-uniform) streaming algorithms that have to process random updates over $\mathbb F_2$ can be constructed as $\mathbb F_2$-sketches for the uniform distribution with only a minor loss. In contrast with the previous work of Li, Nguyen and Woodruff (STOC'14) who show an analogous result for linear sketches over integers in the adversarial setting our result doesn't require the stream length to be triply exponential in $n$ and holds for streams of length $\tilde O(n)$ constructed through uniformly random updates. Finally, we state a conjecture that asks whether optimal one-way communication protocols for XOR-functions can be constructed as $\mathbb F_2$-sketches with only a small loss.

preprint2015arXiv

Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite Graphs

We give new rounding schemes for the standard linear programming relaxation of the correlation clustering problem, achieving approximation factors almost matching the integrality gaps: - For complete graphs our appoximation is $2.06 - \varepsilon$ for a fixed constant $\varepsilon$, which almost matches the previously known integrality gap of $2$. - For complete $k$-partite graphs our approximation is $3$. We also show a matching integrality gap. - For complete graphs with edge weights satisfying triangle inequalities and probability constraints, our approximation is $1.5$, and we show an integrality gap of $1.2$. Our results improve a long line of work on approximation algorithms for correlation clustering in complete graphs, previously culminating in a ratio of $2.5$ for the complete case by Ailon, Charikar and Newman (JACM'08). In the weighted complete case satisfying triangle inequalities and probability constraints, the same authors give a $2$-approximation; for the bipartite case, Ailon, Avigdor-Elgrabli, Liberty and van Zuylen give a $4$-approximation (SICOMP'12).

preprint2015arXiv

Privacy for the Protected (Only)

Motivated by tensions between data privacy for individual citizens, and societal priorities such as counterterrorism and the containment of infectious disease, we introduce a computational model that distinguishes between parties for whom privacy is explicitly protected, and those for whom it is not (the targeted subpopulation). The goal is the development of algorithms that can effectively identify and take action upon members of the targeted subpopulation in a way that minimally compromises the privacy of the protected, while simultaneously limiting the expense of distinguishing members of the two groups via costly mechanisms such as surveillance, background checks, or medical testing. Within this framework, we provide provably privacy-preserving algorithms for targeted search in social networks. These algorithms are natural variants of common graph search methods, and ensure privacy for the protected by the careful injection of noise in the prioritization of potential targets. We validate the utility of our algorithms with extensive computational experiments on two large-scale social network datasets.

preprint2015arXiv

Tight Bounds for Linear Sketches of Approximate Matchings

We resolve the space complexity of linear sketches for approximating the maximum matching problem in dynamic graph streams where the stream may include both edge insertion and deletion. Specifically, we show that for any $ε> 0$, there exists a one-pass streaming algorithm, which only maintains a linear sketch of size $\tilde{O}(n^{2-3ε})$ bits and recovers an $n^ε$-approximate maximum matching in dynamic graph streams, where $n$ is the number of vertices in the graph. In contrast to the extensively studied insertion-only model, to the best of our knowledge, no non-trivial single-pass streaming algorithms were previously known for approximating the maximum matching problem on general dynamic graph streams. Furthermore, we show that our upper bound is essentially tight. Namely, any linear sketch for approximating the maximum matching to within a factor of $O(n^ε)$ has to be of size $n^{2-3ε-o(1)}$ bits. We establish this lower bound by analyzing the corresponding simultaneous number-in-hand communication model, with a combinatorial construction based on Ruzsa-Szemerédi graphs.

preprint2014arXiv

Going for Speed: Sublinear Algorithms for Dense r-CSPs

We give new sublinear and parallel algorithms for the extensively studied problem of approximating n-variable r-CSPs (constraint satisfaction problems with constraints of arity r up to an additive error. The running time of our algorithms is O(n/ε^2) + 2^O(1/ε^2) for Boolean r-CSPs and O(k^4 n / ε^2) + 2^O(log k / ε^2) for r-CSPs with constraints on variables over an alphabet of size k. For any constant k this gives optimal dependence on n in the running time unconditionally, while the exponent in the dependence on 1/εis polynomially close to the lower bound under the exponential-time hypothesis, which is 2^Ω(ε^(-1/2)). For Max-Cut this gives an exponential improvement in dependence on 1/εcompared to the sublinear algorithms of Goldreich, Goldwasser and Ron (JACM'98) and a linear speedup in n compared to the algorithms of Mathieu and Schudy (SODA'08). For the maximization version of k-Correlation Clustering problem our running time is O(k^4 n / ε^2) + k^O(1/ε^2), improving the previously best n k^{O(1/ε^3 log k/ε) by Guruswami and Giotis (SODA'06).

preprint2014arXiv

Online Algorithms for Machine Minimization

In this paper, we consider the online version of the machine minimization problem (introduced by Chuzhoy et al., FOCS 2004), where the goal is to schedule a set of jobs with release times, deadlines, and processing lengths on a minimum number of identical machines. Since the online problem has strong lower bounds if all the job parameters are arbitrary, we focus on jobs with uniform length. Our main result is a complete resolution of the deterministic complexity of this problem by showing that a competitive ratio of $e$ is achievable and optimal, thereby improving upon existing lower and upper bounds of 2.09 and 5.2 respectively. We also give a constant-competitive online algorithm for the case of uniform deadlines (but arbitrary job lengths); to the best of our knowledge, no such algorithm was known previously. Finally, we consider the complimentary problem of throughput maximization where the goal is to maximize the sum of weights of scheduled jobs on a fixed set of identical machines (introduced by Bar-Noy et al. STOC 1999). We give a randomized online algorithm for this problem with a competitive ratio of e/e-1; previous results achieved this bound only for the case of a single machine or in the limit of an infinite number of machines.

preprint2014arXiv

Parallel Algorithms for Geometric Graph Problems

We give algorithms for geometric graph problems in the modern parallel models inspired by MapReduce. For example, for the Minimum Spanning Tree (MST) problem over a set of points in the two-dimensional space, our algorithm computes a $(1+ε)$-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear space and near linear time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem, despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example it yields a new algorithm for computing EMD cost in the plane in near-linear time, $n^{1+o_ε(1)}$. We note that while recently Sharathkumar and Agarwal developed a near-linear time algorithm for $(1+ε)$-approximating EMD, our algorithm is fundamentally different, and, for example, also solves the transportation (cost) problem, raised as an open question in their work. Furthermore, our algorithm immediately gives a $(1+ε)$-approximation algorithm with $n^δ$ space in the streaming-with-sorting model with $1/δ^{O(1)}$ passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem.

preprint2013arXiv

The Round Complexity of Small Set Intersection

The set disjointness problem is one of the most fundamental and well-studied problems in communication complexity. In this problem Alice and Bob hold sets $S, T \subseteq [n]$, respectively, and the goal is to decide if $S \cap T = \emptyset$. Reductions from set disjointness are a canonical way of proving lower bounds in data stream algorithms, data structures, and distributed computation. In these applications, often the set sizes $|S|$ and $|T|$ are bounded by a value $k$ which is much smaller than $n$. This is referred to as small set disjointness. A major restriction in the above applications is the number of rounds that the protocol can make, which, e.g., translates to the number of passes in streaming applications. A fundamental question is thus in understanding the round complexity of the small set disjointness problem. For an essentially equivalent problem, called OR-Equality, Brody et. al showed that with $r$ rounds of communication, the randomized communication complexity is $Ω(k \ilog^r k)$, where$\ilog^r k$ denotes the $r$-th iterated logarithm function. Unfortunately their result requires the error probability of the protocol to be $1/k^{Θ(1)}$. Since naïve amplification of the success probability of a protocol from constant to $1-1/k^{Θ(1)}$ blows up the communication by a $Θ(\log k)$ factor, this destroys their improvements over the well-known lower bound of $Ω(k)$ which holds for any number of rounds. They pose it as an open question to achieve the same $Ω(k \ilog^r k)$ lower bound for protocols with constant error probability. We answer this open question by showing that the $r$-round randomized communication complexity of ${\sf OREQ}_{n,k}$, and thus also of small set disjointness, with {\it constant error probability} is $Ω(k \ilog^r k)$, asymptotically matching known upper bounds for ${\sf OREQ}_{n,k}$ and small set disjointness.

preprint2012arXiv

Accurate and Efficient Private Release of Datacubes and Contingency Tables

A central problem in releasing aggregate information about sensitive data is to do so accurately while providing a privacy guarantee on the output. Recent work focuses on the class of linear queries, which include basic counting queries, data cubes, and contingency tables. The goal is to maximize the utility of their output, while giving a rigorous privacy guarantee. Most results follow a common template: pick a "strategy" set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing a costly search over the space of all possible strategies. In this paper, we propose a new approach that balances accuracy and efficiency: we show how to improve the accuracy of a given query set by answering some strategy queries more accurately than others. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more. For the important case of marginal queries we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in terms of time and noise.

preprint2012arXiv

Learning pseudo-Boolean k-DNF and Submodular Functions

We prove that any submodular function f: {0,1}^n -> {0,1,...,k} can be represented as a pseudo-Boolean 2k-DNF formula. Pseudo-Boolean DNFs are a natural generalization of DNF representation for functions with integer range. Each term in such a formula has an associated integral constant. We show that an analog of Hastad's switching lemma holds for pseudo-Boolean k-DNFs if all constants associated with the terms of the formula are bounded. This allows us to generalize Mansour's PAC-learning algorithm for k-DNFs to pseudo-Boolean k-DNFs, and hence gives a PAC-learning algorithm with membership queries under the uniform distribution for submodular functions of the form f:{0,1}^n -> {0,1,...,k}. Our algorithm runs in time polynomial in n, k^{O(k \log k / ε)}, 1/εand log(1/δ) and works even in the agnostic setting. The line of previous work on learning submodular functions [Balcan, Harvey (STOC '11), Gupta, Hardt, Roth, Ullman (STOC '11), Cheraghchi, Klivans, Kothari, Lee (SODA '12)] implies only n^{O(k)} query complexity for learning submodular functions in this setting, for fixed epsilon and delta. Our learning algorithm implies a property tester for submodularity of functions f:{0,1}^n -> {0, ..., k} with query complexity polynomial in n for k=O((\log n/ \loglog n)^{1/2}) and constant proximity parameter ε.

preprint2010arXiv

Steiner Transitive-Closure Spanners of d-Dimensional Posets

Given a directed graph G and an integer k >= 1, a k-transitive-closure-spanner (k-TCspanner) of G is a directed graph H that has (1) the same transitive-closure as G and (2) diameter at most k. In some applications, the shortcut paths added to the graph in order to obtain small diameter can use Steiner vertices, that is, vertices not in the original graph G. The resulting spanner is called a Steiner transitive-closure spanner (Steiner TC-spanner). Motivated by applications to property reconstruction and access control hierarchies, we concentrate on Steiner TC-spanners of directed acyclic graphs or, equivalently, partially ordered sets. In these applications, the goal is to find a sparsest Steiner k-TC-spanner of a poset G for a given k and G. The focus of this paper is the relationship between the dimension of a poset and the size of its sparsest Steiner TCspanner. The dimension of a poset G is the smallest d such that G can be embedded into a d-dimensional directed hypergrid via an order-preserving embedding. We present a nearly tight lower bound on the size of Steiner 2-TC-spanners of d-dimensional directed hypergrids. It implies better lower bounds on the complexity of local reconstructors of monotone functions and functions with low Lipschitz constant. The proof of the lower bound constructs a dual solution to a linear programming relaxation of the Steiner 2-TC-spanner problem. We also show that one can efficiently construct a Steiner 2-TC-spanner, of size matching the lower bound, for any low-dimensional poset. Finally, we present a lower bound on the size of Steiner k-TC-spanners of d-dimensional posets that shows that the best-known construction, due to De Santis et al., cannot be improved significantly.

Grigory Yaroslavtsev

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

Objective-Based Hierarchical Clustering of Deep Embedding Vectors

Linear Sketching over $\mathbb F_2$

Near Optimal LP Rounding Algorithm for Correlation Clustering on Complete and Complete k-partite Graphs

Privacy for the Protected (Only)

Tight Bounds for Linear Sketches of Approximate Matchings

Going for Speed: Sublinear Algorithms for Dense r-CSPs

Online Algorithms for Machine Minimization

Parallel Algorithms for Geometric Graph Problems

The Round Complexity of Small Set Intersection

Accurate and Efficient Private Release of Datacubes and Contingency Tables

Learning pseudo-Boolean k-DNF and Submodular Functions

Steiner Transitive-Closure Spanners of d-Dimensional Posets