Researcher profile

Grigory Yaroslavtsev

Grigory Yaroslavtsev contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
8works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

8 published item(s)

preprint2022arXiv

Objective-Based Hierarchical Clustering of Deep Embedding Vectors

We initiate a comprehensive experimental study of objective-based hierarchical clustering methods on massive datasets consisting of deep embedding vectors from computer vision and NLP applications. This includes a large variety of image embedding (ImageNet, ImageNetV2, NaBirds), word embedding (Twitter, Wikipedia), and sentence embedding (SST-2) vectors from several popular recent models (e.g. ResNet, ResNext, Inception V3, SBERT). Our study includes datasets with up to $4.5$ million entries with embedding dimensions up to $2048$. In order to address the challenge of scaling up hierarchical clustering to such large datasets we propose a new practical hierarchical clustering algorithm B++&C. It gives a 5%/20% improvement on average for the popular Moseley-Wang (MW) / Cohen-Addad et al. (CKMM) objectives (normalized) compared to a wide range of classic methods and recent heuristics. We also introduce a theoretical algorithm B2SAT&C which achieves a $0.74$-approximation for the CKMM objective in polynomial time. This is the first substantial improvement over the trivial $2/3$-approximation achieved by a random binary tree. Prior to this work, the best poly-time approximation of $\approx 2/3 + 0.0004$ was due to Charikar et al. (SODA'19).

preprint2014arXiv

Going for Speed: Sublinear Algorithms for Dense r-CSPs

We give new sublinear and parallel algorithms for the extensively studied problem of approximating n-variable r-CSPs (constraint satisfaction problems with constraints of arity r up to an additive error. The running time of our algorithms is O(n/ε^2) + 2^O(1/ε^2) for Boolean r-CSPs and O(k^4 n / ε^2) + 2^O(log k / ε^2) for r-CSPs with constraints on variables over an alphabet of size k. For any constant k this gives optimal dependence on n in the running time unconditionally, while the exponent in the dependence on 1/εis polynomially close to the lower bound under the exponential-time hypothesis, which is 2^Ω(ε^(-1/2)). For Max-Cut this gives an exponential improvement in dependence on 1/εcompared to the sublinear algorithms of Goldreich, Goldwasser and Ron (JACM'98) and a linear speedup in n compared to the algorithms of Mathieu and Schudy (SODA'08). For the maximization version of k-Correlation Clustering problem our running time is O(k^4 n / ε^2) + k^O(1/ε^2), improving the previously best n k^{O(1/ε^3 log k/ε) by Guruswami and Giotis (SODA'06).

preprint2014arXiv

Online Algorithms for Machine Minimization

In this paper, we consider the online version of the machine minimization problem (introduced by Chuzhoy et al., FOCS 2004), where the goal is to schedule a set of jobs with release times, deadlines, and processing lengths on a minimum number of identical machines. Since the online problem has strong lower bounds if all the job parameters are arbitrary, we focus on jobs with uniform length. Our main result is a complete resolution of the deterministic complexity of this problem by showing that a competitive ratio of $e$ is achievable and optimal, thereby improving upon existing lower and upper bounds of 2.09 and 5.2 respectively. We also give a constant-competitive online algorithm for the case of uniform deadlines (but arbitrary job lengths); to the best of our knowledge, no such algorithm was known previously. Finally, we consider the complimentary problem of throughput maximization where the goal is to maximize the sum of weights of scheduled jobs on a fixed set of identical machines (introduced by Bar-Noy et al. STOC 1999). We give a randomized online algorithm for this problem with a competitive ratio of e/e-1; previous results achieved this bound only for the case of a single machine or in the limit of an infinite number of machines.

preprint2014arXiv

Parallel Algorithms for Geometric Graph Problems

We give algorithms for geometric graph problems in the modern parallel models inspired by MapReduce. For example, for the Minimum Spanning Tree (MST) problem over a set of points in the two-dimensional space, our algorithm computes a $(1+ε)$-approximate MST. Our algorithms work in a constant number of rounds of communication, while using total space and communication proportional to the size of the data (linear space and near linear time algorithms). In contrast, for general graphs, achieving the same result for MST (or even connectivity) remains a challenging open problem, despite drawing significant attention in recent years. We develop a general algorithmic framework that, besides MST, also applies to Earth-Mover Distance (EMD) and the transportation cost problem. Our algorithmic framework has implications beyond the MapReduce model. For example it yields a new algorithm for computing EMD cost in the plane in near-linear time, $n^{1+o_ε(1)}$. We note that while recently Sharathkumar and Agarwal developed a near-linear time algorithm for $(1+ε)$-approximating EMD, our algorithm is fundamentally different, and, for example, also solves the transportation (cost) problem, raised as an open question in their work. Furthermore, our algorithm immediately gives a $(1+ε)$-approximation algorithm with $n^δ$ space in the streaming-with-sorting model with $1/δ^{O(1)}$ passes. As such, it is tempting to conjecture that the parallel models may also constitute a concrete playground in the quest for efficient algorithms for EMD (and other similar problems) in the vanilla streaming model, a well-known open problem.

preprint2013arXiv

The Round Complexity of Small Set Intersection

The set disjointness problem is one of the most fundamental and well-studied problems in communication complexity. In this problem Alice and Bob hold sets $S, T \subseteq [n]$, respectively, and the goal is to decide if $S \cap T = \emptyset$. Reductions from set disjointness are a canonical way of proving lower bounds in data stream algorithms, data structures, and distributed computation. In these applications, often the set sizes $|S|$ and $|T|$ are bounded by a value $k$ which is much smaller than $n$. This is referred to as small set disjointness. A major restriction in the above applications is the number of rounds that the protocol can make, which, e.g., translates to the number of passes in streaming applications. A fundamental question is thus in understanding the round complexity of the small set disjointness problem. For an essentially equivalent problem, called OR-Equality, Brody et. al showed that with $r$ rounds of communication, the randomized communication complexity is $Ω(k \ilog^r k)$, where$\ilog^r k$ denotes the $r$-th iterated logarithm function. Unfortunately their result requires the error probability of the protocol to be $1/k^{Θ(1)}$. Since naïve amplification of the success probability of a protocol from constant to $1-1/k^{Θ(1)}$ blows up the communication by a $Θ(\log k)$ factor, this destroys their improvements over the well-known lower bound of $Ω(k)$ which holds for any number of rounds. They pose it as an open question to achieve the same $Ω(k \ilog^r k)$ lower bound for protocols with constant error probability. We answer this open question by showing that the $r$-round randomized communication complexity of ${\sf OREQ}_{n,k}$, and thus also of small set disjointness, with {\it constant error probability} is $Ω(k \ilog^r k)$, asymptotically matching known upper bounds for ${\sf OREQ}_{n,k}$ and small set disjointness.

preprint2012arXiv

Accurate and Efficient Private Release of Datacubes and Contingency Tables

A central problem in releasing aggregate information about sensitive data is to do so accurately while providing a privacy guarantee on the output. Recent work focuses on the class of linear queries, which include basic counting queries, data cubes, and contingency tables. The goal is to maximize the utility of their output, while giving a rigorous privacy guarantee. Most results follow a common template: pick a "strategy" set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing a costly search over the space of all possible strategies. In this paper, we propose a new approach that balances accuracy and efficiency: we show how to improve the accuracy of a given query set by answering some strategy queries more accurately than others. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients and more. For the important case of marginal queries we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set at minimal extra cost in terms of time and noise.

preprint2012arXiv

Learning pseudo-Boolean k-DNF and Submodular Functions

We prove that any submodular function f: {0,1}^n -> {0,1,...,k} can be represented as a pseudo-Boolean 2k-DNF formula. Pseudo-Boolean DNFs are a natural generalization of DNF representation for functions with integer range. Each term in such a formula has an associated integral constant. We show that an analog of Hastad's switching lemma holds for pseudo-Boolean k-DNFs if all constants associated with the terms of the formula are bounded. This allows us to generalize Mansour's PAC-learning algorithm for k-DNFs to pseudo-Boolean k-DNFs, and hence gives a PAC-learning algorithm with membership queries under the uniform distribution for submodular functions of the form f:{0,1}^n -> {0,1,...,k}. Our algorithm runs in time polynomial in n, k^{O(k \log k / ε)}, 1/εand log(1/δ) and works even in the agnostic setting. The line of previous work on learning submodular functions [Balcan, Harvey (STOC '11), Gupta, Hardt, Roth, Ullman (STOC '11), Cheraghchi, Klivans, Kothari, Lee (SODA '12)] implies only n^{O(k)} query complexity for learning submodular functions in this setting, for fixed epsilon and delta. Our learning algorithm implies a property tester for submodularity of functions f:{0,1}^n -> {0, ..., k} with query complexity polynomial in n for k=O((\log n/ \loglog n)^{1/2}) and constant proximity parameter ε.

preprint2010arXiv

Steiner Transitive-Closure Spanners of d-Dimensional Posets

Given a directed graph G and an integer k >= 1, a k-transitive-closure-spanner (k-TCspanner) of G is a directed graph H that has (1) the same transitive-closure as G and (2) diameter at most k. In some applications, the shortcut paths added to the graph in order to obtain small diameter can use Steiner vertices, that is, vertices not in the original graph G. The resulting spanner is called a Steiner transitive-closure spanner (Steiner TC-spanner). Motivated by applications to property reconstruction and access control hierarchies, we concentrate on Steiner TC-spanners of directed acyclic graphs or, equivalently, partially ordered sets. In these applications, the goal is to find a sparsest Steiner k-TC-spanner of a poset G for a given k and G. The focus of this paper is the relationship between the dimension of a poset and the size of its sparsest Steiner TCspanner. The dimension of a poset G is the smallest d such that G can be embedded into a d-dimensional directed hypergrid via an order-preserving embedding. We present a nearly tight lower bound on the size of Steiner 2-TC-spanners of d-dimensional directed hypergrids. It implies better lower bounds on the complexity of local reconstructors of monotone functions and functions with low Lipschitz constant. The proof of the lower bound constructs a dual solution to a linear programming relaxation of the Steiner 2-TC-spanner problem. We also show that one can efficiently construct a Steiner 2-TC-spanner, of size matching the lower bound, for any low-dimensional poset. Finally, we present a lower bound on the size of Steiner k-TC-spanners of d-dimensional posets that shows that the best-known construction, due to De Santis et al., cannot be improved significantly.