Source author record

Michael Saks

Michael Saks appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Computational Complexity Machine Learning Discrete Mathematics Information Theory math.CO math.IT math.PR math.RA

Catalog footprint

What is connected

14works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2021arXiv

Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time

Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer, and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within an approximation factor $\text{poly}(\log n)$. In this paper, we provide an algorithm with running time $\tilde{O}(n^{2-2/7})$ that approximates the edit distance within a constant factor.

preprint2020arXiv

On the rational relationships among pseudo-roots of a non-commutative polynomial

For a non-commutative ring R, we consider factorizations of polynomials in R[t] where t is a central variable. A pseudo-root of a polynomial p(t) is an element x in R, for which there exist polynomials q(t) and s(t) such that p(t)=q(t)(t-x)s(t). We investigate the rational relationships that hold among the pseudo-roots of p(t) by using the diamond operations for cover graphs of modular lattices.

preprint2016arXiv

Noisy population recovery in polynomial time

In the noisy population recovery problem of Dvir et al., the goal is to learn an unknown distribution $f$ on binary strings of length $n$ from noisy samples. For some parameter $μ\in [0,1]$, a noisy sample is generated by flipping each coordinate of a sample from $f$ independently with probability $(1-μ)/2$. We assume an upper bound $k$ on the size of the support of the distribution, and the goal is to estimate the probability of any string to within some given error $\varepsilon$. It is known that the algorithmic complexity and sample complexity of this problem are polynomially related to each other. We show that for $μ> 0$, the sample complexity (and hence the algorithmic complexity) is bounded by a polynomial in $k$, $n$ and $1/\varepsilon$ improving upon the previous best result of $\mathsf{poly}(k^{\log\log k},n,1/\varepsilon)$ due to Lovett and Zhang. Our proof combines ideas from Lovett and Zhang with a \emph{noise attenuated} version of Möbius inversion. In turn, the latter crucially uses the construction of \emph{robust local inverse} due to Moitra and Saks.

preprint2015arXiv

A communication game related to the sensitivity conjecture

One of the major outstanding foundational problems about boolean functions is the sensitivity conjecture, which (in one of its many forms) asserts that the degree of a boolean function (i.e. the minimum degree of a real polynomial that interpolates the function) is bounded above by some fixed power of its sensitivity (which is the maximum vertex degree of the graph defined on the inputs where two inputs are adjacent if they differ in exactly one coordinate and their function values are different). We propose an attack on the sensitivity conjecture in terms of a novel two-player communication game. A lower bound of the form $n^{Ω(1)}$ on the cost of this game would imply the sensitivity conjecture. To investigate the problem of bounding the cost of the game, three natural (stronger) variants of the question are considered. For two of these variants, protocols are presented that show that the hoped for lower bound does not hold. These protocols satisfy a certain monotonicity property, and (in contrast to the situation for the two variants) we show that the cost of any monotone protocol satisfies a strong lower bound. There is an easy upper bound of $\sqrt{n}$ on the cost of the game. We also improve slightly on this upper bound.

preprint2015arXiv

Efficient indexing of necklaces and irreducible polynomials over finite fields

We study the problem of indexing irreducible polynomials over finite fields, and give the first efficient algorithm for this problem. Specifically, we show the existence of poly(n, log q)-size circuits that compute a bijection between {1, ... , |S|} and the set S of all irreducible, monic, univariate polynomials of degree n over a finite field F_q. This has applications in pseudorandomness, and answers an open question of Alon, Goldreich, Håstad and Peralta[AGHP]. Our approach uses a connection between irreducible polynomials and necklaces ( equivalence classes of strings under cyclic rotation). Along the way, we give the first efficient algorithm for indexing necklaces of a given length over a given alphabet, which may be of independent interest.

preprint2014arXiv

Hellinger volume and number-on-the-forehead communication complexity

Information-theoretic methods have proven to be a very powerful tool in communication complexity, in particular giving an elegant proof of the linear lower bound for the two-party disjointness function, and tight lower bounds on disjointness in the multi-party number-in-the-hand (NIH) model. In this paper, we study the applicability of information theoretic methods to the multi-party number-on-the-forehead model (NOF), where determining the complexity of disjointness remains an important open problem. There are two basic parts to the NIH disjointness lower bound: a direct sum theorem and a lower bound on the one-bit AND function using a beautiful connection between Hellinger distance and protocols revealed by Bar-Yossef, Jayram, Kumar and Sivakumar [BYJKS04]. Inspired by this connection, we introduce the notion of Hellinger volume. We show that it lower bounds the information cost of multi-party NOF protocols and provide a small toolbox that allows one to manipulate several Hellinger volume terms and lower bound a Hellinger volume when the distributions involved satisfy certain conditions. In doing so, we prove a new upper bound on the difference between the arithmetic mean and the geometric mean in terms of relative entropy. We then apply these new tools to obtain a lower bound on the informational complexity of the AND_k function in the NOF setting. Finally, we discuss the difficulties of proving a direct sum theorem for information cost in the NOF model.

preprint2013arXiv

A Polynomial Time Algorithm for Lossy Population Recovery

We give a polynomial time algorithm for the lossy population recovery problem. In this problem, the goal is to approximately learn an unknown distribution on binary strings of length $n$ from lossy samples: for some parameter $μ$ each coordinate of the sample is preserved with probability $μ$ and otherwise is replaced by a `?'. The running time and number of samples needed for our algorithm is polynomial in $n$ and $1/\varepsilon$ for each fixed $μ>0$. This improves on algorithm of Wigderson and Yehudayoff that runs in quasi-polynomial time for any $μ> 0$ and the polynomial time algorithm of Dvir et al which was shown to work for $μ\gtrapprox 0.30$ by Batman et al. In fact, our algorithm also works in the more general framework of Batman et al. in which there is no a priori bound on the size of the support of the distribution. The algorithm we analyze is implicit in previous work; our main contribution is to analyze the algorithm by showing (via linear programming duality and connections to complex analysis) that a certain matrix associated with the problem has a robust local inverse even though its condition number is exponentially small. A corollary of our result is the first polynomial time algorithm for learning DNFs in the restriction access model of Dvir et al.

preprint2013arXiv

Composition limits and separating examples for some Boolean function complexity measures

Block sensitivity ($bs(f)$), certificate complexity ($C(f)$) and fractional certificate complexity ($C^*(f)$) are three fundamental combinatorial measures of complexity of a boolean function $f$. It has long been known that $bs(f) \leq C^{\ast}(f) \leq C(f) =O(bs(f)^2)$. We provide an infinite family of examples for which $C(f)$ grows quadratically in $C^{\ast}(f)$ (and also $bs(f)$) giving optimal separations between these measures. Previously the biggest separation known was $C(f)=C^{\ast}(f)^{\log_{4.5}5}$. We also give a family of examples for which $C^{\ast}(f)=Ω(bs(f)^{3/2})$. These examples are obtained by composing boolean functions in various ways. Here the composition $f \circ g$ of $f$ with $g$ is obtained by substituting for each variable of $f$ a copy of $g$ on disjoint sets of variables. To construct and analyse these examples we systematically investigate the behaviour under function composition of these measures and also the sensitivity measure $s(f)$. The measures $s(f)$, $C(f)$ and $C^{\ast}(f)$ behave nicely under composition: they are submultiplicative (where measure $m$ is submultiplicative if $m(f \circ g) \leq m(f)m(g)$) with equality holding under some fairly general conditions. The measure $bs(f)$ is qualitatively different: it is not submultiplicative. This qualitative difference was not noticed in the previous literature and we correct some errors that appeared in previous papers. We define the composition limit of a measure $m$ at function $f$, $m^{\lim}(f)$ to be the limit as $k$ grows of $m(f^{(k)})^{1/k}$, where $f^{(k)}$ is the iterated composition of $f$ with itself $k$-times. For any function $f$ we show that $bs^{\lim}(f) = (C^*)^{\lim}(f)$ and characterize $s^{\lim}(f), (C^*)^{\lim}(f)$, and $C^{\lim}(f)$ in terms of the largest eigenvalue of a certain set of $2\times 2$ matrices associated with $f$.

preprint2013arXiv

Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

Approximating the length of the longest increasing sequence (LIS) of an array is a well-studied problem. We study this problem in the data stream model, where the algorithm is allowed to make a single left-to-right pass through the array and the key resource to be minimized is the amount of additional memory used. We present an algorithm which, for any $δ> 0$, given streaming access to an array of length $n$ provides a $(1+δ)$-multiplicative approximation to the \emph{distance to monotonicity} ($n$ minus the length of the LIS), and uses only $O((\log^2 n)/δ)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2-factor. Our algorithm can be used to estimate the length of the LIS to within an additive $δn$ for any $δ>0$ while previous algorithms could only achieve additive error $n(1/2-o(1))$. Our algorithm is very simple, being just 3 lines of pseudocode, and has a small update time. It is essentially a polylogarithmic space approximate implementation of a classic dynamic program that computes the LIS. We also give a streaming algorithm for approximating $LCS(x,y)$, the length of the longest common subsequence between strings $x$ and $y$, each of length $n$. Our algorithm works in the asymmetric setting (inspired by \cite{AKO10}), in which we have random access to $y$ and streaming access to $x$, and runs in small space provided that no single symbol appears very often in $y$. More precisely, it gives an additive-$δn$ approximation to $LCS(x,y)$ (and hence also to $E(x,y) = n-LCS(x,y)$, the edit distance between $x$ and $y$ when insertions and deletions, but not substitutions, are allowed), with space complexity $O(k(\log^2 n)/δ)$, where $k$ is the maximum number of times any one symbol appears in $y$.

preprint2012arXiv

A Tail Bound for Read-k Families of Functions

We prove a Chernoff-like large deviation bound on the sum of non-independent random variables that have the following dependence structure. The variables $Y_1,...,Y_r$ are arbitrary Boolean functions of independent random variables $X_1,...,X_m$, modulo a restriction that every $X_i$ influences at most $k$ of the variables $Y_1,...,Y_r$.

preprint2012arXiv

Clustering is difficult only when it does not matter

Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets {\em that can be clustered well}. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners' perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care for data sets that can be clustered well. We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of "good clustering". We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that contrary to popular belief, clustering should not be considered a hard task.

preprint2012arXiv

On Online Labeling with Polynomially Many Labels

In the online labeling problem with parameters n and m we are presented with a sequence of n keys from a totally ordered universe U and must assign each arriving key a label from the label set {1,2,...,m} so that the order of labels (strictly) respects the ordering on U. As new keys arrive it may be necessary to change the labels of some items; such changes may be done at any time at unit cost for each change. The goal is to minimize the total cost. An alternative formulation of this problem is the file maintenance problem, in which the items, instead of being labeled, are maintained in sorted order in an array of length m, and we pay unit cost for moving an item. For the case m=cn for constant c>1, there are known algorithms that use at most O(n log(n)^2) relabelings in total [Itai, Konheim, Rodeh, 1981], and it was shown recently that this is asymptotically optimal [Bulánek, Koucký, Saks, 2012]. For the case of m=Θ(n^C) for C>1, algorithms are known that use O(n log n) relabelings. A matching lower bound was claimed in [Dietz, Seiferas, Zhang, 2004]. That proof involved two distinct steps: a lower bound for a problem they call prefix bucketing and a reduction from prefix bucketing to online labeling. The reduction seems to be incorrect, leaving a (seemingly significant) gap in the proof. In this paper we close the gap by presenting a correct reduction to prefix bucketing. Furthermore we give a simplified and improved analysis of the prefix bucketing lower bound. This improvement allows us to extend the lower bounds for online labeling to the case where the number m of labels is superpolynomial in n. In particular, for superpolynomial m we get an asymptotically optimal lower bound Ω((n log n) / (log log m - log log n)).

preprint2012arXiv

On the practically interesting instances of MAXCUT

The complexity of a computational problem is traditionally quantified based on the hardness of its worst case. This approach has many advantages and has led to a deep and beautiful theory. However, from the practical perspective, this leaves much to be desired. In application areas, practically interesting instances very often occupy just a tiny part of an algorithm's space of instances, and the vast majority of instances are simply irrelevant. Addressing these issues is a major challenge for theoretical computer science which may make theory more relevant to the practice of computer science. Following Bilu and Linial, we apply this perspective to MAXCUT, viewed as a clustering problem. Using a variety of techniques, we investigate practically interesting instances of this problem. Specifically, we show how to solve in polynomial time distinguished, metric, expanding and dense instances of MAXCUT under mild stability assumptions. In particular, $(1+ε)$-stability (which is optimal) suffices for metric and dense MAXCUT. We also show how to solve in polynomial time $Ω(\sqrt{n})$-stable instances of MAXCUT, substantially improving the best previously known result.

preprint2011arXiv

Tight lower bounds for online labeling problem

We consider the file maintenance problem (also called the online labeling problem) in which n integer items from the set {1,...,r} are to be stored in an array of size m >= n. The items are presented sequentially in an arbitrary order, and must be stored in the array in sorted order (but not necessarily in consecutive locations in the array). Each new item must be stored in the array before the next item is received. If r<=m then we can simply store item j in location j but if r>m then we may have to shift the location of stored items to make space for a newly arrived item. The algorithm is charged each time an item is stored in the array, or moved to a new location. The goal is to minimize the total number of such moves done by the algorithm. This problem is non-trivial when n=<m<r. In the case that m=Cn for some C>1, algorithms for this problem with cost O(log(n)^2) per item have been given [IKR81, Wil92, BCD+02]. When m=n, algorithms with cost O(log(n)^3) per item were given [Zha93, BS07]. In this paper we prove lower bounds that show that these algorithms are optimal, up to constant factors. Previously, the only lower bound known for this range of parameters was a lower bound of Ω(log(n)^2) for the restricted class of smooth algorithms [DSZ05a, Zha93]. We also provide an algorithm for the sparse case: If the number of items is polylogarithmic in the array size then the problem can be solved in amortized constant time per item.

Michael Saks

What is connected

Connect this record

See the researcher in context

Building this map preview

14 published item(s)

Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time

On the rational relationships among pseudo-roots of a non-commutative polynomial

Noisy population recovery in polynomial time

A communication game related to the sensitivity conjecture

Efficient indexing of necklaces and irreducible polynomials over finite fields

Hellinger volume and number-on-the-forehead communication complexity

A Polynomial Time Algorithm for Lossy Population Recovery

Composition limits and separating examples for some Boolean function complexity measures

Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

A Tail Bound for Read-k Families of Functions

Clustering is difficult only when it does not matter

On Online Labeling with Polynomially Many Labels

On the practically interesting instances of MAXCUT

Tight lower bounds for online labeling problem