Source author record

Nir Ailon

Nir Ailon appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Machine Learning; Computational Complexity; Data Structures and Algorithms; Computational Geometry; Computer Science and Game Theory; Information Theory; math.IT; Numerical Analysis

Catalog footprint

What is connected

16 works

8 topics

4 close collaborators

preprint, 2016, arXiv

Spatial contrasting for deep unsupervised learning

Convolutional networks have marked their place over the last few years as the best-performing model for various visual tasks. They are, however, most suited for supervised learning from large amounts of labeled data. Previous attempts have been made to use unlabeled data to improve model performance by applying unsupervised techniques. These attempts require different architectures and training methods. In this work we present a novel approach for unsupervised training of convolutional networks that is based on contrasting between spatial regions within images. This criterion can be employed within conventional neural networks and trained using standard techniques such as SGD and back-propagation, thus complementing supervised methods.

preprint, 2015, arXiv

Tighter Fourier Transform Complexity Tradeoffs

The Fourier Transform is one of the most important linear transformations used in science and engineering. Cooley and Tukey's Fast Fourier Transform (FFT) from 1964 is a method for computing this transformation in time $O(n\log n)$. Achieving a matching lower bound in a reasonable computational model is one of the most important open problems in theoretical computer science. In 2014, improving on his previous work, Ailon showed that if an algorithm speeds up the FFT by a factor of $b=b(n)\geq 1$, then it must rely on computing, as an intermediate "bottleneck" step, a linear mapping of the input with condition number $Ω(b(n))$. Our main result shows that a factor $b$ speedup implies existence of not just one but $Ω(n)$ $b$-ill conditioned bottlenecks occurring at $Ω(n)$ different steps, each causing information from independent (orthogonal) components of the input to either overflow or underflow. This provides further evidence that beating FFT is hard. Our result also gives the first quantitative tradeoff between computation speed and information loss in Fourier computation on fixed word size architectures. The main technical result is an entropy analysis of the Fourier transform under transformations of low trace, which is interesting in its own right.

preprint, 2014, arXiv

A tight lower bound instance for k-means++ in constant dimension

The k-means++ seeding algorithm is one of the most popular algorithms for finding the initial $k$ centers when using the k-means heuristic. The algorithm is a simple sampling procedure and can be described as follows: Pick the first center uniformly at random from the given points. For $i > 1$, pick a point to be the $i^{th}$ center with probability proportional to the square of the Euclidean distance of this point to the closest of the $(i-1)$ previously chosen centers. The k-means++ seeding algorithm is not only simple and fast but also gives an $O(\log{k})$ approximation in expectation, as shown by Arthur and Vassilvitskii. There are datasets on which this seeding algorithm gives an approximation factor of $Ω(\log{k})$ in expectation. However, it is not clear from these results whether the algorithm achieves a good approximation factor with reasonably high probability (say $1/\mathrm{poly}(k)$). Brunsch and Röglin gave a dataset where the k-means++ seeding algorithm achieves an $O(\log{k})$ approximation ratio with probability that is exponentially small in $k$. However, this and all other known lower-bound examples are high-dimensional. So, an open problem was to understand the behavior of the algorithm on low-dimensional datasets. In this work, we give a simple two-dimensional dataset on which the seeding algorithm achieves an $O(\log{k})$ approximation ratio with probability exponentially small in $k$. This solves open problems posed by Mahajan et al. and by Brunsch and Röglin.
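The sampling procedure described in this abstract can be sketched in a few lines of Python. This is a minimal illustration of the seeding step only, not code from the paper; the function names `sq_dist` and `kmeanspp_seed`, the tuple representation of points, and the handling of the degenerate all-zero-distance case are assumptions made for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seed(points, k, rng=None):
    """Pick k initial centers from `points` by distance-squared sampling."""
    rng = rng or random.Random()
    # First center: uniform over the given points.
    centers = [rng.choice(points)]
    # d2[j] = squared distance from points[j] to its closest chosen center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Assumption: if every point coincides with a center, fall back
            # to a uniform choice (the abstract does not cover this case).
            new_center = rng.choice(points)
        else:
            # Sample index j with probability d2[j] / sum(d2).
            j = rng.choices(range(len(points)), weights=d2, k=1)[0]
            new_center = points[j]
        centers.append(new_center)
        # Update closest-center distances with the newly added center.
        d2 = [min(old, sq_dist(p, new_center)) for old, p in zip(d2, points)]
    return centers

# Hypothetical two-dimensional input, for illustration only.
example_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
example_centers = kmeanspp_seed(example_points, 3, random.Random(0))
```

Keeping the running minimum `d2` means each new center costs one pass over the points, so seeding takes $O(nk)$ distance evaluations in total.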

preprint, 2014, arXiv

An $n\log n$ Lower Bound for Fourier Transform Computation in the Well Conditioned Model

Obtaining a non-trivial (super-linear) lower bound for computation of the Fourier transform in the linear circuit model has been a long-standing open problem for over 40 years. An early result by Morgenstern from 1973 provides an $Ω(n \log n)$ lower bound for the unnormalized Fourier transform when the constants used in the computation are bounded. The proof uses a potential function related to a determinant. The result does not explain why the normalized Fourier transform (of unit determinant) should be difficult to compute in the same model. Hence, the result is not scale insensitive. More recently, Ailon (2013) showed that if only unitary 2-by-2 gates are used, and additionally no extra memory is allowed, then the normalized Fourier transform requires $Ω(n\log n)$ steps. This rather limited result is also sensitive to scaling, but highlights the complexity inherent in the Fourier transform arising from introducing entropy, unlike, say, the identity matrix (which is as complex as the Fourier transform using Morgenstern's arguments, under proper scaling). In this work we extend the arguments of Ailon (2013). In the first extension, which is also the main contribution, we provide a lower bound for computing any scaling of the Fourier transform. Our restriction is that the composition of all gates up to any point must be a well-conditioned linear transformation. The lower bound is $Ω(R^{-1}n\log n)$, where $R$ is the uniform condition number. Second, we assume extra space is allowed, as long as it contains information of bounded norm at the end of the computation. The main technical contribution is an extension of matrix entropy used in Ailon (2013) for unitary matrices to a potential function computable for any matrix, using Shannon entropy on "quasi-probabilities".

preprint, 2014, arXiv

Bandit Online Optimization Over the Permutahedron

The permutahedron is the convex polytope with vertex set consisting of the vectors $(π(1),\dots, π(n))$ for all permutations (bijections) $π$ over $\{1,\dots, n\}$. We study a bandit game in which, at each step $t$, an adversary chooses a hidden weight vector $s_t$, a player chooses a vertex $π_t$ of the permutahedron and suffers an observed loss of $\sum_{i=1}^n π_t(i) s_t(i)$. A previous algorithm, CombBand of Cesa-Bianchi et al. (2009), guarantees a regret of $O(n\sqrt{T \log n})$ for a time horizon of $T$. Unfortunately, CombBand requires at each step an $n$-by-$n$ matrix permanent approximation to within improved accuracy as $T$ grows, resulting in a total running time that is superlinear in $T$, making it impractical for large time horizons. We provide an algorithm of regret $O(n^{3/2}\sqrt{T})$ with total time complexity $O(n^3T)$. The ideas are a combination of CombBand and a recent algorithm by Ailon (2013) for online optimization over the permutahedron in the full information setting. The technical core is a bound on the variance of the Plackett-Luce noisy sorting process's "pseudo loss". The bound is obtained by establishing positive semi-definiteness of a family of 3-by-3 matrices generated from rational functions of exponentials of 3 parameters.
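To make the loss and the regret benchmark concrete, the following sketch computes the per-step loss of a permutahedron vertex and the best fixed vertex in hindsight. It only illustrates the game's accounting and is not the bandit algorithm from the paper; the helper names `loss`, `best_vertex`, and `regret` are hypothetical, and weight vectors are assumed to be plain lists of real numbers.

```python
def loss(pi, s):
    """Loss of vertex pi = (pi(1), ..., pi(n)) against weight vector s."""
    return sum(p * w for p, w in zip(pi, s))

def best_vertex(s_total):
    """Vertex minimizing the loss against s_total.

    By the rearrangement inequality, the smallest value 1 goes to the
    coordinate with the largest weight, 2 to the next largest, and so on.
    """
    n = len(s_total)
    order = sorted(range(n), key=lambda i: -s_total[i])
    pi = [0] * n
    for value, i in enumerate(order, start=1):
        pi[i] = value
    return pi

def regret(played, weights):
    """Total loss of the played vertices minus that of the best fixed vertex."""
    n = len(weights[0])
    s_total = [sum(s[i] for s in weights) for i in range(n)]
    incurred = sum(loss(pi, s) for pi, s in zip(played, weights))
    return incurred - loss(best_vertex(s_total), s_total)
```

Because the loss is linear in $s_t$, the best fixed vertex depends only on the summed weight vector, which is why a single sort suffices for the benchmark.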

preprint, 2014, arXiv

Reducing Dueling Bandits to Cardinal Bandits

We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form "A is preferred to B" (as opposed to cardinal feedback like "A has value 2.5"), giving it wide applicability in learning from implicit user feedback and revealed and stated preferences. In contrast to existing algorithms for the Dueling Bandits problem, our reductions (named Doubler, MultiSbm and DoubleSbm) provide a generic schema for translating the extensive body of known results about conventional Multi-Armed Bandit algorithms to the Dueling Bandits setting. For Doubler and MultiSbm we prove regret upper bounds in both finite and infinite settings, and conjecture about the performance of DoubleSbm, which empirically outperforms the other two as well as previous algorithms in our experiments. In addition, we provide the first almost optimal regret bound in terms of second order terms, such as the differences between the values of the arms.

preprint, 2013, arXiv

A Lower Bound for Fourier Transform Computation in a Linear Model Over 2x2 Unitary Gates Using Matrix Entropy

Obtaining a non-trivial (super-linear) lower bound for computation of the Fourier transform in the linear circuit model has been a long-standing open problem. All lower bounds so far have made strong restrictions on the computational model. One of the most well-known results, by Morgenstern from 1973, provides an $Ω(n \log n)$ lower bound for the unnormalized FFT when the constants used in the computation are bounded. The proof uses a potential function related to a determinant. The determinant of the unnormalized Fourier transform is $n^{n/2}$, and thus showing that it can grow by at most a constant factor after each step yields the result. This classic result, however, does not explain why the normalized Fourier transform, which has a unit determinant, should take $Ω(n\log n)$ steps to compute. In this work we show that in a layered linear circuit model restricted to unitary $2\times 2$ gates, one obtains an $Ω(n\log n)$ lower bound. The well-known FFT works in this model. The main argument concluded from this work is that a potential function that might eventually help prove the conjectured $Ω(n\log n)$ lower bound for computation of the Fourier transform is not related to matrix determinant, but rather to a notion of matrix entropy.
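The determinant argument summarized above can be checked numerically on a small instance. The sketch below builds the unnormalized DFT matrix, confirms that the magnitude of its determinant is $n^{n/2}$, and evaluates the step count that follows if each step multiplies the determinant's magnitude by at most a factor $c$. The function names and the per-step growth factor `c` are assumptions made for illustration; the abstract only says the growth per step is bounded by a constant.

```python
import cmath
import math

def dft_matrix(n):
    """Unnormalized n-by-n DFT matrix with entries exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n)]
            for j in range(n)]

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) == 0:
            return 0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for j in range(col, n):
                a[r][j] -= factor * a[col][j]
    return result

def min_steps(n, c):
    """Steps needed to reach |det| = n^(n/2) from 1 with growth <= c per step."""
    return (n / 2) * math.log(n) / math.log(c)

n = 8
unnormalized = abs(det(dft_matrix(n)))        # close to 8**4 = 4096
normalized = unnormalized / math.sqrt(n) ** n  # scaling by 1/sqrt(n) gives 1
```

The normalized value of 1 is the point the abstract raises: a determinant-based potential cannot separate the normalized transform from the identity, which motivates the entropy-based potential.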

preprint, 2013, arXiv

Breaking the Small Cluster Barrier of Graph Clustering

This paper investigates graph clustering in the planted cluster model in the presence of small clusters. Traditional results dictate that for an algorithm to provably correctly recover the clusters, all clusters must be sufficiently large (in particular, $\tilde{Ω}(\sqrt{n})$ where $n$ is the number of nodes of the graph). We show that this is not really a restriction: by a more refined analysis of the trace-norm based recovery approach proposed in Jalali et al. (2011) and Chen et al. (2012), we prove that small clusters, under certain mild assumptions, do not hinder recovery of large ones. Based on this result, we further devise an iterative algorithm to recover almost all clusters via a "peeling strategy", i.e., recover large clusters first, leading to a reduced problem, and repeat this procedure. These results are extended to the partial observation setting, in which only a (chosen) part of the graph is observed. The peeling strategy gives rise to an active learning algorithm, in which edges adjacent to smaller clusters are queried more often as large clusters are learned (and removed). From a high level, this paper offers new insights into high-dimensional statistics and learning structured data, by presenting a structured matrix learning problem for which a one-shot convex relaxation approach necessarily fails, but a carefully constructed sequence of convex relaxations does the job.

preprint, 2013, arXiv

Fast and RIP-optimal transforms

We study constructions of $k \times n$ matrices $A$ that both (1) satisfy the restricted isometry property (RIP) at sparsity $s$ with optimal parameters, and (2) are efficient in the sense that only $O(n\log n)$ operations are required to compute $Ax$ given a vector $x$. Our construction is based on repeated application of independent transformations of the form $DH$, where $H$ is a Hadamard or Fourier transform and $D$ is a diagonal matrix with random $\{+1,-1\}$ elements on the diagonal, followed by any $k \times n$ matrix of orthonormal rows (e.g., selection of $k$ coordinates). We provide guarantees (1) and (2) for a larger regime of parameters for which such constructions were previously unknown. Additionally, our construction does not suffer from the extra poly-logarithmic factor multiplying the number of observations $k$ as a function of the sparsity $s$, as present in the currently best known RIP estimates for partial random Fourier matrices and other classes of structured random matrices.
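The shape of this construction can be sketched with a fast Walsh-Hadamard transform. The code below is an illustration under stated assumptions, not the paper's construction with its parameter choices: the number of rounds is left as a free parameter, each round reads $DH$ as "apply $H$, then the random signs", the final $k \times n$ matrix is taken to be a random coordinate selection, and no rescaling of the output is applied. The names `fwht` and `fast_transform` are hypothetical.

```python
import math
import random

def fwht(x):
    """Orthonormal Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in x]

def fast_transform(x, k, rounds, rng):
    """Apply `rounds` independent D*H maps, then keep k random coordinates."""
    n = len(x)
    y = list(x)
    for _ in range(rounds):
        y = fwht(y)                                        # H
        signs = [rng.choice((1.0, -1.0)) for _ in range(n)]
        y = [s * v for s, v in zip(signs, y)]              # D
    coords = rng.sample(range(n), k)                       # k orthonormal rows
    return [y[i] for i in coords]

rng = random.Random(0)
x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = fast_transform(x, 4, 3, rng)
```

Each round costs $O(n \log n)$ operations for the transform plus $O(n)$ for the sign flips, so a constant number of rounds keeps the total at $O(n \log n)$.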

preprint, 2013, arXiv

Online Ranking: Discrete Choice, Spearman Correlation and Other Feedback

Given a set $V$ of $n$ objects, an online ranking system outputs at each time step a full ranking of the set, observes a feedback of some form and suffers a loss. We study the setting in which the (adversarial) feedback is an element in $V$, and the loss is the position (0th, 1st, 2nd...) of the item in the outputted ranking. More generally, we study a setting in which the feedback is a subset $U$ of at most $k$ elements in $V$, and the loss is the sum of the positions of those elements. We present an algorithm of expected regret $O(n^{3/2}\sqrt{Tk})$ over a time horizon of $T$ steps with respect to the best single ranking in hindsight. This improves previous algorithms and analyses either by a factor of $Ω(\sqrt{k})$, by a factor of $Ω(\sqrt{\log n})$, or by improving running time from quadratic to $O(n\log n)$ per round. We also prove a matching lower bound. Our techniques also imply an improved regret bound for online rank aggregation over the Spearman correlation measure, and extend to other more complex ranking loss functions.
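The loss in this setting is simple to state in code. The sketch below computes the sum-of-positions loss for a feedback subset $U$ and the best single ranking in hindsight, which is the benchmark the regret is measured against. It does not implement the paper's online algorithm; the names `ranking_loss` and `best_ranking_in_hindsight` are hypothetical, and rankings are assumed to be Python lists ordered from position 0.

```python
from collections import Counter

def ranking_loss(ranking, feedback):
    """Sum of the 0-based positions of the feedback items in the ranking."""
    position = {item: idx for idx, item in enumerate(ranking)}
    return sum(position[u] for u in feedback)

def best_ranking_in_hindsight(items, feedback_rounds):
    """Single ranking minimizing the total loss over all observed rounds.

    Placing items in decreasing order of how often they appeared in the
    feedback minimizes the total sum of positions.
    """
    counts = Counter(u for feedback in feedback_rounds for u in feedback)
    # sorted() is stable, so ties keep the order given in `items`.
    return sorted(items, key=lambda item: -counts[item])

items = ["a", "b", "c", "d"]
rounds = [{"c"}, {"c", "d"}, {"c"}]
best = best_ranking_in_hindsight(items, rounds)
total_best_loss = sum(ranking_loss(best, u) for u in rounds)
```

The single-item case in the abstract is the special case where each feedback set has exactly one element.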

preprint, 2012, arXiv

A note on: No need to choose: How to get both a PTAS and Sublinear Query Complexity

We revisit various PTASs (Polynomial Time Approximation Schemes) for minimization versions of dense problems, and show that they can be performed with sublinear query complexity. This means that not only do we obtain a $(1+\epsilon)$-approximation to the NP-hard problems in polynomial time, but we also avoid reading the entire input. This setting is particularly advantageous when the price of reading parts of the input is high, as is the case, for example, when humans provide the input. Trading off query complexity with approximation is the raison d'etre of the field of learning theory, and of the ERM (Empirical Risk Minimization) setting in particular. A typical ERM result, however, does not deal with computational complexity. We discuss two particular problems for which (a) it has already been shown that sublinear querying is sufficient for obtaining a $(1+\epsilon)$-approximation using unlimited computational power (an ERM result), and (b) with full access to input, we could get a $(1+\epsilon)$-approximation in polynomial time (a PTAS). Here we show that neither benefit need be sacrificed. We get a PTAS with efficient query complexity.

preprint, 2012, arXiv

Active Learning of Clustering with Side Information Using $\epsilon$-Smooth Relative Regret Approximations

Clustering is considered a non-supervised learning setting, in which the goal is to partition a collection of data points into disjoint clusters. Often a bound $k$ on the number of clusters is given or assumed by the practitioner. Many versions of this problem have been defined, most notably $k$-means and $k$-median. An underlying problem with the unsupervised nature of clustering is that of determining a similarity function. One approach for alleviating this difficulty is known as clustering with side information, alternatively, semi-supervised clustering. Here, the practitioner incorporates side information in the form of "must be clustered" or "must be separated" labels for data point pairs. Each such piece of information comes at a "query cost" (often involving human response solicitation). The collection of labels is then incorporated in the usual clustering algorithm as either strict or soft constraints, possibly adding a pairwise constraint penalty function to the chosen clustering objective. Our work is mostly related to clustering with side information. We ask how to choose the pairs of data points. Our analysis gives rise to a method provably better than simply choosing them uniformly at random. Roughly speaking, we show that the distribution must be biased so that more weight is placed on pairs incident to elements in smaller clusters in some optimal solution. Of course we do not know the optimal solution, hence we don't know the bias. Using the recently introduced method of $\epsilon$-smooth relative regret approximations of Ailon, Begleiter and Ezra, we can show an iterative process that improves both the clustering and the bias in tandem. The process provably converges to the optimal solution faster (in terms of query cost) than an algorithm selecting pairs uniformly.

preprint, 2012, arXiv

Active Learning Using Smooth Relative Regret Approximations with Applications

The disagreement coefficient of Hanneke has become a central data-independent invariant in proving active learning rates. It has been shown in various ways that a concept class with low complexity together with a bound on the disagreement coefficient at an optimal solution allows active learning rates that are superior to passive learning ones. We present a different tool for pool-based active learning which follows from the existence of a certain uniform version of low disagreement coefficient, but is not equivalent to it. In fact, we present two fundamental active learning problems of significant interest for which our approach allows nontrivial active learning bounds. However, any general purpose method relying only on disagreement coefficient bounds fails to guarantee any useful bounds for these problems. The tool we use is based on the learner's ability to compute an estimator of the difference between the loss of any hypothesis and some fixed "pivotal" hypothesis to within an absolute error of at most $\epsilon$ times the

preprint, 2011, arXiv

An Active Learning Algorithm for Ranking from Pairwise Preferences with an Almost Optimal Query Complexity

We study the problem of learning to rank from pairwise preferences, and solve a long-standing open problem that has led to development of many heuristics but no provable results for our particular problem. Given a set $V$ of $n$ elements, we wish to linearly order them given pairwise preference labels. A pairwise preference label is obtained as a response, typically from a human, to the question "which is preferred, $u$ or $v$?" for two elements $u,v\in V$. We assume possible non-transitivity paradoxes which may arise naturally due to human mistakes or irrationality. The goal is to linearly order the elements from the most preferred to the least preferred, while disagreeing with as few pairwise preference labels as possible. Our performance is measured by two parameters: the loss and the query complexity (number of pairwise preference labels we obtain). This is a typical learning problem, with the exception that the space from which the pairwise preferences are drawn is finite, consisting of ${n\choose 2}$ possibilities only. We present an active learning algorithm for this problem, with query bounds significantly beating general (non-active) bounds for the same error guarantee, while almost achieving the information-theoretic lower bound. Our main construct is a decomposition of the input such that (i) each block incurs high loss at optimum, and (ii) the optimal solution respecting the decomposition is not much worse than the true optimum. The decomposition is done by adapting a recent result by Kenyon and Schudy for a related combinatorial optimization problem to the query-efficient setting. We thus settle an open problem posed by learning-to-rank theoreticians and practitioners: What is a provably correct way to sample preference labels? To further show the power and practicality of our solution, we show how to use it in concert with an SVM relaxation.
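The two performance measures named in this abstract, loss and query complexity, can be made concrete on a tiny instance. The sketch below counts how many preference labels a given linear order disagrees with and shows that a non-transitive cycle forces a loss of at least one. It illustrates the objective only, not the paper's active learning algorithm; the name `disagreements` and the encoding of a label as a pair `(u, v)` meaning "u is preferred to v" are assumptions made for the example.

```python
from itertools import permutations
from math import comb

def disagreements(order, prefs):
    """Number of labels (u, v), read as 'u preferred to v', that the order violates.

    `order` lists the elements from most preferred to least preferred.
    """
    position = {x: i for i, x in enumerate(order)}
    return sum(1 for u, v in prefs if position[u] > position[v])

# A non-transitive cycle: a over b, b over c, c over a.
elements = ["a", "b", "c"]
prefs = {("a", "b"), ("b", "c"), ("c", "a")}

# Loss of one candidate order, and the optimum found by brute force.
candidate_loss = disagreements(["a", "b", "c"], prefs)
optimal_loss = min(disagreements(p, prefs) for p in permutations(elements))

# Query complexity here is the number of labels obtained, out of C(n, 2) pairs.
queries_used = len(prefs)
pairs_available = comb(len(elements), 2)
```

Brute force over all $n!$ orders is only feasible for toy inputs, which is one reason the decomposition described in the abstract matters.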

preprint, 2010, arXiv

An Improved Algorithm for Bipartite Correlation Clustering

Bipartite Correlation Clustering is the problem of generating a set of disjoint bi-cliques on a set of nodes while minimizing the symmetric difference to a bipartite input graph. The number or size of the output clusters is not constrained in any way. The best known approximation algorithm for this problem gives a factor of 11. This result and all previous ones involve solving large linear or semi-definite programs which become prohibitive even for modestly sized tasks. In this paper we present an improved factor 4 approximation algorithm for this problem using a simple combinatorial algorithm which does not require solving large convex programs. The analysis extends a method developed by Ailon, Charikar and Newman in 2008, where a randomized pivoting algorithm was analyzed for obtaining a 3-approximation algorithm for Correlation Clustering, which is the same problem on graphs (not bipartite). The analysis for Correlation Clustering there required defining events for structures containing 3 vertices and using the probability of these events to produce a feasible solution to a dual of a certain natural LP bounding the optimal cost. It is tempting here to use sets of 4 vertices, which are the smallest structures for which contradictions arise for Bipartite Correlation Clustering. This simple idea, however, appears to be elusive. We show that, by modifying the LP, we can analyze algorithms which take into consideration subgraph structures of unbounded size. We believe our techniques are interesting in their own right, and may be used for other problems as well.
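The randomized pivoting idea that this analysis builds on can be sketched for ordinary (non-bipartite) correlation clustering: pick a random remaining node as pivot, cluster it with its remaining neighbors, remove them, and repeat. The code below is a sketch of that earlier pivoting scheme only, not the bipartite factor 4 algorithm of this paper; the function names and the encoding of "similar" pairs as a set of `frozenset` objects are assumptions made for the example.

```python
import random
from itertools import combinations

def pivot_clustering(nodes, similar, rng):
    """Cluster nodes by repeated random pivoting.

    `similar` is a set of frozenset({u, v}) pairs marked as belonging together.
    """
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(remaining)
        # The pivot takes every remaining node it is marked similar to.
        cluster = [pivot] + [v for v in remaining
                             if v != pivot and frozenset((pivot, v)) in similar]
        clusters.append(cluster)
        taken = set(cluster)
        remaining = [v for v in remaining if v not in taken]
    return clusters

def disagreement_cost(clusters, similar):
    """Pairs clustered together but not similar, plus similar pairs split apart."""
    label = {v: idx for idx, cluster in enumerate(clusters) for v in cluster}
    cost = 0
    for u, v in combinations(sorted(label), 2):
        together = label[u] == label[v]
        if together != (frozenset((u, v)) in similar):
            cost += 1
    return cost

nodes = [1, 2, 3, 4, 5]
similar = {frozenset(p) for p in [(1, 2), (1, 3), (2, 3), (4, 5)]}
clusters = pivot_clustering(nodes, similar, random.Random(0))
```

On an input that is already a disjoint union of cliques, as in this example, every pivot choice recovers the cliques exactly; the approximation analysis concerns inputs where the similarity labels are inconsistent.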

preprint, 2010, arXiv

Self-Improving Algorithms

We investigate ways in which an algorithm can improve its expected performance by fine-tuning itself automatically with respect to an unknown input distribution D. We assume here that D is of product type. More precisely, suppose that we need to process a sequence I_1, I_2, ... of inputs I = (x_1, x_2, ..., x_n) of some fixed length n, where each x_i is drawn independently from some arbitrary, unknown distribution D_i. The goal is to design an algorithm for these inputs so that eventually the expected running time will be optimal for the input distribution D = D_1 * D_2 * ... * D_n. We give such self-improving algorithms for two problems: (i) sorting a sequence of numbers and (ii) computing the Delaunay triangulation of a planar point set. Both algorithms achieve optimal expected limiting complexity. The algorithms begin with a training phase during which they collect information about the input distribution, followed by a stationary regime in which the algorithms settle to their optimized incarnations.
