Source author record

Arturs Backurs

Arturs Backurs appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Computational Complexity quant-ph Computational Geometry Machine Learning Cryptography and Security Computation and Language Computer Science and Game Theory Information Theory math.IT

Catalog footprint

What is connected

15works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Differentially Private Fine-tuning of Language Models

We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $ε= 6.7$. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of $90.2\%$. Our findings are similar for natural language generation tasks. Privately fine-tuning with DART, GPT-2-Small, GPT-2-Medium, GPT-2-Large, and GPT-2-XL achieve BLEU scores of 38.5, 42.0, 43.1, and 43.8 respectively (privacy budget of $ε= 6.8,δ=$ 1e-5) whereas the non-private baseline is $48.1$. All our experiments suggest that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced.

preprint2022arXiv

Differentially Private Model Compression

Recent papers have shown that large pre-trained language models (LLMs) such as BERT, GPT-2 can be fine-tuned on private data to achieve performance comparable to non-private models for many downstream Natural Language Processing (NLP) tasks while simultaneously guaranteeing differential privacy. The inference cost of these models -- which consist of hundreds of millions of parameters -- however, can be prohibitively large. Hence, often in practice, LLMs are compressed before they are deployed in specific applications. In this paper, we initiate the study of differentially private model compression and propose frameworks for achieving 50% sparsity levels while maintaining nearly full performance. We demonstrate these ideas on standard GLUE benchmarks using BERT models, setting benchmarks for future research on this topic.

preprint2020arXiv

Active Local Learning

In this work we consider active local learning: given a query point $x$, and active access to an unlabeled training set $S$, output the prediction $h(x)$ of a near-optimal $h \in H$ using significantly fewer labels than would be needed to actually learn $h$ fully. In particular, the number of label queries should be independent of the complexity of $H$, and the function $h$ should be well-defined, independent of $x$. This immediately also implies an algorithm for distance estimation: estimating the value $opt(H)$ from many fewer labels than needed to actually learn a near-optimal $h \in H$, by running local learning on a few random query points and computing the average error. For the hypothesis class consisting of functions supported on the interval $[0,1]$ with Lipschitz constant bounded by $L$, we present an algorithm that makes $O(({1 / ε^6}) \log(1/ε))$ label queries from an unlabeled pool of $O(({L / ε^4})\log(1/ε))$ samples. It estimates the distance to the best hypothesis in the class to an additive error of $ε$ for an arbitrary underlying distribution. We further generalize our algorithm to more than one dimensions. We emphasize that the number of labels used is independent of the complexity of the hypothesis class which depends on $L$. Furthermore, we give an algorithm to locally estimate the values of a near-optimal function at a few query points of interest with number of labels independent of $L$. We also consider the related problem of approximating the minimum error that can be achieved by the Nadaraya-Watson estimator under a linear diagonal transformation with eigenvalues coming from a small range. For a $d$-dimensional pointset of size $N$, our algorithm achieves an additive approximation of $ε$, makes $\tilde{O}({d}/{ε^2})$ queries and runs in $\tilde{O}({d^2}/{ε^{d+4}}+{dN}/{ε^2})$ time.

preprint2020arXiv

Submodular Clustering in Low Dimensions

We study a clustering problem where the goal is to maximize the coverage of the input points by $k$ chosen centers. Specifically, given a set of $n$ points $P \subseteq \mathbb{R}^d$, the goal is to pick $k$ centers $C \subseteq \mathbb{R}^d$ that maximize the service $ \sum_{p \in P}\mathsfφ\bigl( \mathsf{d}(p,C) \bigr) $ to the points $P$, where $\mathsf{d}(p,C)$ is the distance of $p$ to its nearest center in $C$, and $\mathsfφ$ is a non-increasing service function $\mathsfφ : \mathbb{R}^+ \to \mathbb{R}^+$. This includes problems of placing $k$ base stations as to maximize the total bandwidth to the clients -- indeed, the closer the client is to its nearest base station, the more data it can send/receive, and the target is to place $k$ base stations so that the total bandwidth is maximized. We provide an $n^{\varepsilon^{-O(d)}}$ time algorithm for this problem that achieves a $(1-\varepsilon)$-approximation. Notably, the runtime does not depend on the parameter $k$ and it works for an arbitrary non-increasing service function $\mathsfφ : \mathbb{R}^+ \to \mathbb{R}^+$.

preprint2016arXiv

Improving Viterbi is Hard: Better Runtimes Imply Faster Clique Algorithms

The classic algorithm of Viterbi computes the most likely path in a Hidden Markov Model (HMM) that results in a given sequence of observations. It runs in time $O(Tn^2)$ given a sequence of $T$ observations from a HMM with $n$ states. Despite significant interest in the problem and prolonged effort by different communities, no known algorithm achieves more than a polylogarithmic speedup. In this paper, we explain this difficulty by providing matching conditional lower bounds. We show that the Viterbi algorithm runtime is optimal up to subpolynomial factors even when the number of distinct observations is small. Our lower bounds are based on assumptions that the best known algorithms for the All-Pairs Shortest Paths problem (APSP) and for the Max-Weight $k$-Clique problem in edge-weighted graphs are essentially tight. Finally, using a recent algorithm by Green Larsen and Williams for online Boolean matrix-vector multiplication, we get a $2^{Ω(\sqrt {\log n})}$ speedup for the Viterbi algorithm when there are few distinct transition probabilities in the HMM.

preprint2016arXiv

Tight Hardness Results for Maximum Weight Rectangles

Given $n$ weighted points (positive or negative) in $d$ dimensions, what is the axis-aligned box which maximizes the total weight of the points it contains? The best known algorithm for this problem is based on a reduction to a related problem, the Weighted Depth problem [T. M. Chan, FOCS'13], and runs in time $O(n^d)$. It was conjectured [Barbay et al., CCCG'13] that this runtime is tight up to subpolynomial factors. We answer this conjecture affirmatively by providing a matching conditional lower bound. We also provide conditional lower bounds for the special case when points are arranged in a grid (a well studied problem known as Maximum Subarray problem) as well as for other related problems. All our lower bounds are based on assumptions that the best known algorithms for the All-Pairs Shortest Paths problem (APSP) and for the Max-Weight k-Clique problem in edge-weighted graphs are essentially optimal.

preprint2016arXiv

Which Regular Expression Patterns are Hard to Match?

Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. A classic algorithm for these problems constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an $O(mn)$ running time (where $m$ is the length of the pattern and $n$ is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, word break problem etc. In this paper, we show that the complexity of regular expression matching can be characterized based on its {\em depth} (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., $[a|ab|bc]^*$. This corresponds to the so-called {\em word break} problem, for which a dynamic programming algorithm with a runtime of (roughly) $O(n\sqrt{m})$ is known. We show that the latter bound is not tight and improve the runtime to $O(nm^{0.44\ldots})$.

preprint2015arXiv

If the Current Clique Algorithms are Optimal, so is Valiant's Parser

The CFG recognition problem is: given a context-free grammar $\mathcal{G}$ and a string $w$ of length $n$, decide if $w$ can be obtained from $\mathcal{G}$. This is the most basic parsing question and is a core computer science problem. Valiant's parser from 1975 solves the problem in $O(n^ω)$ time, where $ω<2.373$ is the matrix multiplication exponent. Dozens of parsing algorithms have been proposed over the years, yet Valiant's upper bound remains unbeaten. The best combinatorial algorithms have mildly subcubic $O(n^3/\log^3{n})$ complexity. Lee (JACM'01) provided evidence that fast matrix multiplication is needed for CFG parsing, and that very efficient and practical algorithms might be hard or even impossible to obtain. Lee showed that any algorithm for a more general parsing problem with running time $O(|\mathcal{G}|\cdot n^{3-\varepsilon})$ can be converted into a surprising subcubic algorithm for Boolean Matrix Multiplication. Unfortunately, Lee's hardness result required that the grammar size be $|\mathcal{G}|=Ω(n^6)$. Nothing was known for the more relevant case of constant size grammars. In this work, we prove that any improvement on Valiant's algorithm, even for constant size grammars, either in terms of runtime or by avoiding the inefficiencies of fast matrix multiplication, would imply a breakthrough algorithm for the $k$-Clique problem: given a graph on $n$ nodes, decide if there are $k$ that form a clique. Besides classifying the complexity of a fundamental problem, our reduction has led us to similar lower bounds for more modern and well-studied cubic time problems for which faster algorithms are highly desirable in practice: RNA Folding, a central problem in computational biology, and Dyck Language Edit Distance, answering an open question of Saha (FOCS'14).

preprint2015arXiv

Nearly-optimal bounds for sparse recovery in generic norms, with applications to $k$-median sketching

We initiate the study of trade-offs between sparsity and the number of measurements in sparse recovery schemes for generic norms. Specifically, for a norm $\|\cdot\|$, sparsity parameter $k$, approximation factor $K>0$, and probability of failure $P>0$, we ask: what is the minimal value of $m$ so that there is a distribution over $m \times n$ matrices $A$ with the property that for any $x$, given $Ax$, we can recover a $k$-sparse approximation to $x$ in the given norm with probability at least $1-P$? We give a partial answer to this problem, by showing that for norms that admit efficient linear sketches, the optimal number of measurements $m$ is closely related to the doubling dimension of the metric induced by the norm $\|\cdot\|$ on the set of all $k$-sparse vectors. By applying our result to specific norms, we cast known measurement bounds in our general framework (for the $\ell_p$ norms, $p \in [1,2]$) as well as provide new, measurement-efficient schemes (for the Earth-Mover Distance norm). The latter result directly implies more succinct linear sketches for the well-studied planar $k$-median clustering problem. Finally, our lower bound for the doubling dimension of the EMD norm enables us to address the open question of [Frahling-Sohler, STOC'05] about the space complexity of clustering problems in the dynamic streaming model.

preprint2015arXiv

Quadratic-Time Hardness of LCS and other Sequence Similarity Measures

Two important similarity measures between sequences are the longest common subsequence (LCS) and the dynamic time warping distance (DTWD). The computations of these measures for two given sequences are central tasks in a variety of applications. Simple dynamic programming algorithms solve these tasks in $O(n^2)$ time, and despite an extensive amount of research, no algorithms with significantly better worst case upper bounds are known. In this paper, we show that an $O(n^{2-ε})$ time algorithm, for some $ε>0$, for computing the LCS or the DTWD of two sequences of length $n$ over a constant size alphabet, refutes the popular Strong Exponential Time Hypothesis (SETH). Moreover, we show that computing the LCS of $k$ strings over an alphabet of size $O(k)$ cannot be done in $O(n^{k-ε})$ time, for any $ε>0$, under SETH. Finally, we also address the time complexity of approximating the DTWD of two strings in truly subquadratic time.

preprint2015arXiv

Subtree Isomorphism Revisited

The Subtree Isomorphism problem asks whether a given tree is contained in another given tree. The problem is of fundamental importance and has been studied since the 1960s. For some variants, e.g., ordered trees, near-linear time algorithms are known, but for the general case truly subquadratic algorithms remain elusive. Our first result is a reduction from the Orthogonal Vectors problem to Subtree Isomorphism, showing that a truly subquadratic algorithm for the latter refutes the Strong Exponential Time Hypothesis (SETH). In light of this conditional lower bound, we focus on natural special cases for which no truly subquadratic algorithms are known. We classify these cases against the quadratic barrier, showing in particular that: -- Even for binary, rooted trees, a truly subquadratic algorithm refutes SETH. -- Even for rooted trees of depth $O(\log\log{n})$, where $n$ is the total number of vertices, a truly subquadratic algorithm refutes SETH. -- For every constant $d$, there is a constant $ε_d>0$ and a randomized, truly subquadratic algorithm for degree-$d$ rooted trees of depth at most $(1+ ε_d) \log_{d}{n}$. In particular, there is an $O(\min\{ 2.85^h ,n^2 \})$ algorithm for binary trees of depth $h$. Our reductions utilize new "tree gadgets" that are likely useful for future SETH-based lower bounds for problems on trees. Our upper bounds apply a folklore result from randomized decision tree complexity.

preprint2012arXiv

Optimal quantum query bounds for almost all Boolean functions

We show that almost all n-bit Boolean functions have bounded-error quantum query complexity at least n/2, up to lower-order terms. This improves over an earlier n/4 lower bound of Ambainis, and shows that van Dam's oracle interrogation is essentially optimal for almost all functions. Our proof uses the fact that the acceptance probability of a T-query algorithm can be written as the sum of squares of degree-T polynomials.

preprint2011arXiv

Quantum strategies are better than classical in almost any XOR game

We initiate a study of random instances of nonlocal games. We show that quantum strategies are better than classical for almost any 2-player XOR game. More precisely, for large n, the entangled value of a random 2-player XOR game with n questions to every player is at least 1.21... times the classical value, for 1-o(1) fraction of all 2-player XOR games.

preprint2011arXiv

Search by quantum walks on two-dimensional grid without amplitude amplification

We study search by quantum walk on a finite two dimensional grid. The algorithm of Ambainis, Kempe, Rivosh (quant-ph/0402107) takes O(\sqrt{N log N}) steps and finds a marked location with probability O(1/log N) for grid of size \sqrt{N} * \sqrt{N}. This probability is small, thus amplitude amplification is needed to achieve Θ(1) success probability. The amplitude amplification adds an additional O(\sqrt{log N}) factor to the number of steps, making it O(\sqrt{N} log N). In this paper, we show that despite a small probability to find a marked location, the probability to be within an O(\sqrt{N}) neighbourhood (at an O(\sqrt[4]{N}) distance) of the marked location is Θ(1). This allows to skip amplitude amplification step and leads to an O(\sqrt{log N}) speed-up. We describe the results of numerical experiments supporting this idea, and we prove this fact analytically.

preprint2011arXiv

Worst case analysis of non-local games

Non-local games are studied in quantum information because they provide a simple way for proving the difference between the classical world and the quantum world. A non-local game is a cooperative game played by 2 or more players against a referee. The players cannot communicate but may share common random bits or a common quantum state. A referee sends an input $x_i$ to the $i^{th}$ player who then responds by sending an answer $a_i$ to the referee. The players win if the answers $a_i$ satisfy a condition that may depend on the inputs $x_i$. Typically, non-local games are studied in a framework where the referee picks the inputs from a known probability distribution. We initiate the study of non-local games in a worst-case scenario when the referee's probability distribution is unknown and study several non-local games in this scenario.

Arturs Backurs

What is connected

Connect this record

See the researcher in context

Building this map preview

15 published item(s)

Differentially Private Fine-tuning of Language Models

Differentially Private Model Compression

Active Local Learning

Submodular Clustering in Low Dimensions

Improving Viterbi is Hard: Better Runtimes Imply Faster Clique Algorithms

Tight Hardness Results for Maximum Weight Rectangles

Which Regular Expression Patterns are Hard to Match?

If the Current Clique Algorithms are Optimal, so is Valiant's Parser

Nearly-optimal bounds for sparse recovery in generic norms, with applications to $k$-median sketching

Quadratic-Time Hardness of LCS and other Sequence Similarity Measures

Subtree Isomorphism Revisited

Optimal quantum query bounds for almost all Boolean functions

Quantum strategies are better than classical in almost any XOR game

Search by quantum walks on two-dimensional grid without amplitude amplification

Worst case analysis of non-local games