Researcher profile

Arturs Backurs

Arturs Backurs contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 17 - UnverifiedVerification L1Unclaimed author
4works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

4 published item(s)

preprint2022arXiv

Differentially Private Fine-tuning of Language Models

We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve the state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of $87.8\%$ using RoBERTa-Large and $83.5\%$ using RoBERTa-Base with a privacy budget of $ε= 6.7$. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of $90.2\%$. Our findings are similar for natural language generation tasks. Privately fine-tuning with DART, GPT-2-Small, GPT-2-Medium, GPT-2-Large, and GPT-2-XL achieve BLEU scores of 38.5, 42.0, 43.1, and 43.8 respectively (privacy budget of $ε= 6.8,δ=$ 1e-5) whereas the non-private baseline is $48.1$. All our experiments suggest that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced.

preprint2022arXiv

Differentially Private Model Compression

Recent papers have shown that large pre-trained language models (LLMs) such as BERT, GPT-2 can be fine-tuned on private data to achieve performance comparable to non-private models for many downstream Natural Language Processing (NLP) tasks while simultaneously guaranteeing differential privacy. The inference cost of these models -- which consist of hundreds of millions of parameters -- however, can be prohibitively large. Hence, often in practice, LLMs are compressed before they are deployed in specific applications. In this paper, we initiate the study of differentially private model compression and propose frameworks for achieving 50% sparsity levels while maintaining nearly full performance. We demonstrate these ideas on standard GLUE benchmarks using BERT models, setting benchmarks for future research on this topic.

preprint2020arXiv

Active Local Learning

In this work we consider active local learning: given a query point $x$, and active access to an unlabeled training set $S$, output the prediction $h(x)$ of a near-optimal $h \in H$ using significantly fewer labels than would be needed to actually learn $h$ fully. In particular, the number of label queries should be independent of the complexity of $H$, and the function $h$ should be well-defined, independent of $x$. This immediately also implies an algorithm for distance estimation: estimating the value $opt(H)$ from many fewer labels than needed to actually learn a near-optimal $h \in H$, by running local learning on a few random query points and computing the average error. For the hypothesis class consisting of functions supported on the interval $[0,1]$ with Lipschitz constant bounded by $L$, we present an algorithm that makes $O(({1 / ε^6}) \log(1/ε))$ label queries from an unlabeled pool of $O(({L / ε^4})\log(1/ε))$ samples. It estimates the distance to the best hypothesis in the class to an additive error of $ε$ for an arbitrary underlying distribution. We further generalize our algorithm to more than one dimensions. We emphasize that the number of labels used is independent of the complexity of the hypothesis class which depends on $L$. Furthermore, we give an algorithm to locally estimate the values of a near-optimal function at a few query points of interest with number of labels independent of $L$. We also consider the related problem of approximating the minimum error that can be achieved by the Nadaraya-Watson estimator under a linear diagonal transformation with eigenvalues coming from a small range. For a $d$-dimensional pointset of size $N$, our algorithm achieves an additive approximation of $ε$, makes $\tilde{O}({d}/{ε^2})$ queries and runs in $\tilde{O}({d^2}/{ε^{d+4}}+{dN}/{ε^2})$ time.

preprint2020arXiv

Submodular Clustering in Low Dimensions

We study a clustering problem where the goal is to maximize the coverage of the input points by $k$ chosen centers. Specifically, given a set of $n$ points $P \subseteq \mathbb{R}^d$, the goal is to pick $k$ centers $C \subseteq \mathbb{R}^d$ that maximize the service $ \sum_{p \in P}\mathsfφ\bigl( \mathsf{d}(p,C) \bigr) $ to the points $P$, where $\mathsf{d}(p,C)$ is the distance of $p$ to its nearest center in $C$, and $\mathsfφ$ is a non-increasing service function $\mathsfφ : \mathbb{R}^+ \to \mathbb{R}^+$. This includes problems of placing $k$ base stations as to maximize the total bandwidth to the clients -- indeed, the closer the client is to its nearest base station, the more data it can send/receive, and the target is to place $k$ base stations so that the total bandwidth is maximized. We provide an $n^{\varepsilon^{-O(d)}}$ time algorithm for this problem that achieves a $(1-\varepsilon)$-approximation. Notably, the runtime does not depend on the parameter $k$ and it works for an arbitrary non-increasing service function $\mathsfφ : \mathbb{R}^+ \to \mathbb{R}^+$.