Source author record

Roy Schwartz

Roy Schwartz appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computation and Language Data Structures and Algorithms Machine Learning Artificial Intelligence Computational Complexity Formal Languages and Automata Theory

Catalog footprint

What is connected

16works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

ABC: Attention with Bounded-memory Control

Transformer architectures have achieved state-of-the-art results on a variety of sequence modeling tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead prohibitive, especially for long sequences. Attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size grows linearly with the sequence length, and so does the overhead of reading from it. One way to improve the efficiency is to bound the memory size. We show that disparate approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC), and they vary in their organization of the memory. ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart. Second, this abstraction gives new insights--an established approach (Wang et al., 2020b) previously thought to be not applicable in causal attention, actually is. Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their heuristic memory-organizing functions with a learned, contextualized one. Our experiments on language modeling, machine translation, and masked language model finetuning show that our approach outperforms previous efficient attention models; compared to the strong transformer baselines, it significantly improves the inference time and space efficiency with no or negligible accuracy loss.

preprint2022arXiv

Data Contamination: From Memorization to Exploitation

Pretrained language models are typically trained on massive web-based datasets, which are often "contaminated" with downstream test sets. It is not clear to what extent models exploit the contaminated data for downstream tasks. We present a principled method to study this question. We pretrain BERT models on joint corpora of Wikipedia and labeled downstream datasets, and fine-tune them on the relevant task. Comparing performance between samples seen and unseen during pretraining enables us to define and quantify levels of memorization and exploitation. Experiments with two models and three downstream tasks show that exploitation exists in some cases, but in others the models memorize the contaminated data, but do not exploit it. We show that these two measures are affected by different factors such as the number of duplications of the contaminated data and the model size. Our results highlight the importance of analyzing massive web-scale datasets to verify that progress in NLP is obtained by better language understanding and not better data exploitation.

preprint2022arXiv

Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias

The size of pretrained models is increasing, and so is their performance on a variety of NLP tasks. However, as their memorization capacity grows, they might pick up more social biases. In this work, we examine the connection between model size and its gender bias (specifically, occupational gender bias). We measure bias in three masked language model families (RoBERTa, DeBERTa, and T5) in two setups: directly using prompt based method, and using a downstream task (Winogender). We find on the one hand that larger models receive higher bias scores on the former task, but when evaluated on the latter, they make fewer gender errors. To examine these potentially conflicting results, we carefully investigate the behavior of the different models on Winogender. We find that while larger models outperform smaller ones, the probability that their mistakes are caused by gender bias is higher. Moreover, we find that the proportion of stereotypical errors compared to anti-stereotypical ones grows with the model size. Our findings highlight the potential risks that can arise from increasing model size.

preprint2022arXiv

Measuring the Carbon Intensity of AI in Cloud Instances

By providing unprecedented access to computational resources, cloud computing has enabled rapid growth in technologies such as machine learning, the computational demands of which incur a high energy cost and a commensurate carbon footprint. As a result, recent scholarship has called for better estimates of the greenhouse gas impact of AI: data scientists today do not have easy or reliable access to measurements of this information, precluding development of actionable tactics. Cloud providers presenting information about software carbon intensity to users is a fundamental stepping stone towards minimizing emissions. In this paper, we provide a framework for measuring software carbon intensity, and propose to measure operational carbon emissions by using location-based and time-specific marginal emissions data per energy unit. We provide measurements of operational software carbon intensity for a set of modern models for natural language processing and computer vision, and a wide range of model sizes, including pretraining of a 6.1 billion parameter language model. We then evaluate a suite of approaches for reducing emissions on the Microsoft Azure cloud compute platform: using cloud instances in different geographic regions, using cloud instances at different times of day, and dynamically pausing cloud instances when the marginal carbon intensity is above a certain threshold. We confirm previous results that the geographic region of the data center plays a significant role in the carbon intensity for a given cloud instance, and find that choosing an appropriate region can have the largest operational emissions reduction impact. We also show that the time of day has notable impact on operational software carbon intensity. Finally, we conclude with recommendations for how machine learning practitioners can use software carbon intensity information to reduce environmental impact.

preprint2022arXiv

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out "easy" instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly-powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to "throw the baby out with the bathwater" and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.

preprint2022arXiv

TangoBERT: Reducing Inference Cost by using Cascaded Architecture

The remarkable success of large transformer-based models such as BERT, RoBERTa and XLNet in many NLP tasks comes with a large increase in monetary and environmental cost due to their high computational load and energy consumption. In order to reduce this computational load in inference time, we present TangoBERT, a cascaded model architecture in which instances are first processed by an efficient but less accurate first tier model, and only part of those instances are additionally processed by a less efficient but more accurate second tier model. The decision of whether to apply the second tier model is based on a confidence score produced by the first tier model. Our simple method has several appealing practical advantages compared to standard cascading approaches based on multi-layered transformer models. First, it enables higher speedup gains (average lower latency). Second, it takes advantage of batch size optimization for cascading, which increases the relative inference cost reductions. We report TangoBERT inference CPU speedup on four text classification GLUE tasks and on one reading comprehension task. Experimental results show that TangoBERT outperforms efficient early exit baseline models; on the the SST-2 task, it achieves an accuracy of 93.9% with a CPU speedup of 8.2x.

preprint2020arXiv

A Formal Hierarchy of RNN Architectures

We develop a formal hierarchy of the expressive capacity of RNN architectures. The hierarchy is based on two formal properties: space complexity, which measures the RNN's memory, and rational recurrence, defined as whether the recurrent update can be described by a weighted finite-state machine. We place several RNN variants within this hierarchy. For example, we prove the LSTM is not rational, which formally separates it from the related QRNN (Bradbury et al., 2016). We also show how these models' expressive capacity is expanded by stacking multiple layers or composing them with different pooling functions. Our results build on the theory of "saturated" RNNs (Merrill, 2019). While formally extending these findings to unsaturated RNNs is left to future work, we hypothesize that the practical learnable capacity of unsaturated RNNs obeys a similar hierarchy. Experimental findings from training unsaturated networks on formal languages support this conjecture.

preprint2020arXiv

A Mixture of $h-1$ Heads is Better than $h$ Heads

Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead "reallocate" them -- the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. Particularly, on the WMT14 English to German translation dataset, MAE improves over "transformer-base" by 0.8 BLEU, with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.

preprint2020arXiv

Fair Correlation Clustering

In this paper we study the problem of correlation clustering under fairness constraints. In the classic correlation clustering problem, we are given a complete graph where each edge is labeled positive or negative. The goal is to obtain a clustering of the vertices that minimizes disagreements -- the number of negative edges trapped inside a cluster plus positive edges between different clusters. We consider two variations of fairness constraint for the problem of correlation clustering where each node has a color, and the goal is to form clusters that do not over-represent vertices of any color. The first variant aims to generate clusters with minimum disagreements, where the distribution of a feature (e.g. gender) in each cluster is same as the global distribution. For the case of two colors when the desired ratio of the number of colors in each cluster is $1:p$, we get $\mathcal{O}(p^2)$-approximation algorithm. Our algorithm could be extended to the case of multiple colors. We prove this problem is NP-hard. The second variant considers relative upper and lower bounds on the number of nodes of any color in a cluster. The goal is to avoid violating upper and lower bounds corresponding to each color in each cluster while minimizing the total number of disagreements. Along with our theoretical results, we show the effectiveness of our algorithm to generate fair clusters by empirical evaluation on real world data sets.

preprint2020arXiv

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Fine-tuning pretrained contextual word embedding models to supervised downstream tasks has become commonplace in natural language processing. This process, however, is often brittle: even with the same hyperparameter values, distinct random seeds can lead to substantially different results. To better understand this phenomenon, we experiment with four datasets from the GLUE benchmark, fine-tuning BERT hundreds of times on each while varying only the random seeds. We find substantial performance increases compared to previously reported results, and we quantify how the performance of the best-found model varies as a function of the number of fine-tuning trials. Further, we examine two factors influenced by the choice of random seed: weight initialization and training data order. We find that both contribute comparably to the variance of out-of-sample performance, and that some weight initializations perform well across all tasks explored. On small datasets, we observe that many fine-tuning trials diverge part of the way through training, and we offer best practices for practitioners to stop training less promising runs early. We publicly release all of our experimental data, including training and validation scores for 2,100 trials, to encourage further analysis of training dynamics during fine-tuning.

preprint2020arXiv

The Right Tool for the Job: Matching Model and Instance Complexities

As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs. To better respect a given inference budget, we propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances, and late (and accurate) exit for hard instances. To achieve this, we add classifiers to different layers of BERT and use their calibrated confidence scores to make early exit decisions. We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks. Our method presents a favorable speed/accuracy tradeoff in almost all cases, producing models which are up to five times faster than the state of the art, while preserving their accuracy. Our method also requires almost no additional training resources (in either time or parameters) compared to the baseline BERT model. Finally, our method alleviates the need for costly retraining of multiple models at different levels of efficiency; we allow users to control the inference speed/accuracy tradeoff using a single trained model, by setting a single variable at inference time. We publicly release our code.

preprint2015arXiv

Online Submodular Maximization with Preemption

Submodular function maximization has been studied extensively in recent years under various constraints and models. The problem plays a major role in various disciplines. We study a natural online variant of this problem in which elements arrive one-by-one and the algorithm has to maintain a solution obeying certain constraints at all times. Upon arrival of an element, the algorithm has to decide whether to accept the element into its solution and may preempt previously chosen elements. The goal is to maximize a submodular function over the set of elements in the solution. We study two special cases of this general problem and derive upper and lower bounds on the competitive ratio. Specifically, we design a $1/e$-competitive algorithm for the unconstrained case in which the algorithm may hold any subset of the elements, and constant competitive ratio algorithms for the case where the algorithm may hold at most $k$ elements in its solution.

preprint2014arXiv

Comparing Apples and Oranges: Query Tradeoff in Submodular Maximization

Fast algorithms for submodular maximization problems have a vast potential use in applicative settings, such as machine learning, social networks, and economics. Though fast algorithms were known for some special cases, only recently Badanidiyuru and Vondrák (2014) were the first to explicitly look for such algorithms in the general case of maximizing a monotone submodular function subject to a matroid independence constraint. The algorithm of Badanidiyuru and Vondrák matches the best possible approximation guarantee, while trying to reduce the number of value oracle queries the algorithm performs. Our main result is a new algorithm for this general case which establishes a surprising tradeoff between two seemingly unrelated quantities: the number of value oracle queries and the number of matroid independence queries performed by the algorithm. Specifically, one can decrease the former by increasing the latter and vice versa, while maintaining the best possible approximation guarantee. Such a tradeoff is very useful since various applications might incur significantly different costs in querying the value and matroid independence oracles. Furthermore, in case the rank of the matroid is $O(n^c)$, where $n$ is the size of the ground set and $c$ is an absolute constant smaller than $1$, the total number of oracle queries our algorithm uses can be made to have a smaller magnitude compared to that needed by Badanidiyuru and Vondrák. We also provide even faster algorithms for the well studied special cases of a cardinality constraint and a partition matroid independence constraint, both of which capture many real-world applications and have been widely studied both theorically and in practice.

preprint2013arXiv

On the Approximation of Submodular Functions

Submodular functions are a fundamental object of study in combinatorial optimization, economics, machine learning, etc. and exhibit a rich combinatorial structure. Many subclasses of submodular functions have also been well studied and these subclasses widely vary in their complexity. Our motivation is to understand the relative complexity of these classes of functions. Towards this, we consider the question of how well can one class of submodular functions be approximated by another (simpler) class of submodular functions. Such approximations naturally allow algorithms designed for the simpler class to be applied to the bigger class of functions. We prove both upper and lower bounds on such approximations. Our main results are: 1. General submodular functions can be approximated by cut functions of directed graphs to a factor of $n^2/4$, which is tight. 2. General symmetric submodular functions$^{1}$ can be approximated by cut functions of undirected graphs to a factor of $n-1$, which is tight up to a constant. 3. Budgeted additive functions can be approximated by coverage functions to a factor of $e/(e-1)$, which is tight. Here $n$ is the size of the ground set on which the submodular function is defined. We also observe that prior works imply that monotone submodular functions can be approximated by coverage functions with a factor between $O(\sqrt{n} \log n)$ and $Ω(n^{1/3} /\log^2 n) $.

preprint2012arXiv

On the Furthest Hyperplane Problem and Maximal Margin Clustering

This paper introduces the Furthest Hyperplane Problem (FHP), which is an unsupervised counterpart of Support Vector Machines. Given a set of n points in Rd, the objective is to produce the hyperplane (passing through the origin) which maximizes the separation margin, that is, the minimal distance between the hyperplane and any input point. To the best of our knowledge, this is the first paper achieving provable results regarding FHP. We provide both lower and upper bounds to this NP-hard problem. First, we give a simple randomized algorithm whose running time is n^O(1/θ^2) where θ is the optimal separation margin. We show that its exponential dependency on 1/θ^2 is tight, up to sub-polynomial factors, assuming SAT cannot be solved in sub-exponential time. Next, we give an efficient approxima- tion algorithm. For any α \in [0, 1], the algorithm produces a hyperplane whose distance from at least 1 - 5α fraction of the points is at least α times the optimal separation margin. Finally, we show that FHP does not admit a PTAS by presenting a gap preserving reduction from a particular version of the PCP theorem.

preprint2011arXiv

Min-Max Graph Partitioning and Small Set Expansion

We study graph partitioning problems from a min-max perspective, in which an input graph on n vertices should be partitioned into k parts, and the objective is to minimize the maximum number of edges leaving a single part. The two main versions we consider are where the k parts need to be of equal-size, and where they must separate a set of k given terminals. We consider a common generalization of these two problems, and design for it an $O(\sqrt{\log n\log k})$-approximation algorithm. This improves over an $O(\log^2 n)$ approximation for the second version, and roughly $O(k\log n)$ approximation for the first version that follows from other previous work. We also give an improved O(1)-approximation algorithm for graphs that exclude any fixed minor. Our algorithm uses a new procedure for solving the Small-Set Expansion problem. In this problem, we are given a graph G and the goal is to find a non-empty set $S\subseteq V$ of size $|S| \leq ρn$ with minimum edge-expansion. We give an $O(\sqrt{\log{n}\log{(1/ρ)}})$ bicriteria approximation algorithm for the general case of Small-Set Expansion, and O(1) approximation algorithm for graphs that exclude any fixed minor.

Roy Schwartz

What is connected

Connect this record

See the researcher in context

Building this map preview

16 published item(s)

ABC: Attention with Bounded-memory Control

Data Contamination: From Memorization to Exploitation

Fewer Errors, but More Stereotypes? The Effect of Model Size on Gender Bias

Measuring the Carbon Intensity of AI in Cloud Instances

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

TangoBERT: Reducing Inference Cost by using Cascaded Architecture

A Formal Hierarchy of RNN Architectures

A Mixture of $h-1$ Heads is Better than $h$ Heads

Fair Correlation Clustering

Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

The Right Tool for the Job: Matching Model and Instance Complexities

Online Submodular Maximization with Preemption

Comparing Apples and Oranges: Query Tradeoff in Submodular Maximization

On the Approximation of Submodular Functions

On the Furthest Hyperplane Problem and Maximal Margin Clustering

Min-Max Graph Partitioning and Small Set Expansion