Source author record

Aitor Lewkowycz

Aitor Lewkowycz appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: hep-th, Machine Learning, Computation and Language, Artificial Intelligence, gr-qc, quant-ph, cond-mat.stat-mech, Neural and Evolutionary Computing

Catalog footprint

What is connected: 17 works, 8 topics, 4 close collaborators

Preprint, 2022, arXiv

Language Model Cascades

Prompted models have demonstrated impressive few-shot learning abilities. Repeated interactions at test-time with a single model, or the composition of multiple models together, further expand capabilities. These compositions are probabilistic models, and may be expressed in the language of graphical models with random variables whose values are complex data types such as strings. Cases with control flow and dynamic structure require techniques from probabilistic programming, which allow implementing disparate model structures and inference strategies in a unified language. We formalize several existing techniques from this perspective, including scratchpads / chain of thought, verifiers, STaR, selection-inference, and tool use. We refer to the resulting programs as language model cascades.
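As a rough illustration of the graphical-model view described in this abstract, the toy sketch below treats a chain-of-thought cascade as question, then thought, then answer, and computes the answer marginal p(A | Q) = sum over T of p(T | Q) p(A | Q, T). The tables `P_THOUGHT` and `P_ANSWER` and the function `answer_marginal` are hypothetical stand-ins, not anything from the paper; with real language models the string-valued variables cannot be enumerated and would typically be sampled instead.

```python
from collections import defaultdict

# Hypothetical toy tables standing in for language model conditionals.
# P_THOUGHT[q][t] plays the role of p(T = t | Q = q).
P_THOUGHT = {"2+2": {"add the numbers": 0.8, "guess": 0.2}}
# P_ANSWER[(q, t)][a] plays the role of p(A = a | Q = q, T = t).
P_ANSWER = {
    ("2+2", "add the numbers"): {"4": 0.9, "5": 0.1},
    ("2+2", "guess"): {"4": 0.5, "5": 0.5},
}


def answer_marginal(question):
    """Return p(A | Q) by summing p(T | Q) * p(A | Q, T) over all thoughts T."""
    marginal = defaultdict(float)
    for thought, p_t in P_THOUGHT[question].items():
        for answer, p_a in P_ANSWER[(question, thought)].items():
            marginal[answer] += p_t * p_a
    return dict(marginal)


if __name__ == "__main__":
    print(answer_marginal("2+2"))
```

Exact enumeration is used here only because the toy tables are tiny; it is a design choice for readability, not a claim about how inference is done in practice.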

Preprint, 2022, arXiv

Scaling Up Models and Data with $\texttt{t5x}$ and $\texttt{seqio}$

Recent neural network-based language models have benefited greatly from scaling up the size of training datasets and the number of parameters in the models themselves. Scaling can be complicated due to various factors including the need to distribute computation on supercomputer clusters (e.g., TPUs), prevent bottlenecks when infeeding data, and ensure reproducible results. In this work, we present two software libraries that ease these issues: $\texttt{t5x}$ simplifies the process of building and training large language models at scale while maintaining ease of use, and $\texttt{seqio}$ provides a task-based API for simple creation of fast and reproducible training data and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on datasets with multiple terabytes of training data. Along with the libraries, we release configurations and instructions for T5-like encoder-decoder models as well as GPT-like decoder-only architectures. $\texttt{t5x}$ and $\texttt{seqio}$ are open source and available at https://github.com/google-research/t5x and https://github.com/google/seqio, respectively.

Preprint, 2022, arXiv

Solving Quantitative Reasoning Problems with Language Models

Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.

Preprint, 2021, arXiv

On the training dynamics of deep networks with $L_2$ regularization

We study the role of $L_2$ regularization in deep learning, and uncover simple relations between the performance of the model, the $L_2$ coefficient, the learning rate, and the number of training steps. These empirical relations hold when the network is overparameterized. They can be used to predict the optimal regularization parameter of a given model. In addition, based on these observations we propose a dynamical schedule for the regularization parameter that improves performance and speeds up training. We test these proposals in modern image classification settings. Finally, we show that these empirical relations can be understood theoretically in the context of infinitely wide networks. We derive the gradient flow dynamics of such networks, and compare the role of $L_2$ regularization in this context with that of linear models.

Preprint, 2021, arXiv

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating realistic text or synthesizing computer programs. However, they struggle with tasks that require unbounded multi-step computation, such as adding integers or executing programs. Surprisingly, we find that these same models are able to perform complex multi-step computations -- even in the few-shot regime -- when asked to perform the operation "step by step", showing the results of intermediate computations. In particular, we train transformers to perform multi-step computations by asking them to emit intermediate computation steps into a "scratchpad". On a series of increasingly complex tasks ranging from long addition to the execution of arbitrary programs, we show that scratchpads dramatically improve the ability of language models to perform multi-step computations.
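To make the scratchpad idea concrete for the long-addition task mentioned above, here is a minimal sketch that writes out the intermediate steps of digit-by-digit addition as text. The trace layout, the `<scratch>` markers, and the function name `addition_scratchpad` are assumptions made for illustration and are not claimed to match the format used in the paper; the sketch also assumes non-negative integers.

```python
def addition_scratchpad(a: int, b: int) -> str:
    """Return a text trace of column-wise addition, least significant digit first.

    Assumes a and b are non-negative integers.
    """
    xs, ys = str(a)[::-1], str(b)[::-1]  # reversed digit strings
    lines = [f"Input: {a} + {b}", "<scratch>"]
    carry, digits = 0, []
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        total = x + y + carry
        digits.append(str(total % 10))
        lines.append(
            f"{x} + {y} + carry {carry} = {total}, "
            f"write {total % 10}, carry {total // 10}"
        )
        carry = total // 10
    if carry:
        digits.append(str(carry))  # leftover carry becomes the leading digit
    result = "".join(reversed(digits))
    lines += ["</scratch>", f"Result: {result}"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(addition_scratchpad(29, 57))
```

A trace like this could serve as the target text for a model asked to work "step by step", so that each intermediate carry is emitted explicitly before the final answer.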

Preprint, 2020, arXiv

The large learning rate phase of deep learning: the catapult mechanism

The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. The networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. Furthermore, we find that the optimal performance in such settings is often found in the large learning rate phase. We believe our results shed light on characteristics of models trained at different learning rates. In particular, they fill a gap between existing wide neural network theory, and the nonlinear, large learning rate, training dynamics relevant to practice.

Preprint, 2016, arXiv

Deriving covariant holographic entanglement

We provide a gravitational argument in favour of the covariant holographic entanglement entropy proposal. In general time-dependent states, the proposal asserts that the entanglement entropy of a region in the boundary field theory is given by a quarter of the area of a bulk extremal surface in Planck units. The main element of our discussion is an implementation of an appropriate Schwinger-Keldysh contour to obtain the reduced density matrix (and its powers) of a given region, as is relevant for the replica construction. We map this contour into the bulk gravitational theory, and argue that the saddle point solutions of these replica geometries lead to a consistent prescription for computing the field theory Renyi entropies. In the limiting case where the replica index is taken to unity, a local analysis suffices to show that these saddles lead to the extremal surfaces of interest. We also comment on various properties of holographic entanglement that follow from this construction.

Preprint, 2015, arXiv

Relative entropy equals bulk relative entropy

We consider the gravity dual of the modular Hamiltonian associated to a general subregion of a boundary theory. We use it to argue that the relative entropy of nearby states is given by the relative entropy in the bulk, to leading order in the bulk gravitational coupling. We also argue that the boundary modular flow is dual to the bulk modular flow in the entanglement wedge, with implications for entanglement wedge reconstruction.

Preprint, 2014, arXiv

Exact results for the entanglement entropy and the energy radiated by a quark

We consider a spherical region with a heavy quark in the middle. We compute the extra entanglement entropy due to the presence of a heavy quark both in ${\cal N}=4$ Super Yang Mills and in the ${\cal N}=6$ Chern-Simons matter theory (ABJM). This is done by relating the computation to the expectation value of a circular Wilson loop and a stress tensor insertion. We also give an exact expression for the Bremsstrahlung function that determines the energy radiated by a quark in the ABJM theory.

Preprint, 2014, arXiv

Renyi entropy, stationarity, and entanglement of the conformal scalar

We extend previous work on the perturbative expansion of the Renyi entropy, $S_q$, around $q=1$ for a spherical entangling surface in a general CFT. Applied to conformal scalar fields in various spacetime dimensions, the results appear to conflict with the known conformal scalar Renyi entropies. On the other hand, the perturbative results agree with known Renyi entropies in a variety of other theories, including theories of free fermions and vector fields and theories with Einstein gravity duals. We propose a resolution stemming from a careful consideration of boundary conditions near the entangling surface. This is equivalent to a proper treatment of total-derivative terms in the definition of the modular Hamiltonian. As a corollary, we are able to resolve an outstanding puzzle in the literature regarding the Renyi entropy of ${\cal N}=4$ super-Yang-Mills near $q=1$. A related puzzle regards the question of stationarity of the renormalized entanglement entropy (REE) across a circle for a (2+1)-dimensional massive scalar field. We point out that the boundary contributions to the modular Hamiltonian shed light on the previously-observed non-stationarity. Moreover, IR divergences appear in perturbation theory about the massless fixed point that inhibit our ability to reliably calculate the REE at small non-zero mass.
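For readers unfamiliar with the quantity being expanded around $q=1$, the sketch below evaluates the standard Renyi entropy $S_q = \frac{1}{1-q}\log \sum_i p_i^q$ for a finite list of eigenvalues $p_i$ of a density matrix, and uses the von Neumann entropy $-\sum_i p_i \log p_i$ as the $q \to 1$ limit. The function name `renyi_entropy` and the example spectrum are illustrative assumptions; nothing here reproduces the field-theory calculation in the paper.

```python
import math


def renyi_entropy(probs, q):
    """Renyi entropy S_q = log(sum_i p_i^q) / (1 - q) of a probability spectrum.

    At q = 1 the formula is singular, so the limiting value
    -sum_i p_i log p_i (the von Neumann entropy) is returned instead.
    """
    ps = [p for p in probs if p > 0]  # zero eigenvalues do not contribute
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in ps)
    return math.log(sum(p ** q for p in ps)) / (1.0 - q)


if __name__ == "__main__":
    spectrum = [0.5, 0.25, 0.25]  # hypothetical eigenvalues, summing to 1
    for q in (0.5, 1.0, 2.0):
        print(q, renyi_entropy(spectrum, q))
```

For a flat spectrum every $S_q$ equals the logarithm of the number of nonzero eigenvalues, which makes a convenient sanity check.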

Preprint, 2014, arXiv

Universality in the geometric dependence of Renyi entropy

We derive several new results for Renyi entropy, $S_n$, across generic entangling surfaces. We establish a perturbative expansion of the Renyi entropy, valid in generic quantum field theories, in deformations of a given density matrix. When applied to even-dimensional conformal field theories, these results lead to new constraints on the $n$-dependence, independent of any perturbative expansion. In 4d CFTs, we show that the $n$-dependence of the universal part of the ground state Renyi entropy for entangling surfaces with vanishing extrinsic curvature contribution is in fact fully determined by the Renyi entropy across a sphere in flat space. Using holography, we thus provide the first computations of Renyi entropy across non-spherical entangling surfaces in strongly coupled 4d CFTs. Furthermore, we address the possibility that in a wide class of 4d CFTs, the flat space spherical Renyi entropy also fixes the $n$-dependence of the extrinsic curvature contribution, and hence that of arbitrary entangling surfaces. Our results have intriguing implications for the structure of generic modular Hamiltonians.

Preprint, 2013, arXiv

Generalized gravitational entropy

We consider classical Euclidean gravity solutions with a boundary. The boundary contains a non-contractible circle. These solutions can be interpreted as computing the trace of a density matrix in the full quantum gravity theory, in the classical approximation. When the circle is contractible in the bulk, we argue that the entropy of this density matrix is given by the area of a minimal surface. This is a generalization of the usual black hole entropy formula to Euclidean solutions without a Killing vector. A particular example of this setup appears in the computation of the entanglement entropy of a subregion of a field theory with a gravity dual. In this context, the minimal area prescription was proposed by Ryu and Takayanagi. Our arguments explain their conjecture.

Preprint, 2013, arXiv

Quantum corrections to holographic entanglement entropy

We consider entanglement entropy in quantum field theories with a gravity dual. In the gravity description, the leading order contribution comes from the area of a minimal surface, as proposed by Ryu-Takayanagi. Here we describe the one loop correction to this formula. The minimal surface divides the bulk into two regions. The bulk loop correction is essentially given by the bulk entanglement entropy between these two bulk regions. We perform some simple checks of this proposal.
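The structure described in this abstract can be written schematically as follows. The notation is an assumption chosen for illustration: $\gamma_A$ is the minimal surface anchored on the boundary region $A$, $\Sigma_A$ is the bulk region between $\gamma_A$ and $A$, and $G_N$ is Newton's constant. The dots stand for further one-loop terms that the abstract's "essentially" leaves room for.

```latex
S_A \;=\; \frac{\mathrm{Area}(\gamma_A)}{4 G_N}
      \;+\; S_{\text{bulk}}(\Sigma_A)
      \;+\; \dots
```

The first term is the leading-order Ryu-Takayanagi contribution, and the second is the bulk entanglement entropy between the two regions into which the minimal surface divides the bulk.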

Preprint, 2012, arXiv

Exact results for static and radiative fields of a quark in N=4 super Yang-Mills

In this work (which supersedes our previous preprint arXiv:1112.2345) we determine the expectation value of the N=4 SU(N) SYM Lagrangian density operator in the presence of an infinitely heavy static particle in the symmetric representation of SU(N), by means of a D3-brane probe computation. The result that we obtain coincides with two previous computations of different observables, up to kinematical factors. We argue that these agreements go beyond the D-brane probe approximation, which leads us to propose an exact formula for the expectation value of various operators. In particular, we provide an expression for the total energy loss by radiation of a heavy particle in the fundamental representation.

Preprint, 2012, arXiv

Gluonic fields of a static particle to all orders in 1/N

We determine the expectation value of the gauge invariant operator Tr [F^2+... ] for N=4 SU(N) SYM, in the presence of an infinitely heavy static particle in the symmetric representation of SU(N). We carry out the computation in the context of the AdS/CFT correspondence, by considering the perturbation of the dilaton field caused by the presence of a D3 brane dual to such an external probe. We find that the effective chromo-electric charge of the probe has exactly the same expression as the one recently found in the computation of energy loss by radiation.

Preprint, 2012, arXiv

Holographic Entanglement Entropy and Confinement

We study the phase transition in the holographic entanglement entropy for various confining models. This transition occurs for the entanglement entropy of a strip at a critical value of the strip width. Our main interest is to examine the critical width for models with several parameters. For these models, the critical width, the glueball mass and the string tension all become functions of these two parameters. Comparing the behavior of the critical width in the entanglement entropy and these other scales, we find that $l_c$ seems to follow closely the deconfinement temperature and the glueball mass. The behavior of the string tension is similar to $l_c$, despite being parametrically smaller than the other quantities.

Preprint, 2012, arXiv

Observations on entanglement entropy in massive QFT's

We identify various universal contributions to the entanglement entropy for massive free fields. As well as the "area" terms found in [1], we find other geometric contributions of the form discussed in [2]. We also compute analogous contributions for a strongly coupled field theory using the AdS/CFT correspondence. In this case, we find the results for strong and weak coupling do not agree.
