Source author record

Paul Christiano

Paul Christiano appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Artificial Intelligence Computation and Language Computational Complexity Computer Science and Game Theory cond-mat.dis-nn Data Structures and Algorithms Mathematical Software quant-ph Robotics Symbolic Computation Systems and Control

Catalog footprint

What is connected

14works

12topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Estimating the expected output of wide random MLPs more efficiently than sampling

By far the most common way to estimate an expected loss in machine learning is to draw samples, compute the loss on each one, and take the empirical average. However, sampling is not necessarily optimal. Given an MLP at initialization, we show how to estimate its expected output over Gaussian inputs without running samples through the network at all. Instead, we produce approximate representations of the distributions of activations at each layer, leveraging tools such as cumulants and Hermite expansions. We show both theoretically and empirically that for sufficiently wide networks, our estimator achieves a target mean squared error using substantially fewer FLOPs than Monte Carlo sampling. We find moreover that our methods perform particularly well at estimating the probabilities of rare events, and additionally demonstrate how they can be used for model training. Together, these findings suggest a path to producing models with a greatly reduced probability of catastrophic tail risks.

preprint2024arXiv

Evaluating Language-Model Agents on Realistic Autonomous Tasks

In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation" or ARA. We believe that systems capable of ARA could have wide-reaching and hard-to-anticipate consequences, and that measuring and forecasting ARA may be useful for informing measures around security, monitoring, and alignment. Additionally, once a system is capable of ARA, placing bounds on a system's capabilities may become significantly more difficult. We construct four simple example agents that combine language models with tools that allow them to take actions in the world. We then evaluate these agents on 12 tasks relevant to ARA. We find that these language model agents can only complete the easiest tasks from this list, although they make some progress on the more challenging tasks. Unfortunately, these evaluations are not adequate to rule out the possibility that near-future agents will be capable of ARA. In particular, we do not think that these evaluations provide good assurance that the ``next generation'' of language models (e.g. 100x effective compute scaleup on existing models) will not yield agents capable of ARA, unless intermediate evaluations are performed during pretraining. Relatedly, we expect that fine-tuning of the existing models could produce substantially more competent agents, even if the fine-tuning is not directly targeted at ARA.

preprint2022arXiv

Learning to summarize from human feedback

As language models become more powerful, training and evaluation are increasingly bottlenecked by the data and metrics used for a particular task. For example, summarization models are often trained to predict human reference summaries and evaluated using ROUGE, but both of these metrics are rough proxies for what we really care about -- summary quality. In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. We collect a large, high-quality dataset of human comparisons between summaries, train a model to predict the human-preferred summary, and use that model as a reward function to fine-tune a summarization policy using reinforcement learning. We apply our method to a version of the TL;DR dataset of Reddit posts and find that our models significantly outperform both human reference summaries and much larger models fine-tuned with supervised learning alone. Our models also transfer to CNN/DM news articles, producing summaries nearly as good as the human reference without any news-specific fine-tuning. We conduct extensive analyses to understand our human feedback dataset and fine-tuned models We establish that our reward model generalizes to new datasets, and that optimizing our reward model results in better summaries than optimizing ROUGE according to humans. We hope the evidence from our paper motivates machine learning researchers to pay closer attention to how their training loss affects the model behavior they actually want.

preprint2022arXiv

Training language models to follow instructions with human feedback

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

preprint2020arXiv

Fine-Tuning Language Models from Human Preferences

Reward learning enables the application of reinforcement learning (RL) to tasks where reward is defined by human judgment, building a model of reward by asking humans questions. Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language, and we believe reward learning for language is a key to making RL practical and safe for real-world tasks. In this paper, we build on advances in generative pretraining of language models to apply reward learning to four natural language tasks: continuing text with positive sentiment or physically descriptive language, and summarization tasks on the TL;DR and CNN/Daily Mail datasets. For stylistic continuation we achieve good results with only 5,000 comparisons evaluated by humans. For summarization, models trained with 60,000 comparisons copy whole sentences from the input but skip irrelevant preamble; this leads to reasonable ROUGE scores and very good performance according to our human labelers, but may be exploiting the fact that labelers rely on simple heuristics.

preprint2016arXiv

A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models

Generative adversarial networks (GANs) are a recently proposed class of generative models in which a generator is trained to optimize a cost function that is being simultaneously learned by a discriminator. While the idea of learning cost functions is relatively new to the field of generative modeling, learning costs has long been studied in control and reinforcement learning (RL) domains, typically for imitation learning from demonstrations. In these fields, learning cost function underlying observed behavior is known as inverse reinforcement learning (IRL) or inverse optimal control. While at first the connection between cost learning in RL and cost learning in generative modeling may appear to be a superficial one, we show in this paper that certain IRL methods are in fact mathematically equivalent to GANs. In particular, we demonstrate an equivalence between a sample-based algorithm for maximum entropy IRL and a GAN in which the generator's density can be evaluated and is provided as an additional input to the discriminator. Interestingly, maximum entropy IRL is a special case of an energy-based model. We discuss the interpretation of GANs as an algorithm for training energy-based models, and relate this interpretation to other recent work that seeks to connect GANs and EBMs. By formally highlighting the connection between GANs, IRL, and EBMs, we hope that researchers in all three communities can better identify and apply transferable ideas from one domain to another, particularly for developing more stable and scalable algorithms: a major challenge in all three domains.

preprint2016arXiv

Collaborative prediction with expert advice

Many practical learning systems aggregate data across many users, while learning theory traditionally considers a single learner who trusts all of their observations. A case in point is the foundational learning problem of prediction with expert advice. To date, there has been no theoretical study of the general collaborative version of prediction with expert advice, in which many users face a similar problem and would like to share their experiences in order to learn faster. A key issue in this collaborative framework is robustness: generally algorithms that aggregate data are vulnerable to manipulation by even a small number of dishonest users. We exhibit the first robust collaborative algorithm for prediction with expert advice. When all users are honest and have similar tastes our algorithm matches the performance of pooling data and using a traditional algorithm. But our algorithm also guarantees that adding users never significantly degrades performance, even if the additional users behave adversarially. We achieve strong guarantees even when the overwhelming majority of users behave adversarially. As a special case, our algorithm is extremely robust to variation amongst the users.

preprint2016arXiv

Concrete Problems in AI Safety

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society. In this paper we discuss one such potential impact: the problem of accidents in machine learning systems, defined as unintended and harmful behavior that may emerge from poor design of real-world AI systems. We present a list of five practical research problems related to accident risk, categorized according to whether the problem originates from having the wrong objective function ("avoiding side effects" and "avoiding reward hacking"), an objective function that is too expensive to evaluate frequently ("scalable supervision"), or undesirable behavior during the learning process ("safe exploration" and "distributional shift"). We review previous work in these areas as well as suggesting research directions with a focus on relevance to cutting-edge AI systems. Finally, we consider the high-level question of how to think most productively about the safety of forward-looking applications of AI.

preprint2016arXiv

Theano: A Python framework for fast computation of mathematical expressions

Theano is a Python library that allows to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements. Theano is being actively and continuously developed since 2008, multiple frameworks have been built on top of it and it has been used to produce many state-of-the-art machine learning models. The present article is structured as follows. Section I provides an overview of the Theano software and its community. Section II presents the principal features of Theano and how to use them, and compares them with other similar projects. Section III focuses on recently-introduced functionalities and improvements. Section IV compares the performance of Theano against Torch7 and TensorFlow on several machine learning models. Section V discusses current limitations of Theano and potential ways of improving it.

preprint2016arXiv

Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model

Developing control policies in simulation is often more practical and safer than directly running experiments in the real world. This applies to policies obtained from planning and optimization, and even more so to policies obtained from reinforcement learning, which is often very data demanding. However, a policy that succeeds in simulation often doesn't work when deployed on a real robot. Nevertheless, often the overall gist of what the policy does in simulation remains valid in the real world. In this paper we investigate such settings, where the sequence of states traversed in simulation remains reasonable for the real world, even if the details of the controls are not, as could be the case when the key differences lie in detailed friction, contact, mass and geometry properties. During execution, at each time step our approach computes what the simulation-based control policy would do, but then, rather than executing these controls on the real robot, our approach computes what the simulation expects the resulting next state(s) will be, and then relies on a learned deep inverse dynamics model to decide which real-world action is most suitable to achieve those next states. Deep models are only as good as their training data, and we also propose an approach for data collection to (incrementally) learn the deep inverse dynamics model. Our experiments shows our approach compares favorably with various baselines that have been developed for dealing with simulation to real world model discrepancy, including output error control and Gaussian dynamics adaptation.

preprint2014arXiv

Online Local Learning via Semidefinite Programming

In many online learning problems we are interested in predicting local information about some universe of items. For example, we may want to know whether two items are in the same cluster rather than computing an assignment of items to clusters; we may want to know which of two teams will win a game rather than computing a ranking of teams. Although finding the optimal clustering or ranking is typically intractable, it may be possible to predict the relationships between items as well as if you could solve the global optimization problem exactly. Formally, we consider an online learning problem in which a learner repeatedly guesses a pair of labels (l(x), l(y)) and receives an adversarial payoff depending on those labels. The learner's goal is to receive a payoff nearly as good as the best fixed labeling of the items. We show that a simple algorithm based on semidefinite programming can obtain asymptotically optimal regret in the case where the number of possible labels is O(1), resolving an open problem posed by Hazan, Kale, and Shalev-Schwartz. Our main technical contribution is a novel use and analysis of the log determinant regularizer, exploiting the observation that log det(A + I) upper bounds the entropy of any distribution with covariance matrix A.

preprint2014arXiv

Provably Manipulation-Resistant Reputation Systems

We consider a community of users who must make periodic decisions about whether to interact with one another. We propose a protocol which allows honest users to reliably interact with each other, while limiting the damage done by each malicious or incompetent user. The worst-case cost per user is sublinear in the average number of interactions per user and is independent of the number of users. Our guarantee holds simultaneously for every group of honest users. For example, multiple groups of users with incompatible tastes or preferences can coexist. As a motivating example, we consider a game where players have periodic opportunities to do one another favors but minimal ability to determine when a favor was done. In this setting, our protocol achieves nearly optimal collective welfare while remaining resistant to exploitation. Our results also apply to a collaborative filtering setting where users must make periodic decisions about whether to interact with resources such as movies or restaurants. In this setting, we guarantee that any set of honest users achieves a payoff nearly as good as if they had identified the optimal set of items in advance and then chosen to interact only with resources from that set.

preprint2012arXiv

Quantum Money from Hidden Subspaces

Forty years ago, Wiesner pointed out that quantum mechanics raises the striking possibility of money that cannot be counterfeited according to the laws of physics. We propose the first quantum money scheme that is (1) public-key, meaning that anyone can verify a banknote as genuine, not only the bank that printed it, and (2) cryptographically secure, under a "classical" hardness assumption that has nothing to do with quantum money. Our scheme is based on hidden subspaces, encoded as the zero-sets of random multivariate polynomials. A main technical advance is to show that the "black-box" version of our scheme, where the polynomials are replaced by classical oracles, is unconditionally secure. Previously, such a result had only been known relative to a quantum oracle (and even there, the proof was never published). Even in Wiesner's original setting -- quantum money that can only be verified by the bank -- we are able to use our techniques to patch a major security hole in Wiesner's scheme. We give the first private-key quantum money scheme that allows unlimited verifications and that remains unconditionally secure, even if the counterfeiter can interact adaptively with the bank. Our money scheme is simpler than previous public-key quantum money schemes, including a knot-based scheme of Farhi et al. The verifier needs to perform only two tests, one in the standard basis and one in the Hadamard basis -- matching the original intuition for quantum money, based on the existence of complementary observables. Our security proofs use a new variant of Ambainis's quantum adversary method, and several other tools that might be of independent interest.

preprint2010arXiv

Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs

We introduce a new approach to computing an approximately maximum s-t flow in a capacitated, undirected graph. This flow is computed by solving a sequence of electrical flow problems. Each electrical flow is given by the solution of a system of linear equations in a Laplacian matrix, and thus may be approximately computed in nearly-linear time. Using this approach, we develop the fastest known algorithm for computing approximately maximum s-t flows. For a graph having n vertices and m edges, our algorithm computes a (1-ε)-approximately maximum s-t flow in time \tilde{O}(mn^{1/3} ε^{-11/3}). A dual version of our approach computes a (1+ε)-approximately minimum s-t cut in time \tilde{O}(m+n^{4/3}\eps^{-8/3}), which is the fastest known algorithm for this problem as well. Previously, the best dependence on m and n was achieved by the algorithm of Goldberg and Rao (J. ACM 1998), which can be used to compute approximately maximum s-t flows in time \tilde{O}(m\sqrt{n}ε^{-1}), and approximately minimum s-t cuts in time \tilde{O}(m+n^{3/2}ε^{-3}).

Paul Christiano

What is connected

Connect this record

See the researcher in context

Building this map preview

14 published item(s)

Estimating the expected output of wide random MLPs more efficiently than sampling

Evaluating Language-Model Agents on Realistic Autonomous Tasks

Learning to summarize from human feedback

Training language models to follow instructions with human feedback

Fine-Tuning Language Models from Human Preferences

A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models

Collaborative prediction with expert advice

Concrete Problems in AI Safety

Theano: A Python framework for fast computation of mathematical expressions

Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model

Online Local Learning via Semidefinite Programming

Provably Manipulation-Resistant Reputation Systems

Quantum Money from Hidden Subspaces

Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs