Source author record

Jan Leike

Jan Leike appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Artificial Intelligence Machine Learning Logic in Computer Science math.ST Statistics Theory Computation and Language Computer Science and Game Theory Cryptography and Security math.PR

Catalog footprint

What is connected

24works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

We introduce enhanced Constitutional Classifiers that deliver production-grade jailbreak robustness with dramatically reduced computational costs and refusal rates compared to previous-generation defenses. Our system combines several key insights. First, we develop exchange classifiers that evaluate model responses in their full conversational context, which addresses vulnerabilities in last-generation systems that examine outputs in isolation. Second, we implement a two-stage classifier cascade where lightweight classifiers screen all traffic and escalate only suspicious exchanges to more expensive classifiers. Third, we train efficient linear probe classifiers and ensemble them with external classifiers to simultaneously improve robustness and reduce computational costs. Together, these techniques yield a production-grade system achieving a 40x computational cost reduction compared to our baseline exchange classifier, while maintaining a 0.05% refusal rate on production traffic. Through extensive red-teaming comprising over 1,700 hours, we demonstrate strong protection against universal jailbreaks -- no attack on this system successfully elicited responses to all eight target queries comparable in detail to an undefended model. Our work establishes Constitutional Classifiers as practical and efficient safeguards for large language models.

preprint2026arXiv

Excess Description Length of Learning Generalizable Predictors

Understanding whether fine-tuning elicits latent capabilities or teaches new ones is a fundamental question for language model evaluation and safety. We develop a formal information-theoretic framework for quantifying how much predictive structure fine-tuning extracts from the train dataset and writes into a model's parameters. Our central quantity, Excess Description Length (EDL), is defined via prequential coding and measures the gap between the bits required to encode training labels sequentially using an evolving model (trained online) and the residual encoding cost under the final trained model. We establish that EDL is non-negative in expectation, converges to surplus description length in the infinite-data limit, and provides bounds on expected generalization gain. Through a series of toy models, we clarify common confusions about information in learning: why random labels yield EDL near zero, how a single example can eliminate many bits of uncertainty about the underlying rule(s) that describe the data distribution, why structure learned on rare inputs contributes proportionally little to expected generalization, and how format learning creates early transients distinct from capability acquisition. This framework provides rigorous foundations for the empirical observation that capability elicitation and teaching exhibit qualitatively distinct scaling signatures.

preprint2022arXiv

Safe Deep RL in 3D Environments using Human Feedback

Agents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constraints can be difficult or impossible for many real-world tasks. A recently introduced technique, ReQueST, aims to solve this problem by learning a neural simulator of the environment from safe human trajectories, then using the learned simulator to efficiently learn a reward model from human feedback. However, it is yet unknown whether this approach is feasible in complex 3D environments with feedback obtained from real humans - whether sufficient pixel-based neural simulator quality can be achieved, and whether the human data requirements are viable in terms of both quantity and quality. In this paper we answer this question in the affirmative, using ReQueST to train an agent to perform a 3D first-person object collection task using data entirely from human contractors. We show that the resulting agent exhibits an order of magnitude reduction in unsafe behaviour compared to standard reinforcement learning.

preprint2022arXiv

Training language models to follow instructions with human feedback

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

preprint2020arXiv

Hidden Incentives for Auto-Induced Distributional Shift

Decisions made by machine learning systems have increasing influence on the world, yet it is common for machine learning algorithms to assume that no such influence exists. An example is the use of the i.i.d. assumption in content recommendation. In fact, the (choice of) content displayed can change users' perceptions and preferences, or even drive them away, causing a shift in the distribution of users. We introduce the term auto-induced distributional shift (ADS) to describe the phenomenon of an algorithm causing a change in the distribution of its own inputs. Our goal is to ensure that machine learning systems do not leverage ADS to increase performance when doing so could be undesirable. We demonstrate that changes to the learning algorithm, such as the introduction of meta-learning, can cause hidden incentives for auto-induced distributional shift (HI-ADS) to be revealed. To address this issue, we introduce `unit tests' and a mitigation strategy for HI-ADS, as well as a toy environment for modelling real-world issues with HI-ADS in content recommendation, where we demonstrate that strong meta-learners achieve gains in performance via ADS. We show meta-learning and Q-learning both sometimes fail unit tests, but pass when using our mitigation strategy.

preprint2020arXiv

Pitfalls of learning a reward function online

In some agent designs like inverse reinforcement learning an agent needs to learn its own reward function. Learning the reward function and optimising for it are typically two different processes, usually performed at different stages. We consider a continual (``one life'') learning approach where the agent both learns the reward function and optimises for it at the same time. We show that this comes with a number of pitfalls, such as deliberately manipulating the learning process in one direction, refusing to learn, ``learning'' facts already known to the agent, and making decisions that are strictly dominated (for all relevant reward functions). We formally introduce two desirable properties: the first is `unriggability', which prevents the agent from steering the learning process in the direction of a reward function that is easier to optimise. The second is `uninfluenceability', whereby the reward-function learning process operates by learning facts about the environment. We show that an uninfluenceable process is automatically unriggable, and if the set of possible environments is sufficiently rich, the converse is true too.

preprint2016arXiv

A Formal Solution to the Grain of Truth Problem

A Bayesian agent acting in a multi-agent environment learns to predict the other agents' policies if its prior assigns positive probability to them (in other words, its prior contains a \emph{grain of truth}). Finding a reasonably large class of policies that contains the Bayes-optimal policies with respect to this class is known as the \emph{grain of truth problem}. Only small classes are known to have a grain of truth and the literature contains several related impossibility results. In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of policies that contains all computable policies as well as Bayes-optimal policies for every lower semicomputable prior over the class. When the environment is unknown, Bayes-optimal agents may fail to act optimally even asymptotically. However, agents based on Thompson sampling converge to play ε-Nash equilibria in arbitrary unknown computable multi-agent environments. While these results are purely theoretical, we show that they can be computationally approximated arbitrarily closely.

preprint2016arXiv

Exploration Potential

We introduce exploration potential, a quantity that measures how much a reinforcement learning agent has explored its environment class. In contrast to information gain, exploration potential takes the problem's reward structure into account. This leads to an exploration criterion that is both necessary and sufficient for asymptotic optimality (learning to act optimally across the entire environment class). Our experiments in multi-armed bandits use exploration potential to illustrate how different algorithms make the tradeoff between exploration and exploitation.

preprint2016arXiv

Geometric Nontermination Arguments

We present a new kind of nontermination argument, called geometric nontermination argument. The geometric nontermination argument is a finite representation of an infinite execution that has the form of a sum of several geometric series. For so-called linear lasso programs we can decide the existence of a geometric nontermination argument using a nonlinear algebraic $\exists$-constraint. We show that a deterministic conjunctive loop program with nonnegative eigenvalues is nonterminating if an only if there exists a geometric nontermination argument. Furthermore, we present an evaluation that demonstrates that our method is feasible in practice.

preprint2016arXiv

Loss Bounds and Time Complexity for Speed Priors

This paper establishes for the first time the predictive performance of speed priors and their computational complexity. A speed prior is essentially a probability distribution that puts low probability on strings that are not efficiently computable. We propose a variant to the original speed prior (Schmidhuber, 2002), and show that our prior can predict sequences drawn from probability measures that are estimable in polynomial time. Our speed prior is computable in doubly-exponential time, but not in polynomial time. On a polynomial time computable sequence our speed prior is computable in exponential time. We show better upper complexity bounds for Schmidhuber's speed prior under the same conditions, and that it predicts deterministic sequences that are computable in polynomial time; however, we also show that it is not computable in polynomial time, and the question of its predictive properties for stochastic sequences remains open.

preprint2016arXiv

Nonparametric General Reinforcement Learning

Reinforcement learning (RL) problems are often phrased in terms of Markov decision processes (MDPs). In this thesis we go beyond MDPs and consider RL in environments that are non-Markovian, non-ergodic and only partially observable. Our focus is not on practical algorithms, but rather on the fundamental underlying problems: How do we balance exploration and exploitation? How do we explore optimally? When is an agent optimal? We follow the nonparametric realizable paradigm. We establish negative results on Bayesian RL agents, in particular AIXI. We show that unlucky or adversarial choices of the prior cause the agent to misbehave drastically. Therefore Legg-Hutter intelligence and balanced Pareto optimality, which depend crucially on the choice of the prior, are entirely subjective. Moreover, in the class of all computable environments every policy is Pareto optimal. This undermines all existing optimality properties for AIXI. However, there are Bayesian approaches to general RL that satisfy objective optimality guarantees: We prove that Thompson sampling is asymptotically optimal in stochastic environments in the sense that its value converges to the value of the optimal policy. We connect asymptotic optimality to regret given a recoverability assumption on the environment that allows the agent to recover from mistakes. Hence Thompson sampling achieves sublinear regret in these environments. Our results culminate in a formal solution to the grain of truth problem: A Bayesian agent acting in a multi-agent environment learns to predict the other agents' policies if its prior assigns positive probability to them (the prior contains a grain of truth). We construct a large but limit computable class containing a grain of truth and show that agents based on Thompson sampling over this class converge to play Nash equilibria in arbitrary unknown computable multi-agent environments.

preprint2016arXiv

Thompson Sampling is Asymptotically Optimal in General Environments

We discuss a variant of Thompson sampling for nonparametric reinforcement learning in a countable classes of general stochastic environments. These environments can be non-Markov, non-ergodic, and partially observable. We show that Thompson sampling learns the environment class in the sense that (1) asymptotically its value converges to the optimal value in mean and (2) given a recoverability assumption regret is sublinear.

preprint2015arXiv

A Definition of Happiness for Reinforcement Learning Agents

What is happiness for reinforcement learning agents? We seek a formal definition satisfying a list of desiderata. Our proposed definition of happiness is the temporal difference error, i.e. the difference between the value of the obtained reward and observation and the agent's expectation of this value. This definition satisfies most of our desiderata and is compatible with empirical research on humans. We state several implications and discuss examples.

preprint2015arXiv

Bad Universal Priors and Notions of Optimality

A big open question of algorithmic information theory is the choice of the universal Turing machine (UTM). For Kolmogorov complexity and Solomonoff induction we have invariance theorems: the choice of the UTM changes bounds only by a constant. For the universally intelligent agent AIXI (Hutter, 2005) no invariance theorem is known. Our results are entirely negative: we discuss cases in which unlucky or adversarial choices of the UTM cause AIXI to misbehave drastically. We show that Legg-Hutter intelligence and thus balanced Pareto optimality is entirely subjective, and that every policy is Pareto optimal in the class of all computable environments. This undermines all existing optimality properties for AIXI. While it may still serve as a gold standard for AI, our results imply that AIXI is a relative theory, dependent on the choice of the UTM.

preprint2015arXiv

On the Computability of AIXI

How could we solve the machine learning and the artificial intelligence problem if we had infinite computation? Solomonoff induction and the reinforcement learning agent AIXI are proposed answers to this question. Both are known to be incomputable. In this paper, we quantify this using the arithmetical hierarchy, and prove upper and corresponding lower bounds for incomputability. We show that AIXI is not limit computable, thus it cannot be approximated using finite computation. Our main result is a limit-computable ε-optimal version of AIXI with infinite horizon that maximizes expected rewards.

preprint2015arXiv

On the Computability of Solomonoff Induction and Knowledge-Seeking

Solomonoff induction is held as a gold standard for learning, but it is known to be incomputable. We quantify its incomputability by placing various flavors of Solomonoff's prior M in the arithmetical hierarchy. We also derive computability bounds for knowledge-seeking agents, and give a limit-computable weakly asymptotically optimal reinforcement learning agent.

preprint2015arXiv

Sequential Extensions of Causal and Evidential Decision Theory

Moving beyond the dualistic view in AI where agent and environment are separated incurs new challenges for decision making, as calculation of expected utility is no longer straightforward. The non-dualistic decision theory literature is split between causal decision theory and evidential decision theory. We extend these decision algorithms to the sequential setting where the agent alternates between taking actions and observing their consequences. We find that evidential decision theory has two natural extensions while causal decision theory only has one.

preprint2015arXiv

Solomonoff Induction Violates Nicod's Criterion

Nicod's criterion states that observing a black raven is evidence for the hypothesis H that all ravens are black. We show that Solomonoff induction does not satisfy Nicod's criterion: there are time steps in which observing black ravens decreases the belief in H. Moreover, while observing any computable infinite string compatible with H, the belief in H decreases infinitely often when using the unnormalized Solomonoff prior, but only finitely often when using the normalized Solomonoff prior. We argue that the fault is not with Solomonoff induction; instead we should reject Nicod's criterion.

preprint2014arXiv

Geometric Series as Nontermination Arguments for Linear Lasso Programs

We present a new kind of nontermination argument for linear lasso programs, called geometric nontermination argument. A geometric nontermination argument is a finite representation of an infinite execution of the form $(\vec{x} + \sum_{i=0}^t λ^i \vec{y})_{t \geq 0}$. The existence of this nontermination argument can be stated as a set of nonlinear algebraic constraints. We show that every linear loop program that has a bounded infinite execution also has a geometric nontermination argument. Furthermore, we discuss nonterminating programs that do not have a geometric nontermination argument.

preprint2014arXiv

Indefinitely Oscillating Martingales

We construct a class of nonnegative martingale processes that oscillate indefinitely with high probability. For these processes, we state a uniform rate of the number of oscillations and show that this rate is asymptotically close to the theoretical upper bound. These bounds on probability and expectation of the number of upcrossings are compared to classical bounds from the martingale literature. We discuss two applications. First, our results imply that the limit of the minimum description length operator may not exist. Second, we give bounds on how often one can change one's belief in a given hypothesis when observing a stream of data.

preprint2014arXiv

Linear Ranking for Linear Lasso Programs

The general setting of this work is the constraint-based synthesis of termination arguments. We consider a restricted class of programs called lasso programs. The termination argument for a lasso program is a pair of a ranking function and an invariant. We present the---to the best of our knowledge---first method to synthesize termination arguments for lasso programs that uses linear arithmetic. We prove a completeness theorem. The completeness theorem establishes that, even though we use only linear (as opposed to non-linear) constraint solving, we are able to compute termination arguments in several interesting cases. The key to our method lies in a constraint transformation that replaces a disjunction by a sum.

preprint2014arXiv

Ranking Function Synthesis for Linear Lasso Programs

The scope of this work is the constraint-based synthesis of termination arguments for the restricted class of programs called linear lasso programs. A termination argument consists of a ranking function as well as a set of supporting invariants. We extend existing methods in several ways. First, we use Motzkin's Transposition Theorem instead of Farkas' Lemma. This allows us to consider linear lasso programs that can additionally contain strict inequalities. Existing methods are restricted to non-strict inequalities and equalities. Second, we consider several kinds of ranking functions: affine-linear, piecewise and lexicographic ranking functions. Moreover, we present a novel kind of ranking function called multiphase ranking function which proceeds through a fixed number of phases such that for each phase, there is an affine-linear ranking function. As an abstraction to the synthesis of specific ranking functions, we introduce the notion ranking function template. This enables us to handle all ranking functions in a unified way. Our method relies on non-linear algebraic constraint solving as a subroutine which is known to scale poorly to large problems. As a mitigation we formalize an assessment of the difficulty of our constraints and present an argument why they are of an easier kind than general non-linear constraints. We prove our method to be complete: if there is a termination argument of the form specified by the given ranking function template with a fixed number of affine-linear supporting invariants, then our method will find a termination argument. To our knowledge, the approach we propose is the most powerful technique of synthesis-based discovery of termination arguments for linear lasso programs and encompasses and enhances several methods having been proposed thus far.

preprint2014arXiv

Ranking Templates for Linear Loops

We present a new method for the constraint-based synthesis of termination arguments for linear loop programs based on linear ranking templates. Linear ranking templates are parametrized, well-founded relations such that an assignment to the parameters gives rise to a ranking function. This approach generalizes existing methods and enables us to use templates for many different ranking functions with affine-linear components. We discuss templates for multiphase, piecewise, and lexicographic ranking functions. Because these ranking templates require both strict and non-strict inequalities, we use Motzkin's Transposition Theorem instead of Farkas Lemma to transform the generated $\exists\forall$-constraint into an $\exists$-constraint.

preprint2013arXiv

Synthesis for Polynomial Lasso Programs

We present a method for the synthesis of polynomial lasso programs. These programs consist of a program stem, a set of transitions, and an exit condition, all in the form of algebraic assertions (conjunctions of polynomial equalities). Central to this approach is the discovery of non-linear (algebraic) loop invariants. We extend Sankaranarayanan, Sipma, and Manna's template-based approach and prove a completeness criterion. We perform program synthesis by generating a constraint whose solution is a synthesized program together with a loop invariant that proves the program's correctness. This constraint is non-linear and is passed to an SMT solver. Moreover, we can enforce the termination of the synthesized program with the support of test cases.

Jan Leike

What is connected

Connect this record

See the researcher in context

Building this map preview

24 published item(s)

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

Excess Description Length of Learning Generalizable Predictors

Safe Deep RL in 3D Environments using Human Feedback

Training language models to follow instructions with human feedback

Hidden Incentives for Auto-Induced Distributional Shift

Pitfalls of learning a reward function online

A Formal Solution to the Grain of Truth Problem

Exploration Potential

Geometric Nontermination Arguments

Loss Bounds and Time Complexity for Speed Priors

Nonparametric General Reinforcement Learning

Thompson Sampling is Asymptotically Optimal in General Environments

A Definition of Happiness for Reinforcement Learning Agents

Bad Universal Priors and Notions of Optimality

On the Computability of AIXI

On the Computability of Solomonoff Induction and Knowledge-Seeking

Sequential Extensions of Causal and Evidential Decision Theory

Solomonoff Induction Violates Nicod's Criterion

Geometric Series as Nontermination Arguments for Linear Lasso Programs

Indefinitely Oscillating Martingales

Linear Ranking for Linear Lasso Programs

Ranking Function Synthesis for Linear Lasso Programs

Ranking Templates for Linear Loops

Synthesis for Polynomial Lasso Programs