Source author record

Laurence Aitchison

Laurence Aitchison appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Neurons and Cognition; Machine Learning; Computation and Language; Neural and Evolutionary Computing

Catalog footprint

What is connected

9 works

4 topics

4 close collaborators

preprint, 2026, arXiv

Ministral 3

We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe for deriving the Ministral 3 models through Cascade Distillation, a technique that iterates pruning and continued training with distillation. Each model comes with image understanding capabilities, all under the Apache 2.0 license.

preprint, 2022, arXiv

Bayesian Neural Network Priors Revisited

Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions or give optimal performance. To find better priors, we study summary statistics of neural network weights in networks trained using stochastic gradient descent (SGD). We find that convolutional neural network (CNN) and ResNet weights display strong spatial correlations, while fully connected networks (FCNNs) display heavy-tailed weight distributions. We show that building these observations into priors can lead to improved performance on a variety of image classification datasets. Surprisingly, these priors mitigate the cold posterior effect in FCNNs, but slightly increase the cold posterior effect in ResNets.

preprint, 2022, arXiv

What deep reinforcement learning tells us about human motor learning and vice-versa

Machine learning, and specifically reinforcement learning (RL), has been extremely successful in helping us to understand neural decision making processes. However, RL's role in understanding other neural processes, especially motor learning, is much less well explored. To explore this connection, we investigated how recent deep RL methods correspond to the dominant motor learning framework in neuroscience, error-based learning. Error-based learning can be probed using a mirror reversal adaptation paradigm, where it produces distinctive qualitative predictions that are observed in humans. We therefore tested three major families of modern deep RL algorithms on a mirror reversal perturbation. Surprisingly, all of the algorithms failed to mimic human behaviour and indeed displayed qualitatively different behaviour from that predicted by error-based learning. To fill this gap, we introduce a novel deep RL algorithm: model-based deterministic policy gradients (MB-DPG). MB-DPG draws inspiration from error-based learning by explicitly relying on the observed outcome of actions. We show MB-DPG captures (human) error-based learning under mirror-reversal and rotational perturbation. Next, we demonstrate that error-based learning in the form of MB-DPG learns faster than canonical model-free algorithms on complex arm-based reaching tasks, while being more robust to (forward) model misspecification than model-based RL. These findings highlight the gap between current deep RL methods and human motor adaptation and offer a route to closing this gap, facilitating future beneficial interaction between the two fields.

preprint, 2020, arXiv

Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods

We formulate the problem of neural network optimization as Bayesian filtering, where the observations are the backpropagated gradients. While neural network optimization has previously been studied using natural gradient methods, which are closely related to Bayesian inference, they were unable to recover standard optimizers such as Adam and RMSprop with a root-mean-square gradient normalizer, instead getting a mean-square normalizer. To recover the root-mean-square normalizer, we find it necessary to account for the temporal dynamics of all the other parameters as they are being optimized. The resulting optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like behaviour, automatically recovers AdamW, a state-of-the-art variant of Adam with decoupled weight decay, and has generalisation performance competitive with SGD.
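The difference between a root-mean-square and a mean-square gradient normalizer can be illustrated with a small sketch. This is not AdaBayes and not code from the paper; it is a generic Adam-style running average, and the function name `normalized_steps` and its parameters are hypothetical choices made for illustration only.

```python
import numpy as np

def normalized_steps(grads, beta2=0.999, eps=1e-12):
    """Return (rms_step, ms_step) for the last gradient in `grads`.

    rms_step divides by the square root of the running mean of g**2
    (the Adam/RMSprop-style normalizer); ms_step divides by the running
    mean of g**2 itself (a mean-square normalizer).
    Hypothetical helper for illustration; not the AdaBayes update.
    """
    v = 0.0
    for t, g in enumerate(grads, start=1):
        # Exponential moving average of the squared gradient.
        v = beta2 * v + (1.0 - beta2) * g**2
    v_hat = v / (1.0 - beta2**t)  # Adam-style bias correction
    g = grads[-1]
    rms_step = g / (np.sqrt(v_hat) + eps)
    ms_step = g / (v_hat + eps)
    return rms_step, ms_step

# Same gradient sequence at two scales.
small = normalized_steps([2.0] * 50)
large = normalized_steps([20.0] * 50)
```

Under these assumptions the root-mean-square step is unchanged when every gradient is rescaled by a constant, while the mean-square step shrinks in inverse proportion to that constant, which is one way to see why the two normalizers behave differently in practice.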

preprint, 2020, arXiv

Why bigger is not always better: on finite and infinite neural networks

Recent work has argued that neural networks can be understood theoretically by taking the number of channels to infinity, at which point the outputs become Gaussian process (GP) distributed. However, we note that infinite Bayesian neural networks lack a key facet of the behaviour of real neural networks: the fixed kernel, determined only by network hyperparameters, implies that they cannot do any form of representation learning. The lack of representation or equivalently kernel learning leads to less flexibility and hence worse performance, giving a potential explanation for the inferior performance of infinite networks observed in the literature (e.g. Novak et al. 2019). We give analytic results characterising the prior over representations and representation learning in finite deep linear networks. We show empirically that the representations in SOTA architectures such as ResNets trained with SGD are much closer to those suggested by our deep linear results than by the corresponding infinite network. This motivates the introduction of a new class of network: infinite networks with bottlenecks, which inherit the theoretical tractability of infinite networks while at the same time allowing representation learning.

preprint, 2016, arXiv

The Hamiltonian brain: efficient probabilistic inference with excitatory-inhibitory neural circuit dynamics

Probabilistic inference offers a principled framework for understanding both behaviour and cortical computation. However, two basic and ubiquitous properties of cortical responses seem difficult to reconcile with probabilistic inference: neural activity displays prominent oscillations in response to constant input, and large transient changes in response to stimulus onset. Here we show that these dynamical behaviours may in fact be understood as hallmarks of the specific representation and algorithm that the cortex employs to perform probabilistic inference. We demonstrate that a particular family of probabilistic inference algorithms, Hamiltonian Monte Carlo (HMC), naturally maps onto the dynamics of excitatory-inhibitory neural networks. Specifically, we constructed a model of an excitatory-inhibitory circuit in primary visual cortex that performed HMC inference, and thus inherently gave rise to oscillations and transients. These oscillations were not mere epiphenomena but served an important functional role: speeding up inference by rapidly spanning a large volume of state space. Inference thus became an order of magnitude more efficient than in a non-oscillatory variant of the model. In addition, the network matched two specific properties of observed neural dynamics that would otherwise be difficult to account for in the context of probabilistic inference. First, the frequency of oscillations as well as the magnitude of transients increased with the contrast of the image stimulus. Second, excitation and inhibition were balanced, and inhibition lagged excitation. These results suggest a new functional role for the separation of cortical populations into excitatory and inhibitory neurons, and for the neural oscillations that emerge in such excitatory-inhibitory networks: enhancing the efficiency of cortical computations.
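For readers unfamiliar with Hamiltonian Monte Carlo, the following is a minimal textbook sketch of the algorithm on a standard Gaussian target. It is not the excitatory-inhibitory circuit model from the paper; the function names, step size, trajectory length, and target distribution are all assumptions chosen for illustration.

```python
import numpy as np

def leapfrog(q, p, grad_U, step, n_steps):
    """Leapfrog integration of Hamiltonian dynamics with unit mass."""
    q, p = np.array(q, dtype=float), np.array(p, dtype=float)
    p -= 0.5 * step * grad_U(q)          # initial half step in momentum
    for i in range(n_steps):
        q += step * p                    # full step in position
        if i < n_steps - 1:
            p -= step * grad_U(q)        # full step in momentum
    p -= 0.5 * step * grad_U(q)          # final half step in momentum
    return q, p

def hmc_sample(U, grad_U, q0, n_samples, step=0.2, n_steps=10, seed=0):
    """Generic HMC with a Metropolis accept/reject correction."""
    rng = np.random.default_rng(seed)
    q = np.array(q0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(q.shape)             # resample momentum
        q_new, p_new = leapfrog(q, p, grad_U, step, n_steps)
        h_old = U(q) + 0.5 * p @ p
        h_new = U(q_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:    # accept with prob min(1, exp(-dH))
            q = q_new
        samples.append(q.copy())
    return np.array(samples)

# Assumed target: 2-D standard Gaussian, U(q) = 0.5 * |q|^2.
def U(q):
    return 0.5 * q @ q

def grad_U(q):
    return q

samples = hmc_sample(U, grad_U, np.zeros(2), n_samples=2000)
```

On this quadratic potential the position and momentum variables rotate around each other, which gives a rough intuition for why HMC-like dynamics produce oscillations; the mapping onto excitatory and inhibitory populations is specific to the paper and is not reproduced here.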

preprint, 2016, arXiv

Zipf's law arises naturally in structured, high-dimensional data

Zipf's law, which states that the probability of an observation is inversely proportional to its rank, has been observed in many domains. While there are models that explain Zipf's law in each of them, those explanations are typically domain specific. Recently, methods from statistical physics were used to show that a fairly broad class of models does provide a general explanation of Zipf's law. This explanation rests on the observation that real world data is often generated from underlying causes, known as latent variables. Those latent variables mix together multiple models that do not obey Zipf's law, giving a model that does. Here we extend that work both theoretically and empirically. Theoretically, we provide a far simpler and more intuitive explanation of Zipf's law, which at the same time considerably extends the class of models to which this explanation can apply. Furthermore, we also give methods for verifying whether this explanation applies to a particular dataset. Empirically, these advances allowed us to extend this explanation to important classes of data, including word frequencies (the first domain in which Zipf's law was discovered), data with variable sequence length, and multi-neuron spiking activity.
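The definition in the first sentence (probability inversely proportional to rank) corresponds to a slope of about -1 on a log-log rank-frequency plot. A minimal sketch of that check follows; it is a generic least-squares fit and not the verification method proposed in the paper, and the function name `zipf_slope` is hypothetical.

```python
import numpy as np

def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is consistent with Zipf's law as defined above.
    Hypothetical helper; a simple diagnostic, not a statistical test.
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]                   # log is undefined at zero
    freq = np.sort(counts)[::-1] / counts.sum()   # frequencies, largest first
    ranks = np.arange(1, freq.size + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freq), 1)
    return slope

# Synthetic counts that follow 1/rank exactly, and a flat comparison case.
zipfian_slope = zipf_slope(1000.0 / np.arange(1, 501))
uniform_slope = zipf_slope(np.ones(500))
```

A straight-line fit over all ranks is sensitive to the noisy low-frequency tail of real data, so in practice the fitted range usually needs care.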

preprint, 2015, arXiv

Synaptic sampling: A connection between PSP variability and uncertainty explains neurophysiological observations

When an action potential is transmitted to a postsynaptic neuron, a small change in the postsynaptic neuron's membrane potential occurs. These small changes, known as postsynaptic potentials (PSPs), are highly variable, and current models assume that this variability is corrupting noise. In contrast, we show that this variability could have an important computational role: representing a synapse's uncertainty about the optimal synaptic weight (i.e. the best possible setting for the synaptic weight). We show that this link between uncertainty and variability, which we call synaptic sampling, leads to more accurate estimates of the uncertainty in task relevant quantities, leading to more effective decision making. Synaptic sampling makes four predictions, all of which have some experimental support. First, the more variable a synapse is, the more it should change during LTP protocols. Second, variability should increase as the presynaptic firing rate falls. Third, PSP variance should be proportional to PSP mean. Fourth, variability should increase with distance from the cell soma. We provide support for the first two predictions by reanalysing existing datasets, and we find preexisting data in support of the last two predictions.

preprint, 2014, arXiv

Fast sampling for Bayesian inference in neural circuits

Time is at a premium for recurrent network dynamics, and particularly so when they are stochastic and correlated: the quality of inference from such dynamics fundamentally depends on how fast the neural circuit generates new samples from its stationary distribution. Indeed, behavioral decisions can occur on fast time scales (~100 ms), but it is unclear what neural circuit dynamics afford sampling at such high rates. We analyzed a stochastic form of rate-based linear neuronal network dynamics with synaptic weight matrix $W$, and the dependence on $W$ of the covariance of the stationary distribution of joint firing rates. This covariance $Σ$ can be actively used to represent posterior uncertainty via sampling under a linear-Gaussian latent variable model. The key insight is that the mapping between $W$ and $Σ$ is degenerate: there are infinitely many $W$'s that lead to sampling from the same $Σ$ but differ greatly in the speed at which they sample. We were able to explicitly separate these extra degrees of freedom in a parametric form and thus study their effects on sampling speed. We show that previous proposals for probabilistic sampling in neural circuits correspond to using a symmetric $W$ which violates Dale's law and results in critically slow sampling, even for moderate stationary correlations. In contrast, optimizing network dynamics for speed consistently yielded asymmetric $W$'s and dynamics characterized by fast transients, such that samples of network activity became fully decorrelated over ~10 ms. Importantly, networks with separate excitatory/inhibitory populations proved to be particularly efficient samplers, and were in the balanced regime. Thus, plausible neural circuit dynamics can perform fast sampling for efficient decoding and inference.
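The degeneracy between $W$ and $Σ$ can be sketched numerically. The example assumes linear stochastic dynamics with drift matrix $A$ (for instance $A = W - I$, which is an assumption about the model form and is not stated in the abstract) and noise covariance $Q$, so that the stationary covariance satisfies the Lyapunov equation $AΣ + ΣA^T + Q = 0$. Any skew-symmetric matrix $S$ then gives a valid drift $A = (-Q/2 + S)Σ^{-1}$. This is a standard construction, not necessarily the parametric form used in the paper, and all function names are hypothetical.

```python
import numpy as np

def drift_for_covariance(sigma, noise_cov, skew):
    """Drift A with A @ sigma + sigma @ A.T + noise_cov = 0.

    `sigma` and `noise_cov` must be symmetric; `skew` must be skew-symmetric.
    Hypothetical helper for illustration.
    """
    return (-0.5 * noise_cov + skew) @ np.linalg.inv(sigma)

def lyapunov_residual(A, sigma, noise_cov):
    """Largest absolute entry of A sigma + sigma A^T + Q (zero at stationarity)."""
    return np.max(np.abs(A @ sigma + sigma @ A.T + noise_cov))

def slowest_decay_rate(A):
    """Decay rate of the slowest mode: minus the largest real part of eig(A)."""
    return -np.max(np.linalg.eigvals(A).real)

rng = np.random.default_rng(0)
n = 4
L = rng.standard_normal((n, n))
sigma = L @ L.T + n * np.eye(n)      # target stationary covariance (positive definite)
noise_cov = np.eye(n)                # assumed isotropic input noise
G = rng.standard_normal((n, n))
skew = G - G.T                       # arbitrary skew-symmetric matrix

A_sym = drift_for_covariance(sigma, noise_cov, np.zeros((n, n)))  # symmetric drift
A_asym = drift_for_covariance(sigma, noise_cov, skew)             # asymmetric drift

rate_sym = slowest_decay_rate(A_sym)
rate_asym = slowest_decay_rate(A_asym)
```

Both drift matrices leave the same $Σ$ stationary, so comparing `rate_sym` with `rate_asym` isolates the effect of the extra degrees of freedom on how quickly the dynamics decorrelate. With isotropic noise, as assumed here, the asymmetric drift cannot be slower than the symmetric one.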