Source author record

Daniel Soudry

Daniel Soudry appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Neurons and Cognition Neural and Evolutionary Computing Quantitative Methods Applications Artificial Intelligence Computer Vision Cell Behavior Emerging Technologies Hardware Architecture math.NA Numerical Analysis Subcellular Processes

Catalog footprint

What is connected

31works

13topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

From Continual Learning to SGD and Back: Better Rates for Continual Linear Models

We study the common continual learning setup where an overparameterized model is sequentially fitted to a set of jointly realizable tasks. We analyze forgetting, defined as the loss on previously seen tasks, after $k$ iterations. For continual linear models, we prove that fitting a task is equivalent to a single stochastic gradient descent (SGD) step on a modified objective. We develop novel last-iterate SGD upper bounds in the realizable least squares setup and leverage them to derive new results for continual learning. Focusing on random orderings over $T$ tasks, we establish universal forgetting rates, whereas existing rates depend on problem dimensionality or complexity and become prohibitive in highly overparameterized regimes. In continual regression with replacement, we improve the best existing rate from $O((d-\bar{r})/k)$ to $O(\min(1/\sqrt[4]{k}, \sqrt{(d-\bar{r})}/k, \sqrt{T\bar{r}}/k))$, where $d$ is the dimensionality and $\bar{r}$ the average task rank. Furthermore, we establish the first rate for random task orderings without replacement. The resulting rate $O(\min(1/\sqrt[4]{T},\, (d-\bar{r})/T))$ shows that randomization alone, without task repetition, prevents catastrophic forgetting in sufficiently long task sequences. Finally, we prove a matching $O(1/\sqrt[4]{k})$ forgetting rate for continual linear classification on separable data. Our universal rates extend to broader methods, such as block Kaczmarz and POCS, illuminating their loss convergence under i.i.d. and single-pass orderings.

preprint2026arXiv

Normalized Architectures are Natively 4-Bit

Training large language models at 4-bit precision is critical for efficiency. We show that nGPT, an architecture that constrains weights and hidden representations to the unit hypersphere, is inherently more robust to low-precision arithmetic. This removes the need for interventions-such as applying random Hadamard transforms and performing per-tensor scaling calculations-to preserve model quality, and it enables stable end-to-end NVFP4 training. We validate this approach on both a 1.2B dense model and hybrid (Mamba-Transformer) MoE models of up to 3B/30B parameters. We trace this robustness to the dot product: while quantization noise remains largely uncorrelated in both standard and normalized architectures, the signal behaves differently. In nGPT, the hypersphere constraint enhances weak positive correlations among the element-wise products, leading to a constructive accumulation of the signal across the hidden dimension while the noise continues to average out. This yields a higher effective signal-to-noise ratio and a flatter loss landscape, with the effect strengthening as the hidden dimension grows, suggesting increasing advantages at scale. A reference implementation is available at https://github.com/anonymous452026/ngpt-nvfp4

preprint2026arXiv

Retrieval from Within: An Intrinsic Capability of Attention-Based Models

Retrieval-augmented generation (RAG) typically treats retrieval and generation as separate systems. We ask whether an attention-based encoder-decoder can instead retrieve directly from its own internal representations. We introduce INTRA (INTrinsic Retrieval via Attention), a framework where decoder attention queries score pre-encoded evidence chunks that are then directly reused as context for generation. By construction, INTRA unifies retrieval and generation, eliminating the retriever-generator mismatch typical of RAG pipelines. This design also amortizes context encoding by reusing precomputed encoder states across queries. On question-answering benchmarks, INTRA outperforms strong engineered retrieval pipelines on both evidence recall and end-to-end answer quality. Our results demonstrate that attention-based models already possess a retrieval mechanism that can be elicited, rather than added as an external module.

preprint2026arXiv

Workspace Optimization: How to Train Your Agent

Modern agents built on frontier language models often cannot adapt their weights. What, then, remains trainable? We argue it is the agent's \emph{workspace}, the structured external substrate it reads, writes, and tests; we call its evolution workspace optimization. Workspace optimization targets hard multi-turn environments where a frontier model has strong priors but cannot solve the task in a single shot, so the agent must learn through interaction. We propose a principled way to evolve the workspace, mirroring the structure of weight-space training: artifacts in place of parameters, evidence in place of data, counterexamples in place of losses, and textual feedback in place of gradients. We instantiate the idea in DreamTeam, a multi-agent harness for ARC-AGI-3 whose roles build an executable world model, plan, hypothesize, probe, strategize, and route failures. On the current 25-game ARC-AGI-3 public set under the official scoring protocol and averaged over two independent runs, DreamTeam improves the SOTA protocol-matched agent's score from 36% to 38.4%, while using 31% fewer environment actions per game.

preprint2022arXiv

A statistical framework for efficient out of distribution detection in deep neural networks

Background. Commonly, Deep Neural Networks (DNNs) generalize well on samples drawn from a distribution similar to that of the training set. However, DNNs' predictions are brittle and unreliable when the test samples are drawn from a dissimilar distribution. This is a major concern for deployment in real-world applications, where such behavior may come at a considerable cost, such as industrial production lines, autonomous vehicles, or healthcare applications. Contributions. We frame Out Of Distribution (OOD) detection in DNNs as a statistical hypothesis testing problem. Tests generated within our proposed framework combine evidence from the entire network. Unlike previous OOD detection heuristics, this framework returns a $p$-value for each test sample. It is guaranteed to maintain the Type I Error (T1E - incorrectly predicting OOD for an actual in-distribution sample) for test data. Moreover, this allows to combine several detectors while maintaining the T1E. Building on this framework, we suggest a novel OOD procedure based on low-order statistics. Our method achieves comparable or better results than state-of-the-art methods on well-accepted OOD benchmarks, without retraining the network parameters or assuming prior knowledge on the test distribution -- and at a fraction of the computational cost.

preprint2022arXiv

How catastrophic can catastrophic forgetting be in linear regression?

To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when T tasks in d dimensions are presented cyclically for k iterations, we prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the T^2 factor can be lifted when tasks are presented in a random ordering.

preprint2022arXiv

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate - in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do not hold for monotone loss functions used for classification, such as the logistic loss. We prove our result on a fixed dataset, both for sampling with or without replacement. Furthermore, for logistic loss (and similar exponentially-tailed losses), we prove that with SGD the weight vector converges in direction to the $L_2$ max margin vector as $O(1/\log(t))$ for almost all separable datasets, and the loss converges as $O(1/t)$ - similarly to gradient descent. Lastly, we examine the case of a fixed learning rate proportional to the minibatch size. We prove that in this case, the asymptotic convergence rate of SGD (with replacement) does not depend on the minibatch size in terms of epochs, if the support vectors span the data. These results may suggest an explanation to similar behaviors observed in deep networks, when trained with SGD.

preprint2022arXiv

Training of Quantized Deep Neural Networks using a Magnetic Tunnel Junction-Based Synapse

Quantized neural networks (QNNs) are being actively researched as a solution for the computational complexity and memory intensity of deep neural networks. This has sparked efforts to develop algorithms that support both inference and training with quantized weight and activation values, without sacrificing accuracy. A recent example is the GXNOR framework for stochastic training of ternary (TNN) and binary (BNN) neural networks. In this paper, we show how magnetic tunnel junction (MTJ) devices can be used to support QNN training. We introduce a novel hardware synapse circuit that uses the MTJ stochastic behavior to support the quantize update. The proposed circuit enables processing near memory (PNM) of QNN training, which subsequently reduces data movement. We simulated MTJ-based stochastic training of a TNN over the MNIST, SVHN, and CIFAR10 datasets and achieved an accuracy of 98.61%, 93.99% and 82.71%, respectively (less than 1% degradation compared to the GXNOR algorithm). We evaluated the synapse array performance potential and showed that the proposed synapse circuit can train ternary networks in situ, with 18.3TOPs/W for feedforward and 3TOPs/W for weight update.

preprint2021arXiv

On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so called "rich regimes". However, the initialization structure is richer than the overall scale alone and involves relative magnitudes of different weights and layers in the network. Here we show that these relative scales, which we refer to as initialization shape, play an important role in determining the learned model. We develop a novel technique for deriving the inductive bias of gradient-flow and use it to obtain closed-form implicit regularizers for multiple cases of interest.

preprint2020arXiv

At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?

Background: Recent developments have made it possible to accelerate neural networks training significantly using large batch sizes and data parallelism. Training in an asynchronous fashion, where delay occurs, can make training even more scalable. However, asynchronous training has its pitfalls, mainly a degradation in generalization, even after convergence of the algorithm. This gap remains not well understood, as theoretical analysis so far mainly focused on the convergence rate of asynchronous methods. Contributions: We examine asynchronous training from the perspective of dynamical stability. We find that the degree of delay interacts with the learning rate, to change the set of minima accessible by an asynchronous stochastic gradient descent algorithm. We derive closed-form rules on how the learning rate could be changed, while keeping the accessible set the same. Specifically, for high delay values, we find that the learning rate should be kept inversely proportional to the delay. We then extend this analysis to include momentum. We find momentum should be either turned off, or modified to improve training stability. We provide empirical experiments to validate our theoretical findings.

preprint2020arXiv

Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural Network Initialization?

Deep neural networks are typically initialized with random weights, with variances chosen to facilitate signal propagation and stable gradients. It is also believed that diversity of features is an important property of these initializations. We construct a deep convolutional network with identical features by initializing almost all the weights to $0$. The architecture also enables perfect signal propagation and stable gradients, and achieves high accuracy on standard benchmarks. This indicates that random, diverse initializations are \textit{not} necessary for training neural networks. An essential element in training this network is a mechanism of symmetry breaking; we study this phenomenon and find that standard GPU operations, which are non-deterministic, can serve as a sufficient source of symmetry breaking to enable training.

preprint2020arXiv

Characterizing Implicit Bias in Terms of Optimization Geometry

We study the implicit bias of generic optimization methods, such as mirror descent, natural gradient descent, and steepest descent with respect to different potentials and norms, when optimizing underdetermined linear regression or separable linear classification problems. We explore the question of whether the specific global minimum (among the many possible global minima) reached by an algorithm can be characterized in terms of the potential or norm of the optimization geometry, and independently of hyperparameter choices such as step-size and momentum.

preprint2020arXiv

Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss. Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies (well beyond $10^{-100}$). Moreover, the implicit bias at reasonable initialization scales and training accuracies is more complex and not captured by these limits.

preprint2020arXiv

Is Feature Diversity Necessary in Neural Network Initialization?

Standard practice in training neural networks involves initializing the weights in an independent fashion. The results of recent work suggest that feature "diversity" at initialization plays an important role in training the network. However, other initialization schemes with reduced feature diversity have also been shown to be viable. In this work, we conduct a series of experiments aimed at elucidating the importance of feature diversity at initialization. We show that a complete lack of diversity is harmful to training, but its effects can be counteracted by a relatively small addition of noise - even the noise in standard non-deterministic GPU computations is sufficient. Furthermore, we construct a deep convolutional network with identical features at initialization and almost all of the weights initialized at 0 that can be trained to reach accuracy matching its standard-initialized counterpart.

preprint2020arXiv

Kernel and Rich Regimes in Overparametrized Models

A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a simple two-layer model that already exhibits an interesting and meaningful transition between the kernel and rich regimes, and we demonstrate the transition for more complex matrix factorization models and multilayer non-linear networks.

preprint2020arXiv

Kernel and Rich Regimes in Overparametrized Models

A recent line of work studies overparametrized neural networks in the "kernel regime," i.e. when the network behaves during training as a kernelized linear predictor, and thus training with gradient descent has the effect of finding the minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized multilayer networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach, we show how the scale of the initialization controls the transition between the "kernel" (aka lazy) and "rich" (aka active) regimes and affects generalization properties in multilayer homogeneous models. We also highlight an interesting role for the width of a model in the case that the predictor is not identically zero at initialization. We provide a complete and detailed analysis for a family of simple depth-$D$ models that already exhibit an interesting and meaningful transition between the kernel and rich regimes, and we also demonstrate this transition empirically for more complex matrix factorization models and multilayer non-linear networks.

preprint2020arXiv

The Knowledge Within: Methods for Data-Free Model Compression

Recently, an extensive amount of research has been focused on compressing and accelerating Deep Neural Networks (DNN). So far, high compression rate algorithms require part of the training dataset for a low precision calibration, or a fine-tuning process. However, this requirement is unacceptable when the data is unavailable or contains sensitive information, as in medical and biometric use-cases. We present three methods for generating synthetic samples from trained models. Then, we demonstrate how these samples can be used to calibrate and fine-tune quantized models without using any real data in the process. Our best performing method has a negligible accuracy degradation compared to the original training set. This method, which leverages intrinsic batch normalization layers' statistics of the trained model, can be used to evaluate data similarity. Our approach opens a path towards genuine data-free model compression, alleviating the need for training data during model deployment.

preprint2016arXiv

Binarized Neural Networks

We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time and when computing the parameters' gradient at train-time. We conduct two sets of experiments, each based on a different framework, namely Torch7 and Theano, where we train BNNs on MNIST, CIFAR-10 and SVHN, and achieve nearly state-of-the-art results. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power-efficiency. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available.

preprint2016arXiv

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

We introduce a method to train Binarized Neural Networks (BNNs) - neural networks with binary weights and activations at run-time. At training-time the binary weights and activations are used for computing the parameters gradients. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which is expected to substantially improve power-efficiency. To validate the effectiveness of BNNs we conduct two sets of experiments on the Torch7 and Theano frameworks. On both, BNNs achieved nearly state-of-the-art results over the MNIST, CIFAR-10 and SVHN datasets. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available on-line.

preprint2016arXiv

No bad local minima: Data independent training error guarantees for multilayer neural networks

We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.

preprint2016arXiv

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

We introduce a method to train Quantized Neural Networks (QNNs) --- neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves $51\%$ top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.

preprint2015arXiv

Training Binary Multilayer Neural Networks for Image Classification using Expectation Backpropagation

Compared to Multilayer Neural Networks with real weights, Binary Multilayer Neural Networks (BMNNs) can be implemented more efficiently on dedicated hardware. BMNNs have been demonstrated to be effective on binary classification tasks with Expectation BackPropagation (EBP) algorithm on high dimensional text datasets. In this paper, we investigate the capability of BMNNs using the EBP algorithm on multiclass image classification tasks. The performances of binary neural networks with multiple hidden layers and different numbers of hidden units are examined on MNIST. We also explore the effectiveness of image spatial filters and the dropout technique in BMNNs. Experimental results on MNIST dataset show that EBP can obtain 2.12% test error with binary weights and 1.66% test error with real weights, which is comparable to the results of standard BackPropagation algorithm on fully connected MNNs.

preprint2014arXiv

A shotgun sampling solution for the common input problem in neural connectivity inference

Inferring connectivity in neuronal networks remains a key challenge in statistical neuroscience. The `common input' problem presents the major roadblock: it is difficult to reliably distinguish causal connections between pairs of observed neurons from correlations induced by common input from unobserved neurons. Since available recording techniques allow us to sample from only a small fraction of large networks simultaneously with sufficient temporal resolution, naive connectivity estimators that neglect these common input effects are highly biased. This work proposes a `shotgun' experimental design, in which we observe multiple sub-networks briefly, in a serial manner. Thus, while the full network cannot be observed simultaneously at any given time, we may be able to observe most of it during the entire experiment. Using a generalized linear model for a spiking recurrent neural network, we develop scalable approximate Bayesian methods to perform network inference given this type of data, in which only a small fraction of the network is observed in each time bin. We demonstrate in simulation that, using this method: (1) The shotgun experimental design can eliminate the biases induced by common input effects. (2) Networks with thousands of neurons, in which only a small fraction of the neurons is observed in each time bin, could be quickly and accurately estimated. (3) Performance can be improved if we exploit prior information about the probability of having a connection between two neurons, its dependence on neuronal cell types (e.g., Dale's law), or its dependence on the distance between neurons.

preprint2014arXiv

A structured matrix factorization framework for large scale calcium imaging data analysis

We present a structured matrix factorization approach to analyzing calcium imaging recordings of large neuronal ensembles. Our goal is to simultaneously identify the locations of the neurons, demix spatially overlapping components, and denoise and deconvolve the spiking activity of each neuron from the slow dynamics of the calcium indicator. The matrix factorization approach relies on the observation that the spatiotemporal fluorescence activity can be expressed as a product of two matrices: a spatial matrix that encodes the location of each neuron in the optical field and a temporal matrix that characterizes the calcium concentration of each neuron over time. We present a simple approach for estimating the dynamics of the calcium indicator as well as the observation noise statistics from the observed data. These parameters are then used to set up the matrix factorization problem in a constrained form that requires no further parameter tuning. We discuss initialization and post-processing techniques that enhance the performance of our method, along with efficient and largely parallelizable algorithms. We apply our method to {\it in vivo} large scale multi-neuronal imaging data and also demonstrate how similar methods can be used for the analysis of {\it in vivo} dendritic imaging data.

preprint2014arXiv

Spiking input-output relation for general biophysical neuron models

Cortical neurons include many sub-cellular processes, operating at multiple timescales, which may affect their response to stimulation through non-linear and stochastic interaction with ion channels and ionic concentrations. Since new processes are constantly being discovered, biophysical neuron models increasingly become "too complex to be useful" yet "too simple to be realistic". A fundamental open question in theoretical neuroscience pertains to how this deadlock may be resolved. In order to tackle this problem, we first define the notion of a "excitable neuron model". Then we analytically derive the input-output relation of such neuronal models, relating input spike trains to output spikes based on known biophysical properties. Thus we obtain closed-form expressions for the mean firing rates, all second order statistics (input-state-output correlation and spectra) and construct optimal linear estimators for the neuronal response and internal state. These results are guaranteed to hold, given a few generic assumptions, for any stochastic biophysical neuron model (with an arbitrary number of slow kinetic processes) under general sparse stimulation. This solution suggests that the common simplifying approach that ignores much of the complexity of the neuron might actually be unnecessary and even deleterious in some cases. Specifically, the stochasticity of ion channels and the temporal sparseness of inputs is exactly what rendered our analysis tractable, allowing us to incorporate slow kinetics.

preprint2014arXiv

The neuron's response at extended timescales

Many systems are modulated by unknown slow processes. This hinders analysis in highly non-linear systems, such as excitable systems. We show that for such systems, if the input matches the sparse `spiky' nature of the output, the spiking input-output relation can be derived. We use this relation to reproduce and interpret the irregular and complex 1/f response observed in isolated neurons stimulated over days. We decompose the neuronal response into contributions from its long history of internal noise and its short (few minutes) history of inputs, quantifying memory, noise and stability.

preprint2013arXiv

Mean Field Bayes Backpropagation: scalable training of multilayer neural networks with binary weights

Significant success has been reported recently using deep neural networks for classification. Such large networks can be computationally intensive, even after training is over. Implementing these trained networks in hardware chips with a limited precision of synaptic weights may improve their speed and energy efficiency by several orders of magnitude, thus enabling their integration into small and low-power electronic devices. With this motivation, we develop a computationally efficient learning algorithm for multilayer neural networks with binary weights, assuming all the hidden neurons have a fan-out of one. This algorithm, derived within a Bayesian probabilistic online setting, is shown to work well for both synthetic and real-world problems, performing comparably to algorithms with real-valued weights, while retaining computational tractability.

preprint2013arXiv

Slow dynamics of neuronal excitability under pulse stimulation

Neurons fire irregularly on multiple timescales when stimulated with a periodic pulse train. This raises two questions: Does this irregularity imply significant intrinsic stochasticity? Can existing neuron models be readily extended to describe behavior at long timescales? We show here that for commonly studied neuronal models, dynamics is not chaotic and can only produce stable and periodic firing patterns. This is done by transforming the neuron model to an analytically tractable piecewise linear discrete map. Thus we answer "yes" and "no" to the above questions, respectively.

preprint2012arXiv

An exact reduction of the master equation to a strictly stable system with an explicit expression for the stationary distribution

The evolution of a continuous time Markov process with a finite number of states is usually calculated by the Master equation - a linear differential equations with a singular generator matrix. We derive a general method for reducing the dimensionality of the Master equation by one by using the probability normalization constraint, thus obtaining a affine differential equation with a (non-singular) stable generator matrix. Additionally, the reduced form yields a simple explicit expression for the stationary probability distribution, which is usually derived implicitly. Finally, we discuss the application of this method to stochastic differential equations.

preprint2011arXiv

Simple, Fast and Accurate Implementation of the Diffusion Approximation Algorithm for Stochastic Ion Channels with Multiple States

The phenomena that emerge from the interaction of the stochastic opening and closing of ion channels (channel noise) with the non-linear neural dynamics are essential to our understanding of the operation of the nervous system. The effects that channel noise can have on neural dynamics are generally studied using numerical simulations of stochastic models. Algorithms based on discrete Markov Chains (MC) seem to be the most reliable and trustworthy, but even optimized algorithms come with a non-negligible computational cost. Diffusion Approximation (DA) methods use Stochastic Differential Equations (SDE) to approximate the behavior of a number of MCs, considerably speeding up simulation times. However, model comparisons have suggested that DA methods did not lead to the same results as in MC modeling in terms of channel noise statistics and effects on excitability. Recently, it was shown that the difference arose because MCs were modeled with coupled activation subunits, while the DA was modeled using uncoupled activation subunits. Implementations of DA with coupled subunits, in the context of a specific kinetic scheme, yielded similar results to MC. However, it remained unclear how to generalize these implementations to different kinetic schemes, or whether they were faster than MC algorithms. Additionally, a steady state approximation was used for the stochastic terms, which, as we show here, can introduce significant inaccuracies. We derived the SDE explicitly for any given ion channel kinetic scheme. The resulting generic equations were surprisingly simple and interpretable - allowing an easy and efficient DA implementation. The algorithm was tested in a voltage clamp simulation and in two different current clamp simulations, yielding the same results as MC modeling. Also, the simulation efficiency of this DA method demonstrated considerable superiority over MC methods.

preprint2010arXiv

History dependent dynamics in a generic model of ion channels - an analytic study

Recent experiments have demonstrated that the timescale of adaptation of single neurons and ion channel populations to stimuli slows down as the length of stimulation increases; in fact, no upper bound on temporal time-scales seems to exist in such systems. Furthermore, patch clamp experiments on single ion channels have hinted at the existence of large, mostly unobservable, inactivation state spaces within a single ion channel. This raises the question of the relation between this multitude of inactivation states and the observed behavior. In this work we propose a minimal model for ion channel dynamics which does not assume any specific structure of the inactivation state space. The model is simple enough to render an analytical study possible. This leads to a clear and concise explanation of the experimentally observed exponential history-dependent relaxation in sodium channels in a voltage clamp setting, and shows that their recovery rate from slow inactivation must be voltage dependent. Furthermore, we predict that history-dependent relaxation cannot be created by overly sparse spiking activity. While the model was created with ion channel populations in mind, its simplicity and genericalness render it a good starting point for modeling similar effects in other systems, and for scaling up to higher levels such as single neurons which are also known to exhibit multiple time scales.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint

Fields this researcher appears in

Source provenance

Where this author record came from

arxivconfidence 95%

external id: arxiv:2504.04579:author:6:daniel-soudry

Imported May 21, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2605.06067:author:4:daniel-soudry

Imported May 20, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2605.05806:author:5:daniel-soudry

Imported May 20, 2026Synced May 21, 2026

arxivconfidence 95%

external id: arxiv:2605.09650:author:4:daniel-soudry

Imported May 20, 2026Synced May 21, 2026

7 works

Nathan Srebro

Researcher

Nathan Srebro contributes to research discovery and scholarly infrastructure.

Open to collaborate

6 works

Ron Meir

Researcher

Ron Meir contributes to research discovery and scholarly infrastructure.

Open to collaborate

5 works

Edward Moroshko

Researcher

Edward Moroshko contributes to research discovery and scholarly infrastructure.

Open to collaborate

5 works

Itay Hubara

Researcher

Itay Hubara contributes to research discovery and scholarly infrastructure.

Open to collaborate

Daniel Soudry

What is connected

Connect this record

See the researcher in context

Building this map preview

31 published item(s)

From Continual Learning to SGD and Back: Better Rates for Continual Linear Models

Normalized Architectures are Natively 4-Bit

Retrieval from Within: An Intrinsic Capability of Attention-Based Models

Workspace Optimization: How to Train Your Agent

A statistical framework for efficient out of distribution detection in deep neural networks

How catastrophic can catastrophic forgetting be in linear regression?

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

Training of Quantized Deep Neural Networks using a Magnetic Tunnel Junction-Based Synapse

On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?

Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural Network Initialization?

Characterizing Implicit Bias in Terms of Optimization Geometry

Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

Is Feature Diversity Necessary in Neural Network Initialization?

Kernel and Rich Regimes in Overparametrized Models

Kernel and Rich Regimes in Overparametrized Models

The Knowledge Within: Methods for Data-Free Model Compression

Binarized Neural Networks

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

No bad local minima: Data independent training error guarantees for multilayer neural networks

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

Training Binary Multilayer Neural Networks for Image Classification using Expectation Backpropagation

A shotgun sampling solution for the common input problem in neural connectivity inference

A structured matrix factorization framework for large scale calcium imaging data analysis

Spiking input-output relation for general biophysical neuron models

The neuron's response at extended timescales

Mean Field Bayes Backpropagation: scalable training of multilayer neural networks with binary weights

Slow dynamics of neuronal excitability under pulse stimulation

An exact reduction of the master equation to a strictly stable system with an explicit expression for the stationary distribution

Simple, Fast and Accurate Implementation of the Diffusion Approximation Algorithm for Stochastic Ion Channels with Multiple States

History dependent dynamics in a generic model of ion channels - an analytic study