Source author record

Anirbit Mukherjee

Anirbit Mukherjee appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Machine Learning, Information Theory, math.IT, astro-ph.CO, gr-qc, hep-th, math.FA, math.PR, physics.data-an

Catalog footprint

What is connected

6 works

9 topics

4 close collaborators

Preprint, 2026, arXiv

Convergent Stochastic Training of Attention and Understanding LoRA

Transformers have revolutionized machine learning, and deploying attention layers in the model is increasingly standard across a myriad of applications. Further, for large models, it is common to implement Low Rank Adaptation (LoRA), whereby a factorized parameterization of them is trained, to achieve a surprisingly beneficial accuracy-size trade-off. In this work, via a unified framework we rigorously establish trainability of such models under stochastic methods. We prove that for any mild regularization, the empirical regression loss on an attention layer and LoRA on a shallow neural net both induce a Poincaré inequality for the corresponding Gibbs measure. It then follows, by invoking recent results, that a certain SDE, which mimics SGD, minimizes the corresponding losses. In both cases, our first-of-its-kind results on the trainability of attention and nets do not rely on any assumptions on the data or the size of the architecture.

Preprint, 2022, arXiv

An Empirical Study of the Occurrence of Heavy-Tails in Training a ReLU Gate

A particular direction of recent advance about stochastic deep-learning algorithms has been about uncovering a rather mysterious heavy-tailed nature of the stationary distribution of these algorithms, even when the data distribution is not so. Moreover, the heavy-tail index is known to show interesting dependence on the input dimension of the net, the mini-batch size and the step size of the algorithm. In this short note, we undertake an experimental study of this index for S.G.D. while training a ReLU gate (in the realizable and in the binary classification setup) and for a variant of S.G.D. that was proven in Karmakar and Mukherjee (2022) for ReLU realizable data. From our experiments we conjecture that these two algorithms have similar heavy-tail behaviour on any data where the latter can be proven to converge. Secondly, we demonstrate that the heavy-tail index of the late time iterates in this model scenario has strikingly different properties than either what has been proven for linear hypothesis classes or what has been previously demonstrated for large nets.

Preprint, 2022, arXiv

Depth-2 Neural Networks Under a Data-Poisoning Attack

In this work, we study the possibility of defending against data-poisoning attacks while training a shallow neural network in a regression setup. We focus on doing supervised learning for a class of depth-2 finite-width neural networks, which includes single-filter convolutional networks. In this class of networks, we attempt to learn the network weights in the presence of a malicious oracle doing stochastic, bounded and additive adversarial distortions on the true output during training. For the non-gradient stochastic algorithm that we construct, we prove worst-case near-optimal trade-offs among the magnitude of the adversarial attack, the weight approximation accuracy, and the confidence achieved by the proposed algorithm. As our algorithm uses mini-batching, we analyze how the mini-batch size affects convergence. We also show how to utilize the scaling of the outer layer weights to counter output-poisoning attacks depending on the probability of attack. Lastly, we give experimental evidence demonstrating how our algorithm outperforms stochastic gradient descent under different input data distributions, including instances of heavy-tailed distributions.

Preprint, 2022, arXiv

Provable Training of a ReLU Gate with an Iterative Non-Gradient Algorithm

In this work, we demonstrate provable guarantees on the training of a single ReLU gate in hitherto unexplored regimes. We give a simple iterative stochastic algorithm that can train a ReLU gate in the realizable setting in linear time while using significantly milder conditions on the data distribution than previous such results. Leveraging certain additional moment assumptions, we also show a first-of-its-kind approximate recovery of the true label generating parameters under an (online) data-poisoning attack on the true labels, while training a ReLU gate by the same algorithm. Our guarantee is shown to be nearly optimal in the worst case and its accuracy of recovering the true weight degrades gracefully with increasing probability of attack and its magnitude. For both the realizable and the non-realizable cases as outlined above, our analysis allows for mini-batching and computes how the convergence time scales with the mini-batch size. We corroborate our theorems with simulation results which also bring to light a striking similarity in trajectories between our algorithm and the popular S.G.D. algorithm - for which similar guarantees as here are still unknown.

Preprint, 2015, arXiv

N-point correlations of dark matter tracers : Renormalization with univariate biasing and its O(f_{NL}) terms with bivariate biasing

In this note we extend the results of the model of galaxy biasing recently proposed in (http://arxiv.org/abs/1212.0868v2). In that paper the authors outlined a very precise mathematical framework to deal with the theory of galaxy biasing. In this work we extend that combinatorial technology to renormalize tracer n-point functions in their model of univariate biasing. We further prove that four- and higher-point cumulants of the Bardeen potential do not have an O(f_{NL}) term. Then we use this observation to extract all the O(f_{NL}) terms in the n-point correlation of tracers in their model of bivariate biasing.

Preprint, 2015, arXiv

Renyi entropy of the critical O(N) model

In this article we explore a certain definition of "alternate quantization" for the critical O(N) model. We elaborate on a prescription to evaluate the Renyi entropy of the alternately quantized critical O(N) model. We show that there exist new saddles of the q-Renyi free energy functional corresponding to putting certain combinations of the Kaluza-Klein modes into alternate quantization. This leads us to an analysis that tries to determine the true state of the theory by ascertaining the global minimum among these saddle points.
