Researcher profile

Jonas Kohler

Jonas Kohler contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 13 - Unverified | Verification L1 | Unclaimed author
2 works
0 followers
2 topics
4 close collaborators

Published work

2 published items

Preprint, 2020, arXiv

Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks

Randomly initialized neural networks are known to become harder to train with increasing depth, unless architectural enhancements like residual connections and batch normalization are used. We here investigate this phenomenon by revisiting the connection between random initialization in deep networks and spectral instabilities in products of random matrices. Given the rich literature on random matrices, it is not surprising to find that the rank of the intermediate representations in unnormalized networks collapses quickly with depth. In this work we highlight the fact that batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks. Leveraging tools from Markov chain theory, we derive a meaningful lower rank bound in deep linear networks. Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets. Finally, we conduct an extensive set of experiments on real-world data sets, which confirm that rank stability is indeed a crucial condition for training modern-day deep neural architectures.
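The rank-collapse effect described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes a deep linear network with square Gaussian weight matrices, parameter-free batch normalization (each feature standardized across the batch), and a trace-based surrogate tr(M)^2 / ||M||_F^2 with M = H H^T, which never exceeds the true rank of H. All function names, the width, the batch size and the depth are illustrative choices.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def batch_norm(h, eps=1e-12):
    """Standardize each feature (row) across the batch (columns).

    Assumption: no learned scale or shift, unlike practical BN layers.
    """
    out = []
    for row in h:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        out.append([(x - mean) / (var + eps) ** 0.5 for x in row])
    return out


def soft_rank(h):
    """Return tr(M)^2 / ||M||_F^2 for M = H H^T.

    The value lies between 1 and rank(H) and is unchanged by rescaling H.
    """
    gram = matmul(h, [list(col) for col in zip(*h)])
    trace = sum(gram[i][i] for i in range(len(gram)))
    # Dividing by the trace first keeps the numbers well scaled.
    return 1.0 / sum((g / trace) ** 2 for row in gram for g in row)


def forward(depth, width, batch, use_bn, seed=0):
    """Push a random batch through `depth` random linear layers."""
    rng = random.Random(seed)
    h = [[rng.gauss(0.0, 1.0) for _ in range(batch)] for _ in range(width)]
    for _ in range(depth):
        # Fresh Gaussian weights with variance 1 / width at every layer.
        w = [[rng.gauss(0.0, width ** -0.5) for _ in range(width)]
             for _ in range(width)]
        h = matmul(w, h)
        if use_bn:
            h = batch_norm(h)
    return h


plain_rank = soft_rank(forward(depth=1000, width=8, batch=8, use_bn=False))
bn_rank = soft_rank(forward(depth=1000, width=8, batch=8, use_bn=True))
print("soft rank without batch normalization:", plain_rank)
print("soft rank with batch normalization:   ", bn_rank)
```

Under these assumptions, the unnormalized chain should end up with a surrogate value very close to 1, meaning the representations are effectively rank one, while the normalized chain should stay above that. The depth is deliberately large because a network this narrow needs many layers before the collapse becomes numerically obvious.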

Preprint, 2020, arXiv

The Role of Memory in Stochastic Optimization

The choice of how to retain information about past gradients dramatically affects the convergence properties of state-of-the-art stochastic optimization methods, such as Heavy-ball, Nesterov's momentum, RMSprop and Adam. Building on this observation, we use stochastic differential equations (SDEs) to explicitly study the role of memory in gradient-based algorithms. We first derive a general continuous-time model that can incorporate arbitrary types of memory, for both deterministic and stochastic settings. We provide convergence guarantees for this SDE for weakly-quasi-convex and quadratically growing functions. We then demonstrate how to discretize this SDE to get a flexible discrete-time algorithm that can implement a broad spectrum of memories ranging from short- to long-term. Not only does this algorithm increase the degrees of freedom in algorithmic choice for practitioners but it also comes with better stability properties than classical momentum in the convex stochastic setting. In particular, no iterate averaging is needed for convergence. Interestingly, our analysis also provides a novel interpretation of Nesterov's momentum as stable gradient amplification and highlights a possible reason for its unstable behavior in the (convex) stochastic setting. Furthermore, we discuss the use of long-term memory for second-moment estimation in adaptive methods, such as Adam and RMSprop. Finally, we provide an extensive experimental study of the effect of different types of memory in both convex and nonconvex settings.
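To make the notion of "memory" in this abstract concrete, the sketch below shows how the classical heavy-ball momentum buffer can be read as a weighted sum over all past gradients. It illustrates only that textbook recursion, not the SDE-based algorithm the paper derives. The function names, the decay parameter `beta` and the sample gradients are assumptions made for illustration.

```python
def momentum_buffer(grads, beta):
    """Recursive form of the buffer: m_k = beta * m_{k-1} + g_k."""
    m = 0.0
    history = []
    for g in grads:
        m = beta * m + g
        history.append(m)
    return history


def weighted_sum_of_past(grads, beta):
    """Unrolled form: m_k = sum over i <= k of beta**(k - i) * g_i."""
    return [
        sum(beta ** (k - i) * grads[i] for i in range(k + 1))
        for k in range(len(grads))
    ]


grads = [1.0, -0.5, 2.0, 0.25]  # made-up gradient values

# beta = 0 forgets everything but the latest gradient (no memory),
# beta = 1 weights every past gradient equally (long-term memory),
# values in between decay old gradients geometrically.
for beta in (0.0, 0.9, 1.0):
    print(beta, momentum_buffer(grads, beta))
```

Both functions compute the same sequence up to floating-point rounding, so the single scalar `beta` fully determines how quickly old gradients are forgotten in this scheme. Read that way, the abstract's "spectrum of memories" corresponds to replacing the geometric weights with other weighting profiles.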