Researcher profile

Aurelien Lucchi

Aurelien Lucchi contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
10works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

10 published item(s)

preprint2022arXiv

A Full $w$CDM Analysis of KiDS-1000 Weak Lensing Maps using Deep Learning

We present a full forward-modeled $w$CDM analysis of the KiDS-1000 weak lensing maps using graph-convolutional neural networks (GCNN). Utilizing the $\texttt{CosmoGrid}$, a novel massive simulation suite spanning six different cosmological parameters, we generate almost one million tomographic mock surveys on the sphere. Due to the large data set size and survey area, we perform a spherical analysis while limiting our map resolution to $\texttt{HEALPix}$ $n_\mathrm{side}=512$. We marginalize over systematics such as photometric redshift errors, multiplicative calibration and additive shear bias. Furthermore, we use a map-level implementation of the non-linear intrinsic alignment model along with a novel treatment of baryonic feedback to incorporate additional astrophysical nuisance parameters. We also perform a spherical power spectrum analysis for comparison. The constraints of the cosmological parameters are generated using a likelihood free inference method called Gaussian Process Approximate Bayesian Computation (GPABC). Finally, we check that our pipeline is robust against choices of the simulation parameters. We find constraints on the degeneracy parameter of $S_8 \equiv σ_8\sqrt{Ω_M/0.3} = 0.78^{+0.06}_{-0.06}$ for our power spectrum analysis and $S_8 = 0.79^{+0.05}_{-0.05}$ for our GCNN analysis, improving the former by 16%. This is consistent with earlier analyses of the 2-point function, albeit slightly higher. Baryonic corrections generally broaden the constraints on the degeneracy parameter by about 10%. These results offer great prospects for full machine learning based analyses of on-going and future weak lensing surveys.

preprint2022arXiv

Phenomenology of Double Descent in Finite-Width Neural Networks

`Double descent' delineates the generalization behaviour of models depending on the regime they belong to: under- or over-parameterized. The current theoretical understanding behind the occurrence of this phenomenon is primarily based on linear and kernel regression models -- with informal parallels to neural networks via the Neural Tangent Kernel. Therefore such analyses do not adequately capture the mechanisms behind double descent in finite-width neural networks, as well as, disregard crucial components -- such as the choice of the loss function. We address these shortcomings by leveraging influence functions in order to derive suitable expressions of the population loss and its lower bound, while imposing minimal assumptions on the form of the parametric model. Our derived bounds bear an intimate connection with the spectrum of the Hessian at the optimum, and importantly, exhibit a double descent behaviour at the interpolation threshold. Building on our analysis, we further investigate how the loss function affects double descent -- and thus uncover interesting properties of neural networks and their Hessian spectra near the interpolation threshold.

preprint2022arXiv

Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has been recently shown that stacking self-attention layers - the distinctive architectural component of Transformers - can result in rank collapse of the tokens' representations at initialization. The question of if and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and the effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods for Transformers' optimization.

preprint2021arXiv

Momentum Improves Optimization on Riemannian Manifolds

We develop a new Riemannian descent algorithm that relies on momentum to improve over existing first-order methods for geodesically convex optimization. In contrast, accelerated convergence rates proved in prior work have only been shown to hold for geodesically strongly-convex objective functions. We further extend our algorithm to geodesically weakly-quasi-convex objectives. Our proofs of convergence rely on a novel estimate sequence that illustrates the dependency of the convergence rate on the curvature of the manifold. We validate our theoretical results empirically on several optimization problems defined on the sphere and on the manifold of positive definite matrices.

preprint2020arXiv

A Continuous-time Perspective for Modeling Acceleration in Riemannian Optimization

We propose a novel second-order ODE as the continuous-time limit of a Riemannian accelerated gradient-based method on a manifold with curvature bounded from below. This ODE can be seen as a generalization of the ODE derived for Euclidean spaces, and can also serve as an analysis tool. We study the convergence behavior of this ODE for different classes of functions, such as geodesically convex, strongly-convex and weakly-quasi-convex. We demonstrate how such an ODE can be discretized using a semi-implicit and Nesterov-inspired numerical integrator, that empirically yields stable algorithms which are faithful to the continuous-time analysis and exhibit accelerated convergence.

preprint2020arXiv

A domain agnostic measure for monitoring and evaluating GANs

Generative Adversarial Networks (GANs) have shown remarkable results in modeling complex distributions, but their evaluation remains an unsettled issue. Evaluations are essential for: (i) relative assessment of different models and (ii) monitoring the progress of a single model throughout training. The latter cannot be determined by simply inspecting the generator and discriminator loss curves as they behave non-intuitively. We leverage the notion of duality gap from game theory to propose a measure that addresses both (i) and (ii) at a low computational cost. Extensive experiments show the effectiveness of this measure to rank different GAN models and capture the typical GAN failure scenarios, including mode collapse and non-convergent behaviours. This evaluation metric also provides meaningful monitoring on the progression of the loss during training. It highly correlates with FID on natural image datasets, and with domain specific scores for text, sound and cosmology data where FID is not directly suitable. In particular, our proposed metric requires no labels or a pretrained classifier, making it domain agnostic.

preprint2020arXiv

An Accelerated DFO Algorithm for Finite-sum Convex Functions

Derivative-free optimization (DFO) has recently gained a lot of momentum in machine learning, spawning interest in the community to design faster methods for problems where gradients are not accessible. While some attention has been given to the concept of acceleration in the DFO literature, existing stochastic algorithms for objective functions with a finite-sum structure have not been shown theoretically to achieve an accelerated rate of convergence. Algorithms that use acceleration in such a setting are prone to instabilities, making it difficult to reach convergence. In this work, we exploit the finite-sum structure of the objective in order to design a variance-reduced DFO algorithm that provably yields acceleration. We prove rates of convergence for both smooth convex and strongly-convex finite-sum objective functions. Finally, we validate our theoretical results empirically on several tasks and datasets.

preprint2020arXiv

Batch Normalization Provably Avoids Rank Collapse for Randomly Initialised Deep Networks

Randomly initialized neural networks are known to become harder to train with increasing depth, unless architectural enhancements like residual connections and batch normalization are used. We here investigate this phenomenon by revisiting the connection between random initialization in deep networks and spectral instabilities in products of random matrices. Given the rich literature on random matrices, it is not surprising to find that the rank of the intermediate representations in unnormalized networks collapses quickly with depth. In this work we highlight the fact that batch normalization is an effective strategy to avoid rank collapse for both linear and ReLU networks. Leveraging tools from Markov chain theory, we derive a meaningful lower rank bound in deep linear networks. Empirically, we also demonstrate that this rank robustness generalizes to ReLU nets. Finally, we conduct an extensive set of experiments on real-world data sets, which confirm that rank stability is indeed a crucial condition for training modern-day deep neural architectures.

preprint2020arXiv

Continuous-time Models for Stochastic Optimization Algorithms

We propose new continuous-time formulations for first-order stochastic optimization algorithms such as mini-batch gradient descent and variance-reduced methods. We exploit these continuous-time models, together with simple Lyapunov analysis as well as tools from stochastic calculus, in order to derive convergence bounds for various types of non-convex functions. Guided by such analysis, we show that the same Lyapunov arguments hold in discrete-time, leading to matching rates. In addition, we use these models and Ito calculus to infer novel insights on the dynamics of SGD, proving that a decreasing learning rate acts as time warping or, equivalently, as landscape stretching.

preprint2020arXiv

The Role of Memory in Stochastic Optimization

The choice of how to retain information about past gradients dramatically affects the convergence properties of state-of-the-art stochastic optimization methods, such as Heavy-ball, Nesterov's momentum, RMSprop and Adam. Building on this observation, we use stochastic differential equations (SDEs) to explicitly study the role of memory in gradient-based algorithms. We first derive a general continuous-time model that can incorporate arbitrary types of memory, for both deterministic and stochastic settings. We provide convergence guarantees for this SDE for weakly-quasi-convex and quadratically growing functions. We then demonstrate how to discretize this SDE to get a flexible discrete-time algorithm that can implement a board spectrum of memories ranging from short- to long-term. Not only does this algorithm increase the degrees of freedom in algorithmic choice for practitioners but it also comes with better stability properties than classical momentum in the convex stochastic setting. In particular, no iterate averaging is needed for convergence. Interestingly, our analysis also provides a novel interpretation of Nesterov's momentum as stable gradient amplification and highlights a possible reason for its unstable behavior in the (convex) stochastic setting. Furthermore, we discuss the use of long term memory for second-moment estimation in adaptive methods, such as Adam and RMSprop. Finally, we provide an extensive experimental study of the effect of different types of memory in both convex and nonconvex settings.