Researcher profile

Léopold Cambier

Léopold Cambier contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
7works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

7 published item(s)

preprint2021arXiv

Towards a Scalable Hierarchical High-order CFD Solver

Development of highly scalable and robust algorithms for large-scale CFD simulations has been identified as one of the key ingredients to achieve NASA's CFD Vision 2030 goals. In order to improve simulation capability and to effectively leverage new high-performance computing hardware, the most computationally intensive parts of CFD solution algorithms -- namely, linear solvers and preconditioners -- need to achieve asymptotic behavior on massively parallel and heterogeneous architectures and preserve convergence rates as the meshes are refined further. In this work, we present a scalable high-order implicit Discontinuous Galerkin solver from the SU2 framework using a promising preconditioning technique based on algebraic sparsified nested dissection algorithm with low-rank approximations, and communication-avoiding Krylov subspace methods to enable scalability with very large processor counts. The overall approach is tested on a canonical 2D NACA0012 test case of increasing size to demonstrate its scalability on multiple processing cores. Both the preconditioner and the linear solver are shown to exhibit near-linear weak scaling up to 2,048 cores with no significant degradation of the convergence rate.

preprint2020arXiv

An Algebraic Sparsified Nested Dissection Algorithm Using Low-Rank Approximations

We propose a new algorithm for the fast solution of large, sparse, symmetric positive-definite linear systems, spaND -- sparsified Nested Dissection. It is based on nested dissection, sparsification and low-rank compression. After eliminating all interiors at a given level of the elimination tree, the algorithm sparsifies all separators corresponding to the interiors. This operation reduces the size of the separators by eliminating some degrees of freedom but without introducing any fill-in. This is done at the expense of a small and controllable approximation error. The result is an approximate factorization that can be used as an efficient preconditioner. We then perform several numerical experiments to evaluate this algorithm. We demonstrate that a version using orthogonal factorization and block-diagonal scaling takes fewer CG iterations to converge than previous similar algorithms on various kinds of problems. Furthermore, this algorithm is provably guaranteed to never break down and the matrix stays symmetric positive-definite throughout the process. We evaluate the algorithm on some large problems and show it exhibits near-linear scaling. The factorization time is roughly O(N) and the number of iterations grows slowly with N.

preprint2020arXiv

Second Order Accurate Hierarchical Approximate Factorization of Sparse SPD Matrices

We describe a second-order accurate approach to sparsifying the off-diagonal blocks in the hierarchical approximate factorizations of sparse symmetric positive definite matrices. The norm of the error made by the new approach depends quadratically, not linearly, on the error in the low-rank approximation of the given block. The analysis of the resulting two-level preconditioner shows that the preconditioner is second-order accurate as well. We incorporate the new approach into the recent Sparsified Nested Dissection algorithm [SIAM J. Matrix Anal. Appl., 41 (2020), pp. 715-746], and test it on a wide range of problems. The new approach halves the number of Conjugate Gradient iterations needed for convergence, with almost the same factorization complexity, improving the total runtimes of the algorithm. Our approach can be incorporated into other rank-structured methods for solving sparse linear systems.

preprint2020arXiv

Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks

Training with larger number of parameters while keeping fast iterations is an increasingly adopted strategy and trend for developing better performing Deep Neural Network (DNN) models. This necessitates increased memory footprint and computational requirements for training. Here we introduce a novel methodology for training deep neural networks using 8-bit floating point (FP8) numbers. Reduced bit precision allows for a larger effective memory and increased computational speed. We name this method Shifted and Squeezed FP8 (S2FP8). We show that, unlike previous 8-bit precision training methods, the proposed method works out-of-the-box for representative models: ResNet-50, Transformer and NCF. The method can maintain model accuracy without requiring fine-tuning loss scaling parameters or keeping certain layers in single precision. We introduce two learnable statistics of the DNN tensors - shifted and squeezed factors that are used to optimally adjust the range of the tensors in 8-bits, thus minimizing the loss in information due to quantization.

preprint2020arXiv

TaskTorrent: a Lightweight Distributed Task-Based Runtime System in C++

We present TaskTorrent, a lightweight distributed task-based runtime in C++. TaskTorrent uses a parametrized task graph to express the task DAG, and one-sided active messages to trigger remote tasks asynchronously. As a result the task DAG is completely distributed and discovered in parallel. It is a C++14 library and only depends on MPI. We explain the API and the implementation. We perform a series of benchmarks against StarPU and ScaLAPACK. Micro benchmarks show it has a minimal overhead compared to other solutions. We then apply it to two large linear algebra problems. TaskTorrent scales very well to thousands of cores, exhibiting good weak and strong scalings.

preprint2020arXiv

The Index of Invariance and its Implications for a Parameterized Least Squares Problem

We study the problem $x_{b,ω} := \text{arg min}_{x \in \mathcal{S}} \|(A + ωI)^{-1/2} (b - Ax)\|_2$, with $A = A^*$, for a subspace $\mathcal{S}$ of $\mathbb{F}^n$ ($\mathbb{F} = \mathbb{R}$ or $\mathbb{C}$), and $ω> -λ_{min}(A)$. We show that there exists a subspace $\mathcal{Y}$ of $\mathbb{F}^n$, independent of $b$, such that $\{x_{b,ω} - x_{b,μ} \mid ω,μ> -λ_{min}(A)\} \subseteq \mathcal{Y}$, where $\dim(\mathcal{Y}) \leq \dim(\mathcal{S} + A\mathcal{S}) - \dim(\mathcal{S}) = \mathbf{Ind}_A(\mathcal{S})$, a quantity which we call the index of invariance of $\mathcal{S}$ with respect to $A$. In particular if $\mathcal{S}$ is a Krylov subspace, this implies the low dimensionality result of Hallman & Gu (2018). The problem is also such that when $A$ is positive and $\mathcal{S}$ is a Krylov subspace, it reduces to CG for $ω= 0$ and to MINRES for $ω\to \infty$. We study several properties of $\mathbf{Ind}_A(\mathcal{S})$ in relation to $A$ and $\mathcal{S}$. We show that the dimension of the affine subspace $\mathcal{X}_b$ containing the solutions $x_{b,ω}$ can be smaller than $\mathbf{Ind}_A(\mathcal{S})$ for all $b$. However, we also exhibit some sufficient conditions on $A$ and $\mathcal{S}$, under which $\mathcal{X} := \text{Span}{\{x_{b,ω} - x_{b,μ} \mid b \in \mathbb{F}^n, ω,μ> -λ_{min}(A)\}}$ has dimension equal to $\mathbf{Ind}_A(\mathcal{S})$. We then study the injectivity of the map $ω\mapsto x_{b,ω}$, leading us to a proof of the convexity result from Hallman & Gu (2018). We finish by showing that sets such as $M(\mathcal{S},\mathcal{S}') = \{A \in \mathbb{F}^{n \times n} \mid \mathcal{S} + A\mathcal{S} = \mathcal{S}'\}$, for nested subspaces $\mathcal{S} \subseteq \mathcal{S}' \subseteq \mathbb{F}^n$, form smooth real manifolds, and explore some topological relationships between them.

preprint2019arXiv

Fast Low-Rank Kernel Matrix Factorization through Skeletonized Interpolation

Integral equations are commonly encountered when solving complex physical problems. Their discretization leads to a dense kernel matrix that is block or hierarchically low-rank. This paper proposes a new way to build a low-rank factorization of those low-rank blocks at a nearly optimal cost of $\mathcal{O}(nr)$ for a $n \times n$ block submatrix of rank r. This is done by first sampling the kernel function at new interpolation points, then selecting a subset of those using a CUR decomposition and finally using this reduced set of points as pivots for a RRLU-type factorization. We also explain how this implicitly builds an optimal interpolation basis for the Kernel under consideration. We show the asymptotic convergence of the algorithm, explain his stability and demonstrate on numerical examples that it performs very well in practice, allowing to obtain rank nearly equal to the optimal rank at a fraction of the cost of the naive algorithm.