Researcher profile

O. Anatole von Lilienfeld

O. Anatole von Lilienfeld contributes to research discovery and scholarly infrastructure.

Published work

22 published items

preprint, 2022, arXiv

Ab initio machine learning of phase space averages

Equilibrium structures determine material properties and biochemical functions. We propose to machine learn phase-space averages, conventionally obtained by {\em ab initio} or force-field based molecular dynamics (MD) or Monte Carlo (MC) simulations. In analogy to {\em ab initio} molecular dynamics (AIMD), our {\em ab initio} machine learning (AIML) model does not require bond topologies and therefore enables a general machine learning pathway to ensemble properties throughout chemical compound space (CCS). We demonstrate AIML for predicting Boltzmann averaged structures after training on hundreds of MD trajectories. AIML output is subsequently used to train machine learning models of free energies of solvation using experimental data, reaching competitive prediction errors (MAE $\sim$ 0.8 kcal/mol) for out-of-sample molecules within milliseconds. As such, AIML effectively bypasses the need for MD or MC-based phase space sampling, enabling exploration campaigns throughout CCS at a much accelerated pace. We contextualize our findings by comparison to state-of-the-art methods resulting in a Pareto plot for the free energy of solvation predictions in terms of accuracy and time.
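
As a point of reference for what a "Boltzmann averaged" quantity is, the following is a minimal sketch of a canonical-ensemble average over a handful of discrete states. It is not the AIML model from the preprint; the function name, the toy property values, and the toy energies are hypothetical and serve only to illustrate the weighting.

```python
import math

K_B = 1.987204e-3  # Boltzmann constant in kcal/(mol K)


def boltzmann_average(values, energies, temperature=298.15):
    """Canonical average of `values` over discrete states.

    `energies` are in kcal/mol, `temperature` in kelvin.
    """
    beta = 1.0 / (K_B * temperature)
    e_min = min(energies)  # shift energies so the largest weight is 1
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    partition = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / partition


# Hypothetical toy data: one property value per conformer.
degenerate = boltzmann_average([1.0, 3.0], [0.0, 0.0])   # equal energies
dominated = boltzmann_average([1.0, 3.0], [0.0, 10.0])   # second state 10 kcal/mol higher
```

With equal energies both states contribute equally; with a 10 kcal/mol gap at room temperature the average is dominated by the lower-energy state.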

preprint, 2022, arXiv

Alchemical geometry relaxation

We propose to relax geometries throughout chemical compound space (CCS) using alchemical perturbation density functional theory (APDFT). APDFT refers to perturbation theory involving changes in nuclear charges within approximate solutions to Schrödinger's equation. We give an analytical formula to calculate the mixed second order energy derivatives with respect to both nuclear charges and nuclear positions (named "alchemical force"), within the restricted Hartree-Fock case. We have implemented and studied the formula for its use in geometry relaxation of various reference and target molecules. We have also analysed the convergence of the alchemical force perturbation series, as well as basis set effects. Interpolating alchemically predicted energies, forces, and Hessian to a Morse potential yields more accurate geometries and equilibrium energies than when performing a standard Newton-Raphson step. Our numerical predictions for small molecules including BF, CO, N$_2$, CH$_4$, NH$_3$, H$_2$O, and HF yield mean absolute errors of equilibrium energies and bond lengths smaller than 10 mHa and 0.01 Bohr for 4$^\text{th}$ order APDFT predictions, respectively. Our alchemical geometry relaxation still preserves the combinatorial efficiency of APDFT: Based on a single coupled perturbed Hartree-Fock derivative for benzene we provide numerical predictions of equilibrium energies and relaxed structures of all the 17 iso-electronic charge-neutral BN-doped mutants with averaged absolute deviations of $\sim$27 mHa and $\sim$0.12 Bohr, respectively.
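
To illustrate why a single Newton-Raphson step can miss the minimum of an anharmonic bond potential, here is a small sketch that applies one such step to an exact Morse curve. It does not reproduce the APDFT derivatives or the Morse interpolation of the preprint; the Morse parameters and the starting bond length are hypothetical.

```python
import math


def morse_derivatives(r, depth, width, r_eq):
    """Gradient and Hessian of V(r) = depth * (1 - exp(-width * (r - r_eq)))**2."""
    u = math.exp(-width * (r - r_eq))
    grad = 2.0 * depth * width * (u - u * u)
    hess = 2.0 * depth * width ** 2 * u * (2.0 * u - 1.0)
    return grad, hess


def newton_raphson_step(r, grad, hess):
    """One harmonic (second-order) step towards a stationary point."""
    return r - grad / hess


# Hypothetical parameters in arbitrary units.
depth, width, r_eq = 1.0, 1.0, 2.0
r_start = 2.2  # stretched bond
grad, hess = morse_derivatives(r_start, depth, width, r_eq)
r_next = newton_raphson_step(r_start, grad, hess)
```

The step moves closer to `r_eq` but overshoots it, because the harmonic model ignores the asymmetry of the Morse well.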

preprint, 2022, arXiv

An orbital-based representation for accurate Quantum Machine Learning

We introduce an electronic structure based representation for quantum machine learning (QML) of electronic properties throughout chemical compound space. The representation is constructed using computationally inexpensive ab initio calculations and explicitly accounts for changes in the electronic structure. We demonstrate the accuracy and flexibility of resulting QML models when applied to property labels such as total potential energy, HOMO and LUMO energies, ionization potential, and electron affinity, using as data sets for training and testing entries from the QM7b, QM7b-T, QM9, and LIBE libraries. For the latter, we also demonstrate the ability of this approach to account for molecular species of different charge and spin multiplicity, resulting in QML models that infer total potential energies based on geometry, charge, and spin as input.

preprint, 2021, arXiv

Elucidating atmospheric brown carbon -- Supplanting chemical intuition with exhaustive enumeration and machine learning

To unravel the structures of C12H12O7 isomers, identified as light-absorbing photooxidation products of syringol in atmospheric chamber experiments, we apply a graph-based molecule generator and machine learning workflow. To accomplish this in a bias-free manner, molecular graphs of the entire chemical subspace of C12H12O7 were generated, assuming that the isomers contain two C6-rings; this led to 260 million molecular graphs and 120 million stable structures. Using quantum chemistry excitation energies and oscillator strengths as training data, we predicted these quantities using kernel ridge regression and simulated UV/Vis absorption spectra. Then we determined the probability of the molecules to cause the experimental spectrum within the errors of the different methods. Molecules whose spectra were likely to match the experimental spectrum were clustered according to structural features, resulting in clusters of > 500,000 molecules. While we identified several features that correlate with a high probability to cause the experimental spectrum, no clear composition of necessary features can be given. Thus, the absorption spectrum is not sufficient to uniquely identify one specific isomer structure. If more structural features were known from experimental data, the number of structures could be reduced to a few tens of thousands of candidates. We offer a procedure to detect when sufficient fragmentation data has been included to reduce the number of possible molecules. The most efficient strategy to obtain valid candidates is to apply structural data already at the bias-free molecule generation stage. The systematic enumeration, however, is necessary to avoid mis-identification of molecules, while it guarantees that there are no other molecules that would also fit the spectrum in question.

preprint, 2021, arXiv

Simplifying inverse material design problems for fixed lattices with alchemical chirality

Massive brute-force compute campaigns relying on demanding ab initio calculations routinely search for novel materials in chemical compound space, the vast virtual set of all conceivable stable combinations of elements and structural configurations which form matter. Here we demonstrate that 4-dimensional chirality, arising from anti-symmetry of alchemical perturbations, dissects that space and defines approximate ranks which effectively reduce its formal dimensionality, and enable us to break down its combinatorial scaling. The resulting distinct 'alchemical' enantiomers must share the exact same electronic energy up to third order, independent of respective covalent bond topology, and imposing relevant constraints on chemical bonding. Alchemical chirality deepens our understanding of chemical compound space and enables the 'on-the-fly' establishment of new trends without empiricism for any materials with fixed lattices. We demonstrate its efficacy for three such cases: i) new formulas for estimating electronic energy contributions to chemical bonding; ii) analysis of the perturbed electron density of BN doped benzene; and iii) ranking stability estimates for BN doping in over 2,000 naphthalene and over 400 million picene derivatives.

preprint, 2020, arXiv

Data Enhanced Reaction Predictions in Chemical Space With Hammett's Equation

By separating the effect of substituents from chemical process variables, such as reaction mechanism, solvent, or temperature, the Hammett equation enables control of chemical reactivity throughout chemical space. We used global regression to optimize Hammett parameters $ρ$ and $σ$ in two datasets, experimental rate constants for benzylbromides reacting with thiols and the decomposition of ammonium salts, and a synthetic dataset consisting of computational activation energies of $\sim$ 1400 $S_N2$ reactions, with various nucleophiles and leaving groups (-H, -F, -Cl, -Br) and functional groups (-H, -NO$_2$, -CN, -NH$_3$, -CH$_3$). The original approach is generalized to predict potential energies of activation in non-aromatic molecular scaffolds with multiple substituents. Individual substituents contribute additively to molecular $σ$ with a unique regression term, which quantifies the inductive effect. Moreover, the position dependence of the substituent can be replaced by a distance-decaying factor for $S_N2$. Use of the Hammett equation as a baseline model for $Δ$-machine learning models of the activation energy in chemical space results in substantially improved learning curves for small training set sizes.
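
The Hammett relation and its use as a baseline can be sketched as follows, assuming the textbook form log10(k/k0) = ρσ with additive substituent constants. The parameter values, energies, and function names are hypothetical illustrations, not the fitted parameters or the code of the preprint.

```python
def hammett_log_rate_ratio(rho, sigmas):
    """Textbook Hammett relation: log10(k / k0) = rho * sum of substituent sigmas.

    Additivity of the sigmas over several substituents is assumed here.
    """
    return rho * sum(sigmas)


def delta_learning_targets(reference, baseline):
    """Residuals (reference minus baseline) that a Delta-ML model would be trained on."""
    return [ref - base for ref, base in zip(reference, baseline)]


# Hypothetical reaction constant and substituent constants.
rho = 2.0
sigmas = [0.5, -0.25]
log_ratio = hammett_log_rate_ratio(rho, sigmas)
rate_ratio = 10.0 ** log_ratio  # k / k0

# Hypothetical activation energies: reference values vs. baseline estimates.
residuals = delta_learning_targets([10.0, 12.0], [9.0, 12.5])
```

In a Δ-learning setup the machine learning model only has to capture `residuals`, which is typically a smaller and smoother target than the full activation energy.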

preprint, 2020, arXiv

Dictionary of 140k GDB and ZINC derived AMONs

We present all {\bf A}mons for {\bf G}DB and {\bf Z}inc databases using no more than 7 non-hydrogen atoms (AGZ7), a calculated organic chemistry building-block dictionary based on the AMON approach [Huang and von Lilienfeld, {\em Nature Chemistry} (2020)]. AGZ7 records Cartesian coordinates of compositional and constitutional isomers, as well as properties for $\sim$140k small organic molecules obtained by systematically fragmenting all molecules of Zinc and the majority of GDB17 into smaller entities, saturating with hydrogens, and containing no more than 7 heavy atoms (excluding hydrogen atoms). AGZ7 covers the elements \{H, B, C, N, O, F, Si, P, S, Cl, Br, Sn, and I\} and includes optimized geometries, total energy and its decomposition, Mulliken atomic charges, dipole moment vectors, quadrupole tensors, electronic spatial extent, eigenvalues of all occupied orbitals, LUMO, gap, isotropic polarizability, harmonic frequencies, reduced masses, force constants, IR intensity, normal coordinates, rotational constants, zero-point energy, internal energy, enthalpy, entropy, free energy, and heat capacity (all at ambient conditions) using B3LYP/cc-pVTZ (pseudopotentials were used for Sn and I) level of theory. We exemplify the usefulness of this data set with AMON based machine learning models of total potential energy predictions of seven of the most rigid GDB-17 molecules.

preprint, 2020, arXiv

FCHL revisited: faster and more accurate quantum machine learning

We introduce the FCHL19 representation for atomic environments in molecules or condensed-phase systems. Machine learning models based on FCHL19 are able to yield predictions of atomic forces and energies of query compounds with chemical accuracy on the scale of milliseconds. FCHL19 is a revision of our previous work [Faber et al. 2018] where the representation is discretized and the individual features are rigorously optimized using Monte Carlo optimization. Combined with a Gaussian kernel function that incorporates elemental screening, chemical accuracy is reached for energy learning on the QM7b and QM9 datasets after training for minutes and hours, respectively. The model also shows good performance for non-bonded interactions in the condensed phase for a set of water clusters with an MAE binding energy error of less than 0.1 kcal/mol/molecule after training on 3,200 samples. For force learning on the MD17 dataset, our optimized model similarly displays state-of-the-art accuracy with a regressor based on Gaussian process regression. When the revised FCHL19 representation is combined with the operator quantum machine learning regressor, forces and energies can be predicted in only a few milliseconds per atom. The model presented herein is fast and lightweight enough for use in general chemistry problems as well as molecular dynamics simulations.

preprint, 2020, arXiv

Large yet bounded: Spin gap ranges in carbenes

Despite its relevance for chemistry, the electronic structure of free carbenes throughout chemical space has not yet been studied in a systematic manner. We explore a large and systematic carbene chemical space consisting of eight thousand diverse and common carbene scaffolds in their singlet and triplet states computed at controlled accuracy (higher order multireference level of theory) and with verified carbene character in the electronic structure. Originating in strong electron correlation, a hard upper limit for the singlet-triplet gap is found to emerge at around 30 kcal/mol for all the carbene classes in this chemical space. We also observe large vertical and adiabatic spin gap ranges within many carbene classes ($>$100 and $>$60 kcal/mol, respectively), and we report novel relationships between compositional, structural, and electronic degrees of freedom. Our QMspin database includes numerical results for $\approx$13'000 MRCI calculations on randomly selected carbene scaffolds.

preprint, 2020, arXiv

On the role of gradients for machine learning of molecular energies and forces

The accuracy of any machine learning potential can only be as good as the data used in the fitting process. The most efficient model therefore selects the training data that will yield the highest accuracy compared to the cost of obtaining the training data. We investigate the convergence of prediction errors of quantum machine learning models for organic molecules trained on energy and force labels, two common data types in molecular simulations. When training and predicting on different geometries corresponding to the same single molecule, we find that the inclusion of atomic forces in the training data increases the accuracy of the predicted energies and forces 7-fold, compared to models trained on energy only. Surprisingly, for models trained on sets of organic molecules of varying size and composition in non-equilibrium conformations, inclusion of forces in the training does not improve the predicted energies of unseen molecules in new conformations. Predicted forces, however, also improve about 7-fold. For the systems studied, we find that force labels and energy labels contribute equally per label to the convergence of the prediction errors. Choosing whether to include derivatives such as atomic forces in the training set should thus depend not only on the computational cost of acquiring the force labels for training, but also on the application domain, the property of interest, and the desirable size of the machine learning model. Based on our observations we describe key considerations for the creation of datasets for potential energy surfaces of molecules which maximize the efficiency of the resulting machine learning models.

preprint, 2020, arXiv

Quantum machine learning using atom-in-molecule-based fragments selected on-the-fly

First principles based exploration of chemical space deepens our understanding of chemistry, and might help with the design of new materials or experiments. Due to the computational cost of quantum chemistry methods and the immense number of theoretically possible stable compounds, comprehensive in-silico screening remains prohibitive. To overcome this challenge, we combine atoms-in-molecules based fragments, dubbed "amons" (A), with active learning in transferable quantum machine learning (ML) models. The efficiency, accuracy, scalability, and transferability of resulting AML models is demonstrated for important molecular quantum properties, such as energies, forces, atomic charges, NMR shifts, polarizabilities, and for systems ranging from organic molecules over 2D materials and water clusters to Watson-Crick DNA base-pairs and even ubiquitin. Conceptually, the AML approach extends Mendeleev's table to effectively account for chemical environments, which allows the systematic reconstruction of many chemistries from local building blocks.

preprint, 2020, arXiv

Quantum-chemistry-aided identification, synthesis and experimental validation of model systems for conformationally controlled reaction studies: Separation of the conformers of 2,3-dibromobuta-1,3-diene in the gas phase

The Diels-Alder cycloaddition, in which a diene reacts with a dienophile to form a cyclic compound, counts among the most important tools in organic synthesis. Achieving a precise understanding of its mechanistic details on the quantum level requires new experimental and theoretical methods. Here, we present an experimental approach that separates different diene conformers in a molecular beam as a prerequisite for the investigation of their individual cycloaddition reaction kinetics and dynamics under single-collision conditions in the gas phase. A low- and high-level quantum-chemistry-based screening of more than one hundred dienes identified 2,3-dibromobutadiene (DBB) as an optimal candidate for efficient separation of its gauche and s-trans conformers by electrostatic deflection. A preparation method for DBB was developed which enabled the generation of dense molecular beams of this compound. The theoretical predictions of the molecular properties of DBB were validated by the successful separation of the conformers in the molecular beam. A marked difference in photofragment ion yields of the two conformers upon femtosecond-laser pulse ionization was observed, pointing at a pronounced conformer-specific fragmentation dynamics of ionized DBB. Our work sets the stage for a rigorous examination of mechanistic models of cycloaddition reactions under controlled conditions in the gas phase.

preprint, 2020, arXiv

Thousands of reactants and transition states for competing E2 and S$_\text{N}$2 reactions

Reaction barriers are a crucial ingredient for first principles based computational retro-synthesis efforts as well as for comprehensive reactivity assessments throughout chemical compound space. While extensive databases of experimental results exist, modern quantum machine learning applications require atomistic details which can only be obtained from quantum chemistry protocols. For competing E2 and S$_\text{N}$2 reaction channels we report 4'466 transition state and 143'200 reactant complex geometries and energies at respective MP2/6-311G(d) and single point DF-LCCSD/cc-pVTZ level of theory covering the chemical compound space spanned by the substituents NO$_2$, CN, CH$_3$, and NH$_2$ and early halogens (F, Cl, Br) as nucleophiles and leaving groups. Reactants are chosen such that the activation energies of the competing E2 and S$_\text{N}$2 reactions are of comparable magnitude. The correct concerted motion for each of the one-step reactions has been validated for all transition states. We demonstrate how quantum machine learning models can support data set extension, and discuss the distribution of key internal coordinates of the transition states.

preprint, 2019, arXiv

Alchemical perturbation density functional theory (APDFT)

We introduce an orbital free electron density functional approximation based on alchemical perturbation theory. Given convergent perturbations of a suitable reference system, the accuracy of popular self-consistent Kohn-Sham density functional estimates of properties of new molecules can be systematically surpassed at negligible cost. The associated energy functional is an approximation to the integrated energy derivative, requiring only perturbed reference electron densities: No self-consistent field equations are necessary to estimate energies and electron densities. Electronic ground state properties considered include covalent bonding potentials, atomic forces, as well as dipole and quadrupole moments.
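
The "integrated energy derivative" can be written out as a short worked relation. This is a sketch under stated assumptions, not the preprint's exact functional: a linear coupling $\hat{H}(\lambda) = \hat{H}_A + \lambda(\hat{H}_B - \hat{H}_A)$ between reference $A$ and target $B$ is assumed, the first derivative follows from the Hellmann-Feynman theorem, and the truncation order $N$ is a free choice.

```latex
\begin{align}
E_B - E_A &= \int_0^1 \frac{\partial E(\lambda)}{\partial \lambda}\, d\lambda, \\
\frac{\partial E(\lambda)}{\partial \lambda} &= \left\langle \psi_\lambda \right| \hat{H}_B - \hat{H}_A \left| \psi_\lambda \right\rangle, \\
E_B &\approx E_A + \sum_{n=1}^{N} \frac{1}{n!} \left. \frac{\partial^n E(\lambda)}{\partial \lambda^n} \right|_{\lambda = 0}.
\end{align}
```

When $A$ and $B$ differ only in their nuclear charges, $\hat{H}_B - \hat{H}_A$ reduces to a change in the external potential plus a constant nuclear repulsion term, so the expectation value in the second line can be evaluated from the electron density alone.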

preprint, 2019, arXiv

Atoms in molecules from alchemical perturbation density functional theory

Based on thermodynamic integration we introduce atoms in molecules (AIM) using the orbital-free framework of alchemical perturbation density functional theory (APDFT). Within APDFT, atomic energies and electron densities in molecules are arbitrary because any arbitrary reference system and integration path can be selected as long as it meets the boundary conditions. We choose the uniform electron gas as the most generic reference, and linearly scale up all nuclear charges, situated at any query molecule's atomic coordinates. Within the approximations made when calculating one-particle electron densities, this choice affords exact and unambiguous definitions of energies and electron densities of AIMs. We illustrate the approach for neutral iso-electronic diatomics (CO, N$_2$, BF), various small molecules with different electronic hybridisation states of carbon (CH$_4$, C$_2$H$_6$, C$_2$H$_4$, C$_2$H$_2$, HCN), and for all the possible BN doped mutants connecting benzene to borazine (C$_{2n}$B$_{3-n}$N$_{3-n}$H$_6$, $0 \le n \le 3$). Analysis of the numerical results obtained suggests that APDFT based AIMs enable meaningful and new interpretations of molecular energies and electron densities.

preprint, 2019, arXiv

Machine learning the computational cost of quantum chemistry

Computational quantum mechanics based molecular and materials design campaigns consume increasingly more high-performance compute resources, making improved job scheduling efficiency desirable in order to reduce carbon footprint or wasteful spending. We introduce quantum machine learning (QML) models of the computational cost of common quantum chemistry tasks. For 2D non-linear toy systems, single point, geometry optimization, and transition state calculations, the out-of-sample prediction error of QML models of wall times decays systematically with training set size. We present numerical evidence for a toy system containing two functions and three commonly used optimizers, and for thousands of organic molecular systems including closed and open shell equilibrium structures, as well as transition states. Levels of electronic structure theory considered include B3LYP/def2-TZVP, MP2/6-311G(d), local CCSD(T)/VTZ-F12, CASSCF/VDZ-F12, and MRCISD+Q-F12/VDZ-F12. In comparison to conventional indiscriminate job treatment, QML based wall time predictions significantly improve job scheduling efficiency for all tasks after training on just thousands of molecules. Resulting reductions in CPU time overhead range from 10% to 90%.

preprint, 2017, arXiv

Machine learning prediction errors better than DFT accuracy

We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to $\sim$117k distinct molecules. Molecular structures and properties at hybrid density functional theory (DFT) level of theory used for training and testing come from the QM9 database [Ramakrishnan et al., {\em Scientific Data} {\bf 1} 140022 (2014)] and include dipole moment, polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero point vibrational energy, enthalpies and free energies of atomization, heat capacity and the highest fundamental vibrational frequency. Various representations from the literature have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR) and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). We present numerical evidence that ML model predictions deviate from DFT less than DFT deviates from experiment for all properties. Furthermore, our out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. Our findings suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data was available.
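
A learning curve of this kind is often summarized by a straight line on a log-log plot. The following minimal sketch fits such a line, assuming the error follows a power law error = a * N^(-b) in the training set size N. That functional form is an assumption for illustration, and the data points are synthetic rather than results from the preprint.

```python
import numpy as np


def fit_learning_curve(n_train, errors):
    """Least-squares fit of log(error) = log(a) - b * log(N); returns (a, b)."""
    slope, intercept = np.polyfit(np.log(n_train), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic learning curve generated from a known power law (hypothetical numbers).
n_train = np.array([1e2, 1e3, 1e4, 1e5])
errors = 50.0 * n_train ** -0.5

prefactor, exponent = fit_learning_curve(n_train, errors)
```

A larger fitted exponent means the out-of-sample error drops faster as training data is added, which is one way to compare regressor and representation combinations.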

preprint, 2015, arXiv

Crystal Structure Representations for Machine Learning Models of Formation Energies

We introduce and evaluate a set of feature vector representations of crystal structures for machine learning (ML) models of formation energies of solids. ML models of atomization energies of organic molecules have been successful using a Coulomb matrix representation of the molecule. We consider three ways to generalize such representations to periodic systems: (i) a matrix where each element is related to the Ewald sum of the electrostatic interaction between two different atoms in the unit cell repeated over the lattice; (ii) an extended Coulomb-like matrix that takes into account a number of neighboring unit cells; and (iii) an Ansatz that mimics the periodicity and the basic features of the elements in the Ewald sum matrix by using a sine function of the crystal coordinates of the atoms. The representations are compared for a Laplacian kernel with Manhattan norm, trained to reproduce formation energies using a data set of 3938 crystal structures obtained from the Materials Project. For training sets consisting of 3000 crystals, the generalization error in predicting formation energies of new structures corresponds to (i) 0.49, (ii) 0.64, and (iii) 0.37 eV/atom for the respective representations.
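
The regression setup named above, a Laplacian kernel with Manhattan (L1) norm, can be sketched with plain NumPy kernel ridge regression. The feature vectors, targets, kernel width `sigma`, and regularization `lam` below are hypothetical toy choices, and the crystal representations themselves are not implemented here.

```python
import numpy as np


def laplacian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||_1 / sigma) for every pair of rows in A and B."""
    l1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1)
    return np.exp(-l1 / sigma)


def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = laplacian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_query, sigma):
    """Predict targets for X_query as kernel-weighted sums over training points."""
    return laplacian_kernel(X_query, X_train, sigma) @ alpha


# Toy data standing in for (representation, formation energy) pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sin(X).sum(axis=1)

alpha = krr_fit(X, y, sigma=2.0, lam=1e-10)
train_pred = krr_predict(X, alpha, X, sigma=2.0)
```

With a very small `lam` the model interpolates the training targets; in practice `sigma` and `lam` would be chosen by cross-validation.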

preprint, 2015, arXiv

Fourier series of atomic radial distribution functions: A molecular fingerprint for machine learning models of quantum chemical properties

We introduce a fingerprint representation of molecules based on a Fourier series of atomic radial distribution functions. This fingerprint is unique (except for chirality), continuous, and differentiable with respect to atomic coordinates and nuclear charges. It is invariant with respect to translation, rotation, and nuclear permutation, and requires no pre-conceived knowledge about chemical bonding, topology, or electronic orbitals. As such it meets many important criteria for a good molecular representation, suggesting its usefulness for machine learning models of molecular properties trained across chemical compound space. To assess the performance of this new descriptor we have trained machine learning models of molecular enthalpies of atomization for training sets with up to 10k organic molecules, drawn at random from a published set of 134k organic molecules. We validate the descriptor on all remaining molecules of the 134k set. For a training set of 5k molecules the fingerprint descriptor achieves a mean absolute error of 8.0 kcal/mol. This is slightly worse than the performance attained using the Coulomb matrix, another popular alternative, reaching 6.2 kcal/mol for the same training and test sets.
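
The invariance criteria listed above can be checked numerically. The sketch below does this for the Coulomb matrix, the comparison representation named in the abstract, using its sorted eigenvalues as the descriptor; it does not implement the Fourier-series fingerprint itself. The water-like geometry, rotation angle, and shift are hypothetical.

```python
import numpy as np


def coulomb_matrix(Z, R):
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal, Z_i * Z_j / |R_i - R_j| elsewhere."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    with np.errstate(divide="ignore"):
        M = np.outer(Z, Z) / dist  # diagonal is inf here and is overwritten below
    np.fill_diagonal(M, 0.5 * Z ** 2.4)
    return M


def cm_eigenvalues(Z, R):
    """Eigenvalue spectrum of the Coulomb matrix (ascending order)."""
    return np.linalg.eigvalsh(coulomb_matrix(Z, R))


# Hypothetical water-like geometry (nuclear charges, coordinates in Bohr).
Z = np.array([8.0, 1.0, 1.0])
R = np.array([[0.0, 0.0, 0.0], [1.8, 0.0, 0.0], [-0.45, 1.74, 0.0]])

# Rotate about z, translate, and permute the atoms.
angle = 0.7
rot = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                [np.sin(angle), np.cos(angle), 0.0],
                [0.0, 0.0, 1.0]])
perm = [2, 0, 1]
Z_new = Z[perm]
R_new = (R @ rot.T + np.array([1.0, -2.0, 3.0]))[perm]

same_descriptor = np.allclose(cm_eigenvalues(Z, R), cm_eigenvalues(Z_new, R_new))
```

The same three transformations can serve as a quick sanity check for any candidate molecular representation.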

preprint, 2015, arXiv

Many Molecular Properties from One Kernel in Chemical Space

We introduce property-independent kernels for machine learning modeling of arbitrarily many molecular properties. The kernels encode molecular structures for training sets of varying size, as well as similarity measures sufficiently diffuse in chemical space to sample over all training molecules. Corresponding molecular reference properties provided, they enable the instantaneous generation of ML models which can systematically be improved through the addition of more data. This idea is exemplified for single kernel based modeling of internal energy, enthalpy, free energy, heat capacity, polarizability, electronic spread, zero-point vibrational energy, energies of frontier orbitals, HOMO-LUMO gap, and the highest fundamental vibrational wavenumber. Models of these properties are trained and tested using 112k organic molecules of similar size. Resulting models are discussed as well as the kernels' use for generating and using other property models.

preprint, 2015, arXiv

Quantum Mechanical Treatment of Variable Molecular Composition: From "Alchemical" Changes of State Functions to Rational Compound Design

"Alchemical" interpolation paths, i.e., coupling systems along fictitious paths without realistic correspondence, are frequently used within materials and molecular modeling and simulation protocols for the estimation of relative changes in state functions such as free energies. We discuss alchemical changes in the context of quantum chemistry, and present illustrative numerical results for the changes of HOMO eigenvalues of the He atom due to a linear alchemical teleportation, i.e., the simultaneous annihilation and creation of nuclear charges at different locations. To demonstrate the predictive power of alchemical first order derivatives (Hellmann-Feynman) the covalent bond potential of hydrogen fluoride and hydrogen chloride is investigated, as well as the van-der-Waals binding in the water-water and water-hydrogen fluoride dimer, respectively. Based on converged electron densities for one configuration, the versatility of alchemical derivatives is exemplified for the screening of entire binding potentials with reasonable accuracy. Finally, we discuss constraints for the identification of non-linear coupling potentials for which the energy's Hellmann-Feynman derivative will yield accurate predictions.

preprint, 2015, arXiv

Water on hexagonal boron nitride from diffusion Monte Carlo

Despite a recent flurry of experimental and simulation studies, an accurate estimate of the interaction strength of water molecules with hexagonal boron nitride is lacking. Here we report quantum Monte Carlo results for the adsorption of a water monomer on a periodic hexagonal boron nitride sheet, which yield a water monomer interaction energy of -84 ± 5 meV. We use the results to evaluate the performance of several widely used density functional theory (DFT) exchange correlation functionals, and find that they all deviate substantially. Differences in interaction energies between different adsorption sites are, however, better reproduced by DFT.