Source author record

O. Anatole von Lilienfeld

O. Anatole von Lilienfeld appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

physics.chem-ph, cond-mat.mtrl-sci, Machine Learning, physics.comp-ph, cond-mat.dis-nn, cond-mat.str-el, cond-mat.mes-hall, cond-mat.other, physics.atm-clus, quant-ph

Catalog footprint

What is connected

39 works

10 topics

4 close collaborators

preprint 2022 arXiv

Ab initio machine learning of phase space averages

Equilibrium structures determine material properties and biochemical functions. We propose to machine learn phase-space averages, conventionally obtained by {\em ab initio} or force-field based molecular dynamics (MD) or Monte Carlo (MC) simulations. In analogy to \textit{ab initio} molecular dynamics (AIMD), our {\em ab initio} machine learning (AIML) model does not require bond topologies and therefore enables a general machine learning pathway to ensemble properties throughout chemical compound space (CCS). We demonstrate AIML for predicting Boltzmann averaged structures after training on hundreds of MD trajectories. AIML output is subsequently used to train machine learning models of free energies of solvation using experimental data, reaching competitive prediction errors (MAE $\sim$ 0.8 kcal/mol) for out-of-sample molecules -- within milliseconds. As such, AIML effectively bypasses the need for MD or MC-based phase space sampling, enabling exploration campaigns throughout CCS at a much accelerated pace. We contextualize our findings by comparison to state-of-the-art methods resulting in a Pareto plot for the free energy of solvation predictions in terms of accuracy and time.

preprint 2022 arXiv

Alchemical geometry relaxation

We propose to relax geometries throughout chemical compound space (CCS) using alchemical perturbation density functional theory (APDFT). APDFT refers to perturbation theory involving changes in nuclear charges within approximate solutions to Schrödinger's equation. We give an analytical formula to calculate the mixed second order energy derivatives with respect to both nuclear charges and nuclear positions (named "alchemical force"), within the restricted Hartree-Fock case. We have implemented and studied the formula for its use in geometry relaxation of various reference and target molecules. We have also analysed the convergence of the alchemical force perturbation series, as well as basis set effects. Interpolating alchemically predicted energies, forces, and Hessian to a Morse potential yields more accurate geometries and equilibrium energies than when performing a standard Newton-Raphson step. Our numerical predictions for small molecules including BF, CO, N$_2$, CH$_4$, NH$_3$, H$_2$O, and HF yield mean absolute errors of equilibrium energies and bond lengths smaller than 10 mHa and 0.01 Bohr for 4$^\text{th}$ order APDFT predictions, respectively. Our alchemical geometry relaxation still preserves the combinatorial efficiency of APDFT: Based on a single coupled perturbed Hartree-Fock derivative for benzene we provide numerical predictions of equilibrium energies and relaxed structures of all the 17 iso-electronic charge-neutral BN-doped mutants with averaged absolute deviations of $\sim$27 mHa and $\sim$0.12 Bohr, respectively.
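The idea of a perturbation series in the nuclear charge can be illustrated on a toy system that is not taken from the paper: a one-electron (hydrogen-like) ion, whose exact ground-state energy in Hartree is E(Z) = -Z^2/2. The sketch below is only an illustration under that assumption; the function names are hypothetical, and the derivatives are analytic here rather than obtained from coupled perturbed Hartree-Fock as in the abstract.

```python
from math import factorial


def hydrogenic_energy(z: float) -> float:
    """Exact ground-state energy (Hartree) of a one-electron ion with nuclear charge z."""
    return -0.5 * z * z


def energy_derivatives(z_ref: float) -> list[float]:
    """Derivatives of E(Z) = -Z^2/2 at z_ref for orders 0, 1, 2 (higher orders vanish)."""
    return [hydrogenic_energy(z_ref), -z_ref, -1.0]


def alchemical_estimate(z_ref: float, z_target: float, order: int) -> float:
    """Taylor estimate of the target energy, truncated at the given order."""
    dz = z_target - z_ref
    derivs = energy_derivatives(z_ref)
    return sum(d * dz**n / factorial(n) for n, d in enumerate(derivs[: order + 1]))
```

For the change from H (Z = 1) to He+ (Z = 2), `alchemical_estimate(1.0, 2.0, order)` can be compared against `hydrogenic_energy(2.0)` at each order. In this toy case the series terminates at second order because the energy is quadratic in Z; for real molecules the series does not terminate, and its convergence has to be studied numerically.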

preprint 2022 arXiv

An orbital-based representation for accurate Quantum Machine Learning

We introduce an electronic structure based representation for quantum machine learning (QML) of electronic properties throughout chemical compound space. The representation is constructed using computationally inexpensive ab initio calculations and explicitly accounts for changes in the electronic structure. We demonstrate the accuracy and flexibility of resulting QML models when applied to property labels such as total potential energy, HOMO and LUMO energies, ionization potential, and electron affinity, using as data sets for training and testing entries from the QM7b, QM7b-T, QM9, and LIBE libraries. For the latter, we also demonstrate the ability of this approach to account for molecular species of different charge and spin multiplicity, resulting in QML models that infer total potential energies based on geometry, charge, and spin as input.

preprint 2021 arXiv

Elucidating atmospheric brown carbon -- Supplanting chemical intuition with exhaustive enumeration and machine learning

To unravel the structures of C12H12O7 isomers, identified as light-absorbing photooxidation products of syringol in atmospheric chamber experiments, we apply a graph-based molecule generator and machine learning workflow. To accomplish this in a bias-free manner, molecular graphs of the entire chemical subspace of C12H12O7 were generated, assuming that the isomers contain two C6-rings; this led to 260 million molecular graphs and 120 million stable structures. Using quantum chemistry excitation energies and oscillator strengths as training data, we predicted these quantities using kernel ridge regression and simulated UV/Vis absorption spectra. Then we determined the probability of the molecules to cause the experimental spectrum within the errors of the different methods. Molecules whose spectra were likely to match the experimental spectrum were clustered according to structural features, resulting in clusters of > 500,000 molecules. While we identified several features that correlate with a high probability to cause the experimental spectrum, no clear composition of necessary features can be given. Thus, the absorption spectrum is not sufficient to uniquely identify one specific isomer structure. If more structural features were known from experimental data, the number of structures could be reduced to a few tens of thousands of candidates. We offer a procedure to detect when sufficient fragmentation data has been included to reduce the number of possible molecules. The most efficient strategy for obtaining valid candidates is to apply structural data already at the bias-free molecule generation stage. The systematic enumeration, however, is necessary to avoid mis-identification of molecules, while it guarantees that there are no other molecules that would also fit the spectrum in question.

preprint 2021 arXiv

Simplifying inverse material design problems for fixed lattices with alchemical chirality

Massive brute-force compute campaigns relying on demanding ab initio calculations routinely search for novel materials in chemical compound space, the vast virtual set of all conceivable stable combinations of elements and structural configurations which form matter. Here we demonstrate that 4-dimensional chirality, arising from anti-symmetry of alchemical perturbations, dissects that space and defines approximate ranks which effectively reduce its formal dimensionality, and enable us to break down its combinatorial scaling. The resulting distinct `alchemical' enantiomers must share the exact same electronic energy up to third order -- independent of respective covalent bond topology, and imposing relevant constraints on chemical bonding. Alchemical chirality deepens our understanding of chemical compound space and enables the `on-the-fly' establishment of new trends without empiricism for any materials with fixed lattices. We demonstrate its efficacy for three such cases: i) new formulas for estimating electronic energy contributions to chemical bonding; ii) analysis of the perturbed electron density of BN doped benzene; and iii) ranking stability estimates for BN doping in over 2,000 naphthalene and over 400 million picene derivatives.

preprint 2020 arXiv

Data Enhanced Reaction Predictions in Chemical Space With Hammett's Equation

By separating the effect of substituents from chemical process variables, such as reaction mechanism, solvent, or temperature, the Hammett equation enables control of chemical reactivity throughout chemical space. We used global regression to optimize Hammett parameters $ρ$ and $σ$ in two datasets, experimental rate constants for benzylbromides reacting with thiols and the decomposition of ammonium salts, and a synthetic dataset consisting of computational activation energies of $\sim$ 1400 $S_N2$ reactions, with various nucleophiles and leaving groups (-H, -F, -Cl, -Br) and functional groups (-H, -NO$_2$, -CN, -NH$_3$, -CH$_3$). The original approach is generalized to predict potential energies of activation in non-aromatic molecular scaffolds with multiple substituents. Individual substituents contribute additively to molecular $σ$ with a unique regression term, which quantifies the inductive effect. Moreover, the position dependence of the substituent can be replaced by a distance decaying factor for $S_N2$. Use of the Hammett equation as a base-line model for $Δ$-machine learning models of the activation energy in chemical space results in substantially improved learning curves for small training set sizes.
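The Hammett relation itself is the textbook linear form log10(k/k0) = ρσ, and a minimal sketch of fitting it may help make the abstract concrete. The code below is a simplification, not the paper's global regression: it fits only the reaction constant ρ for one reaction from given substituent constants σ, and it treats the molecular σ as a plain sum over substituents. All function names and any numbers passed in are hypothetical.

```python
import numpy as np


def fit_rho(sigma: np.ndarray, log_rate_ratio: np.ndarray) -> float:
    """Least-squares slope through the origin for log10(k/k0) = rho * sigma."""
    return float(np.dot(sigma, log_rate_ratio) / np.dot(sigma, sigma))


def molecular_sigma(substituent_sigmas: list[float]) -> float:
    """Additive assumption: the scaffold's sigma is the sum over its substituents."""
    return float(sum(substituent_sigmas))


def predict_log_rate_ratio(rho: float, sigma: float) -> float:
    """Hammett prediction of log10(k/k0) for a given molecular sigma."""
    return rho * sigma
```

A usage pattern would be to call `fit_rho` on measured or computed data for singly substituted compounds, then combine `molecular_sigma` and `predict_log_rate_ratio` for a multiply substituted scaffold. In a Δ-machine learning setting, such a prediction would serve as the baseline, with a learned model correcting the residual.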

preprint 2020 arXiv

Dictionary of 140k GDB and ZINC derived AMONs

We present all {\bf A}mons for {\bf G}DB and {\bf Z}inc databases using no more than 7 non-hydrogen atoms (AGZ7)---a calculated organic chemistry building-block dictionary based on the AMON approach [Huang and von Lilienfeld, {\em Nature Chemistry} (2020)]. AGZ7 records Cartesian coordinates of compositional and constitutional isomers, as well as properties for $\sim$140k small organic molecules obtained by systematically fragmenting all molecules of Zinc and the majority of GDB17 into smaller entities, saturating with hydrogens, and containing no more than 7 heavy atoms (excluding hydrogen atoms). AGZ7 covers the elements \{H, B, C, N, O, F, Si, P, S, Cl, Br, Sn and I\} and includes optimized geometries, total energy and its decomposition, Mulliken atomic charges, dipole moment vectors, quadrupole tensors, electronic spatial extent, eigenvalues of all occupied orbitals, LUMO, gap, isotropic polarizability, harmonic frequencies, reduced masses, force constants, IR intensity, normal coordinates, rotational constants, zero-point energy, internal energy, enthalpy, entropy, free energy, and heat capacity (all at ambient conditions) using B3LYP/cc-pVTZ (pseudopotentials were used for Sn and I) level of theory. We exemplify the usefulness of this data set with AMON based machine learning models of total potential energy predictions of seven of the most rigid GDB-17 molecules.

preprint 2020 arXiv

FCHL revisited: faster and more accurate quantum machine learning

We introduce the FCHL19 representation for atomic environments in molecules or condensed-phase systems. Machine learning models based on FCHL19 are able to yield predictions of atomic forces and energies of query compounds with chemical accuracy on the scale of milliseconds. FCHL19 is a revision of our previous work [Faber et al. 2018] where the representation is discretized and the individual features are rigorously optimized using Monte Carlo optimization. Combined with a Gaussian kernel function that incorporates elemental screening, chemical accuracy is reached for energy learning on the QM7b and QM9 datasets after training for minutes and hours, respectively. The model also shows good performance for non-bonded interactions in the condensed phase for a set of water clusters with an MAE binding energy error of less than 0.1 kcal/mol/molecule after training on 3,200 samples. For force learning on the MD17 dataset, our optimized model similarly displays state-of-the-art accuracy with a regressor based on Gaussian process regression. When the revised FCHL19 representation is combined with the operator quantum machine learning regressor, forces and energies can be predicted in only a few milliseconds per atom. The model presented herein is fast and lightweight enough for use in general chemistry problems as well as molecular dynamics simulations.

preprint 2020 arXiv

Large yet bounded: Spin gap ranges in carbenes

Despite its relevance for chemistry, the electronic structure of free carbenes throughout chemical space has not yet been studied in a systematic manner. We explore a large and systematic carbene chemical space consisting of eight thousand diverse and common carbene scaffolds in their singlet and triplet state computed at controlled accuracy (higher order multireference level of theory) and with verified carbene character in the electronic structure. Originating in strong electron correlation, a hard upper limit for the singlet-triplet gap is found to emerge at around 30 kcal/mol for all the carbene classes in this chemical space. We also observe large vertical and adiabatic spin gap ranges within many carbene classes ($>$100 and $>$60 kcal/mol, respectively), and we report novel relationships between compositional, structural, and electronic degrees of freedom. Our QMspin data base includes numerical results for $\approx$13'000 MRCI calculations on randomly selected carbene scaffolds.

preprint 2020 arXiv

On the role of gradients for machine learning of molecular energies and forces

The accuracy of any machine learning potential can only be as good as the data used in the fitting process. The most efficient model therefore selects the training data that will yield the highest accuracy compared to the cost of obtaining the training data. We investigate the convergence of prediction errors of quantum machine learning models for organic molecules trained on energy and force labels, two common data types in molecular simulations. When training and predicting on different geometries corresponding to the same single molecule, we find that the inclusion of atomic forces in the training data increases the accuracy of the predicted energies and forces 7-fold, compared to models trained on energy only. Surprisingly, for models trained on sets of organic molecules of varying size and composition in non-equilibrium conformations, inclusion of forces in the training does not improve the predicted energies of unseen molecules in new conformations. Predicted forces, however, also improve about 7-fold. For the systems studied, we find that force labels and energy labels contribute equally per label to the convergence of the prediction errors. Choosing to include derivatives such as atomic forces in the training set or not should thus depend not only on the computational cost of acquiring the force labels for training, but also on the application domain, the property of interest, and the desirable size of the machine learning model. Based on our observations we describe key considerations for the creation of datasets for potential energy surfaces of molecules which maximize the efficiency of the resulting machine learning models.

preprint 2020 arXiv

Quantum machine learning using atom-in-molecule-based fragments selected on-the-fly

First principles based exploration of chemical space deepens our understanding of chemistry, and might help with the design of new materials or experiments. Due to the computational cost of quantum chemistry methods and the immense number of theoretically possible stable compounds, comprehensive in-silico screening remains prohibitive. To overcome this challenge, we combine atoms-in-molecules based fragments, dubbed "amons" (A), with active learning in transferable quantum machine learning (ML) models. The efficiency, accuracy, scalability, and transferability of resulting AML models are demonstrated for important molecular quantum properties, such as energies, forces, atomic charges, NMR shifts, and polarizabilities, and for systems ranging from organic molecules over 2D materials and water clusters to Watson-Crick DNA base-pairs and even ubiquitin. Conceptually, the AML approach extends Mendeleev's table to effectively account for chemical environments, which allows the systematic reconstruction of many chemistries from local building blocks.

preprint 2020 arXiv

Quantum-chemistry-aided identification, synthesis and experimental validation of model systems for conformationally controlled reaction studies: Separation of the conformers of 2,3-dibromobuta-1,3-diene in the gas phase

The Diels-Alder cycloaddition, in which a diene reacts with a dienophile to form a cyclic compound, counts among the most important tools in organic synthesis. Achieving a precise understanding of its mechanistic details on the quantum level requires new experimental and theoretical methods. Here, we present an experimental approach that separates different diene conformers in a molecular beam as a prerequisite for the investigation of their individual cycloaddition reaction kinetics and dynamics under single-collision conditions in the gas phase. A low- and high-level quantum-chemistry-based screening of more than one hundred dienes identified 2,3-dibromobutadiene (DBB) as an optimal candidate for efficient separation of its gauche and s-trans conformers by electrostatic deflection. A preparation method for DBB was developed which enabled the generation of dense molecular beams of this compound. The theoretical predictions of the molecular properties of DBB were validated by the successful separation of the conformers in the molecular beam. A marked difference in photofragment ion yields of the two conformers upon femtosecond-laser pulse ionization was observed, pointing at a pronounced conformer-specific fragmentation dynamics of ionized DBB. Our work sets the stage for a rigorous examination of mechanistic models of cycloaddition reactions under controlled conditions in the gas phase.

preprint 2020 arXiv

Thousands of reactants and transition states for competing E2 and S$_\text{N}$2 reactions

Reaction barriers are a crucial ingredient for first principles based computational retro-synthesis efforts as well as for comprehensive reactivity assessments throughout chemical compound space. While extensive databases of experimental results exist, modern quantum machine learning applications require atomistic details which can only be obtained from quantum chemistry protocols. For competing E2 and S$_\text{N}$2 reaction channels we report 4'466 transition state and 143'200 reactant complex geometries and energies at respective MP2/6-311G(d) and single point DF-LCCSD/cc-pVTZ level of theory covering the chemical compound space spanned by the substituents NO$_2$, CN, CH$_3$, and NH$_2$ and early halogens (F, Cl, Br) as nucleophiles and leaving groups. Reactants are chosen such that the activation energies of the competing E2 and S$_\text{N}$2 reactions are of comparable magnitude. The correct concerted motion for each of the one-step reactions has been validated for all transition states. We demonstrate how quantum machine learning models can support data set extension, and discuss the distribution of key internal coordinates of the transition states.

preprint 2019 arXiv

Alchemical perturbation density functional theory (APDFT)

We introduce an orbital free electron density functional approximation based on alchemical perturbation theory. Given convergent perturbations of a suitable reference system, the accuracy of popular self-consistent Kohn-Sham density functional estimates of properties of new molecules can be systematically surpassed---at negligible cost. The associated energy functional is an approximation to the integrated energy derivative, requiring only perturbed reference electron densities: No self-consistent field equations are necessary to estimate energies and electron densities. Electronic ground state properties considered include covalent bonding potentials, atomic forces, as well as dipole and quadrupole moments.
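The phrase "integrated energy derivative" can be written out with standard thermodynamic-integration and Hellmann-Feynman relations. The notation below is an assumption made for illustration, not quoted from the paper: $\lambda \in [0,1]$ linearly interpolates the external potential from reference to target at fixed electron number, $\Delta v(\mathbf{r})$ is the difference between target and reference external potentials, and $\rho_\lambda(\mathbf{r})$ is the electron density at coupling $\lambda$. Only the electronic energy is shown; the change in nuclear repulsion is added separately.

```latex
\begin{align}
E_\text{target} - E_\text{ref}
  &= \int_0^1 \mathrm{d}\lambda \, \frac{\partial E}{\partial \lambda}
   = \int_0^1 \mathrm{d}\lambda \int \mathrm{d}\mathbf{r} \,
     \Delta v(\mathbf{r}) \, \rho_\lambda(\mathbf{r}) \\
  &= \int \mathrm{d}\mathbf{r} \, \Delta v(\mathbf{r})
     \sum_{n=0}^{\infty} \frac{1}{(n+1)!}
     \left. \frac{\partial^n \rho_\lambda(\mathbf{r})}{\partial \lambda^n} \right|_{\lambda=0}
\end{align}
```

The second line follows from expanding $\rho_\lambda$ in a Taylor series around $\lambda = 0$ and integrating each power $\lambda^n$ over $[0,1]$, which contributes the factor $1/(n+1)$. Truncating the sum gives an estimate that needs only the reference density and its derivatives, consistent with the statement that no self-consistent field calculation is required for the target.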

preprint 2019 arXiv

Atoms in molecules from alchemical perturbation density functional theory

Based on thermodynamic integration we introduce atoms in molecules (AIM) using the orbital-free framework of alchemical perturbation density functional theory (APDFT). Within APDFT, atomic energies and electron densities in molecules are arbitrary because any arbitrary reference system and integration path can be selected as long as it meets the boundary conditions. We choose the uniform electron gas as the most generic reference, and linearly scale up all nuclear charges, situated at any query molecule's atomic coordinates. Within the approximations made when calculating one-particle electron densities, this choice affords exact and unambiguous definitions of energies and electron densities of AIMs. We illustrate the approach for neutral iso-electronic diatomics (CO, N$_2$, BF), various small molecules with different electronic hybridisation states of carbon (CH$_4$, C$_2$H$_6$, C$_2$H$_4$, C$_2$H$_2$, HCN), and for all the possible BN doped mutants connecting benzene to borazine (C$_{2n}$B$_{3-n}$N$_{3-n}$H$_6$, $0 \le n \le 3$). Analysis of the numerical results obtained suggests that APDFT based AIMs enable meaningful and new interpretations of molecular energies and electron densities.

preprint 2019 arXiv

Machine learning the computational cost of quantum chemistry

Computational quantum mechanics based molecular and materials design campaigns consume increasingly more high-performance compute resources, making improved job scheduling efficiency desirable in order to reduce carbon footprint or wasteful spending. We introduce quantum machine learning (QML) models of the computational cost of common quantum chemistry tasks. For 2D non-linear toy systems, single point, geometry optimization, and transition state calculations the out-of-sample prediction error of QML models of wall times decays systematically with training set size. We present numerical evidence for a toy system containing two functions and three commonly used optimizers and for thousands of organic molecular systems including closed and open shell equilibrium structures, as well as transition states. Levels of electronic structure theory considered include B3LYP/def2-TZVP, MP2/6-311G(d), local CCSD(T)/VTZ-F12, CASSCF/VDZ-F12, and MRCISD+Q-F12/VDZ-F12. In comparison to conventional indiscriminate job treatment, QML based wall time predictions significantly improve job scheduling efficiency for all tasks after training on just thousands of molecules. Resulting reductions in CPU time overhead range from 10% to 90%.

preprint 2017 arXiv

Machine learning prediction errors better than DFT accuracy

We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to $\sim$117k distinct molecules. Molecular structures and properties at hybrid density functional theory (DFT) level of theory used for training and testing come from the QM9 database [Ramakrishnan et al., {\em Scientific Data} {\bf 1} 140022 (2014)] and include dipole moment, polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero point vibrational energy, enthalpies and free energies of atomization, heat capacity and the highest fundamental vibrational frequency. Various representations from the literature have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), and angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR) and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). We present numerical evidence that ML model predictions deviate from DFT less than DFT deviates from experiment for all properties. Furthermore, our out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. Our findings suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data was available.
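A learning curve of the kind described above can be sketched with kernel ridge regression, one of the regressors named in the abstract. The example below is a minimal sketch under loud assumptions: it uses a synthetic one-dimensional sine target instead of molecular representations and QM9 labels, a Gaussian kernel with a hand-picked width, and a small fixed regularizer. None of these choices, nor the function names, come from the paper.

```python
import numpy as np


def gaussian_kernel(a: np.ndarray, b: np.ndarray, width: float) -> np.ndarray:
    """Gaussian kernel matrix between two sets of row vectors."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * width**2))


def krr_fit(x_train, y_train, width=1.0, reg=1e-6):
    """Solve (K + reg * I) alpha = y for the regression weights alpha."""
    k = gaussian_kernel(x_train, x_train, width)
    return np.linalg.solve(k + reg * np.eye(len(x_train)), y_train)


def krr_predict(x_query, x_train, alpha, width=1.0):
    """Predict as a kernel-weighted sum over the training points."""
    return gaussian_kernel(x_query, x_train, width) @ alpha


def target(x: np.ndarray) -> np.ndarray:
    """Synthetic stand-in for a molecular property (assumption, not real data)."""
    return np.sin(x[:, 0])


def learning_curve(sizes, seed=0):
    """Out-of-sample mean absolute error for each training set size."""
    rng = np.random.default_rng(seed)
    x_pool = rng.uniform(0.0, 2.0 * np.pi, size=(max(sizes), 1))
    x_test = np.linspace(0.0, 2.0 * np.pi, 200).reshape(-1, 1)
    maes = []
    for n in sizes:
        x_train = x_pool[:n]  # nested training sets of growing size
        alpha = krr_fit(x_train, target(x_train))
        pred = krr_predict(x_test, x_train, alpha)
        maes.append(float(np.mean(np.abs(pred - target(x_test)))))
    return maes
```

Calling `learning_curve([4, 16, 64])` returns one error per training set size; plotting these on log-log axes gives the usual learning-curve view, in which different regressor and representation combinations can be compared by their offset and slope.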

preprint 2016 arXiv

Fast and accurate predictions of covalent bonds in chemical space

We assess the predictive accuracy of perturbation theory based estimates of changes in covalent bonding due to linear alchemical interpolations among molecules. We have investigated $σ$ bonding to hydrogen, as well as $σ$ and $π$ bonding between main-group elements, occurring in small sets of iso-valence-electronic molecular species with elements drawn from second to fourth rows in the $p$-block of the periodic table. Numerical evidence suggests that first order estimates of covalent bonding potentials can achieve chemical accuracy if (i) the alchemical interpolation is vertical (fixed geometry), (ii) involves molecules containing elements in the third and fourth row of the periodic table, and (iii) a reference geometry is optimized. In this case, changes in the bonding potential become near-linear in coupling parameter, resulting in analytical predictions with very high accuracy ($\sim$1 kcal/mol). Second order estimates deteriorate the prediction. If initial and final molecules differ not only in composition but also in geometry, all estimates become substantially worse, with second order being slightly more accurate than first order. The independent particle approximation to the second order perturbation performs poorly when compared to the coupled perturbed or finite difference approach. Taylor series expansions up to fourth order of the potential energy curve of highly symmetric systems indicate a finite radius of convergence, as illustrated for the alchemical stretching of H$_2^+$. Numerical results are presented for (i) covalent bonds to hydrogen in 12 molecules with 8 valence electrons; (ii) main-group single bonds in 9 molecules with 14 valence electrons; (iii) main-group double bonds in 9 molecules with 12 valence electrons; (iv) main-group triple bonds in 9 molecules with 10 valence electrons; (v) H$_2^+$ single bond with 1 electron.

preprint 2016 arXiv

Genetic optimization of training sets for improved machine learning models of molecular properties

The training of molecular models of quantum mechanical properties based on statistical machine learning requires large datasets which exemplify the map from chemical structure to molecular property. Intelligent a priori selection of training examples is often difficult or impossible to achieve as prior knowledge may be sparse or unavailable. Ordinarily representative selection of training molecules from such datasets is achieved through random sampling. We use genetic algorithms for the optimization of training set composition consisting of tens of thousands of small organic molecules. The resulting machine learning models are considerably more accurate with respect to small randomly selected training sets: mean absolute errors for out-of-sample predictions are reduced to ~25% for enthalpies, free energies, and zero-point vibrational energy, to ~50% for heat-capacity, electron-spread, and polarizability, and by more than ~20% for electronic properties such as frontier orbital eigenvalues or dipole-moments. We discuss and present optimized training sets consisting of 10 molecular classes for all molecular properties studied. We show that these classes can be used to design improved training sets for the generation of machine learning models of the same properties in similar but unrelated molecular sets.

preprint 2016 arXiv

Machine Learning Energies of 2 M Elpasolite (ABC$_2$D$_6$) Crystals

Elpasolite is the predominant quaternary crystal structure (AlNaK$_2$F$_6$ prototype) reported in the Inorganic Crystal Structure Database. We have developed a machine learning model to calculate density functional theory quality formation energies of all $\sim$2 M pristine ABC$_2$D$_6$ elpasolite crystals which can be made up from main-group elements (up to bismuth). Our model's accuracy can be improved systematically, reaching 0.1 eV/atom for a training set consisting of 10 k crystals. Important bonding trends are revealed: fluoride is best suited to fit the coordination of the D site, which lowers the formation energy, whereas the opposite is found for carbon. The bonding contribution of elements A and B is very small on average. Low formation energies result from A and B being late elements from group (II), C being a late (I) element, and D being fluoride. Out of 2 M crystals, 90 unique structures are predicted to be on the convex hull---among which NFAl$_2$Ca$_6$, with peculiar stoichiometry and a negative atomic oxidation state for Al.

preprint 2016 arXiv

Machine Learning, Quantum Mechanics, and Chemical Compound Space

We review recent studies dealing with the generation of machine learning models of molecular and solid properties. The models are trained and validated using standard quantum chemistry results obtained for organic molecules and materials selected from chemical space at random.

preprint 2016 arXiv

Rapid yet accurate first principle based predictions of alkali halide crystal phases using alchemical perturbation

We assess the predictive power of alchemical perturbations for estimating fundamental properties in ionic crystals. Using density functional theory we have calculated formation energies, lattice constants, and bulk moduli for all sixteen iso-valence-electronic combinations of pure pristine alkali halides involving elements $A \in \{$Na, K, Rb, Cs$\}$ and $X \in \{$F, Cl, Br, I$\}$. For rock salt, zincblende and cesium chloride symmetry, alchemical Hellmann-Feynman derivatives, evaluated along lattice scans of sixteen reference crystals, have been obtained for all respective 16$\times$15 combinations of reference and predicted target crystals. Mean absolute errors (MAE) are on par with density functional theory level of accuracy for energies and bulk modulus. Predicted lattice constants are less accurate. NaCl is the best reference salt for alchemical estimates of relative energies (MAE $<$ 40 meV/atom) while alkali fluorides are the worst. By contrast, lattice constants are predicted best using NaF as a reference salt (MAE $<$ 0.5Å), yielding only semi-quantitative accuracy. The best reference salt for the prediction of bulk moduli is CsCl (MAE $<$ 0.4$\times$10$^{11}$ dynes/cm$^2$). Alchemical derivatives can also be used to predict competing rock salt and cesium chloride phases in binary and ternary solid mixtures with CsCl. Alchemical predictions based on dispersion corrected density functional theory with pure RbI as a reference salt reproduce reasonably well the reversal of the rock salt/cesium chloride stability trend for binary $(AX)_{1-x}$CsCl$_x$ as well as for ternary $(AX)_{0.5-0.5x}(BY)_{0.5-0.5x}$CsCl$_x$ mixtures.

preprint 2016 arXiv

Tuning dissociation using isoelectronically doped graphene and hexagonal boron nitride: water and other small molecules

Novel uses for 2-dimensional materials like graphene and hexagonal boron nitride (h-BN) are being frequently discovered, especially for membrane and catalysis applications. Still, a great deal remains to be understood about the interaction of environmentally and industrially relevant molecules such as water with these materials. Taking inspiration from advances in hybridising graphene and h-BN, we explore, using density functional theory, the dissociation of water, hydrogen, methane, and methanol on graphene, h-BN, and their isoelectronic doped counterparts: BN doped graphene and C doped h-BN. We find that doped surfaces are considerably more reactive than their pristine counterparts and by comparing the reactivity of several small molecules we develop a general framework for dissociative adsorption. From this a particularly attractive consequence of isoelectronic doping emerges: substrates can be doped to enhance their reactivity specifically towards either polar or non-polar adsorbates. As such, these substrates are potentially viable candidates for selective catalysts and membranes, with the implication that a range of tuneable materials can be designed.

preprint2016arXiv

Understanding molecular representations in machine learning: The role of uniqueness and target similarity

The predictive accuracy of Machine Learning (ML) models of molecular properties depends on the choice of the molecular representation. Based on the postulates of quantum mechanics, we introduce a hierarchy of representations which meet uniqueness and target similarity criteria. To systematically control target similarity, we rely on interatomic many body expansions, as implemented in universal force-fields, including Bonding, Angular, and higher order terms (BA). Addition of higher order contributions systematically increases similarity to the true potential energy and predictive accuracy of the resulting ML models. We report numerical evidence for the performance of BAML models trained on molecular properties pre-calculated at electron-correlated and density functional levels of theory for thousands of small organic molecules. Properties studied include enthalpies and free energies of atomization, heat capacity, zero-point vibrational energies, dipole moment, polarizability, HOMO/LUMO energies and gap, ionization potential, electron affinity, and electronic excitations. After training, BAML predicts energies or electronic properties of out-of-sample molecules with unprecedented accuracy and speed.

preprint2016arXiv

Water on BN doped benzene: A hard test for exchange-correlation functionals and the impact of exact exchange on weak binding

Density functional theory (DFT) studies of weakly interacting complexes have recently focused on the importance of van der Waals dispersion forces, whereas the role of exchange has received far less attention. Here, by exploiting the subtle binding between water and a boron and nitrogen doped benzene derivative (1,2-azaborine), we show how exact exchange can alter the binding conformation within a complex. Benchmark values have been calculated for three orientations of the water monomer on 1,2-azaborine from explicitly correlated quantum chemical methods, and we have also used diffusion quantum Monte Carlo. For a host of popular DFT exchange-correlation functionals we show that the lack of exact exchange leads to the wrong lowest energy orientation of water on 1,2-azaborine. As such, we suggest that a high proportion of exact exchange and the associated improvement in the electronic structure could be needed for the accurate prediction of physisorption sites on doped surfaces and in complex organic molecules. Meanwhile, to predict correct absolute interaction energies, an accurate description of exchange needs to be augmented by dispersion inclusive functionals; certain non-local van der Waals functionals (optB88- and optB86b-vdW) perform very well for absolute interaction energies. Through a comparison with water on benzene and borazine (B$_3$N$_3$H$_6$) we show that these results could have implications for the interaction of water with doped graphene surfaces, and suggest a possible way of tuning the interaction energy.

preprint2015arXiv

Crystal Structure Representations for Machine Learning Models of Formation Energies

We introduce and evaluate a set of feature vector representations of crystal structures for machine learning (ML) models of formation energies of solids. ML models of atomization energies of organic molecules have been successful using a Coulomb matrix representation of the molecule. We consider three ways to generalize such representations to periodic systems: (i) a matrix where each element is related to the Ewald sum of the electrostatic interaction between two different atoms in the unit cell repeated over the lattice; (ii) an extended Coulomb-like matrix that takes into account a number of neighboring unit cells; and (iii) an Ansatz that mimics the periodicity and the basic features of the elements in the Ewald sum matrix by using a sine function of the crystal coordinates of the atoms. The representations are compared for a Laplacian kernel with Manhattan norm, trained to reproduce formation energies using a data set of 3938 crystal structures obtained from the Materials Project. For training sets consisting of 3000 crystals, the generalization error in predicting formation energies of new structures corresponds to (i) 0.49, (ii) 0.64, and (iii) 0.37 eV/atom for the respective representations.
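
The kernel named in this abstract can be illustrated with a minimal sketch, assuming kernel ridge regression as the learner (the abstract specifies only a Laplacian kernel with Manhattan norm). The function names, the toy feature vectors, and the values of `sigma` and `lam` are hypothetical placeholders, not taken from the paper.

```python
import numpy as np

def laplacian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||_1 / sigma) for every row of A against every row of B."""
    # Pairwise Manhattan (L1) distances, shape (len(A), len(B)).
    d1 = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-d1 / sigma)

def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression weights alpha."""
    K = laplacian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    """Predict as a kernel-weighted sum over the training points."""
    return laplacian_kernel(X_new, X_train, sigma) @ alpha

# Toy data standing in for crystal feature vectors and formation energies
# (assumption: random numbers, not Materials Project data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = np.abs(X).sum(axis=1)

alpha = krr_fit(X, y, sigma=10.0, lam=1e-8)
y_fit = krr_predict(X, alpha, X, sigma=10.0)
```

In practice `sigma` and `lam` would be chosen by cross-validation, and any of the three crystal representations could supply the rows of `X`.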

preprint2015arXiv

Electronic Spectra from TDDFT and Machine Learning in Chemical Space

Due to its favorable computational efficiency, time-dependent (TD) density functional theory (DFT) enables the prediction of electronic spectra in a high-throughput manner across chemical space. Its predictions, however, can be quite inaccurate. We resolve this issue with machine learning models trained on deviations of reference second-order approximate coupled-cluster singles and doubles (CC2) spectra from TDDFT counterparts, or even from the DFT gap. We applied this approach to low-lying singlet-singlet vertical electronic spectra of over 20 thousand synthetically feasible small organic molecules with up to eight CONF atoms. The prediction errors decay monotonically as a function of training set size. For a training set of 10 thousand molecules, CC2 excitation energies can be reproduced to within $\pm$0.1 eV for the remaining molecules. Analysis of our spectral database via chromophore counting suggests that even higher accuracies can be achieved. Based on the evidence collected, we discuss open challenges associated with data-driven modeling of high-lying spectra and transition intensities.
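
The correction scheme described in this abstract can be written compactly. This is a sketch in notation assumed here (the symbols $E$, $\Delta$, $M$ and $N$ do not appear in the abstract): the model is trained on the deviation between CC2 and TDDFT excitation energies for $N$ training molecules, and a prediction adds the learned deviation back onto the inexpensive baseline.

```latex
\begin{align}
\Delta(M_i) &= E^{\mathrm{CC2}}(M_i) - E^{\mathrm{TDDFT}}(M_i), \qquad i = 1, \dots, N, \\
E^{\mathrm{CC2}}(M) &\approx E^{\mathrm{TDDFT}}(M) + \Delta^{\mathrm{ML}}(M).
\end{align}
```

Replacing $E^{\mathrm{TDDFT}}$ with the DFT gap gives the cheaper baseline variant that the abstract also mentions.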

preprint2015arXiv

Fourier series of atomic radial distribution functions: A molecular fingerprint for machine learning models of quantum chemical properties

We introduce a fingerprint representation of molecules based on a Fourier series of atomic radial distribution functions. This fingerprint is unique (except for chirality), continuous, and differentiable with respect to atomic coordinates and nuclear charges. It is invariant with respect to translation, rotation, and nuclear permutation, and requires no pre-conceived knowledge about chemical bonding, topology, or electronic orbitals. As such it meets many important criteria for a good molecular representation, suggesting its usefulness for machine learning models of molecular properties trained across chemical compound space. To assess the performance of this new descriptor we have trained machine learning models of molecular enthalpies of atomization for training sets with up to 10k organic molecules, drawn at random from a published set of 134k organic molecules. We validate the descriptor on all remaining molecules of the 134k set. For a training set of 5k molecules the fingerprint descriptor achieves a mean absolute error of 8.0 kcal/mol. This is slightly worse than the performance attained using the Coulomb matrix, another popular alternative, which reaches 6.2 kcal/mol for the same training and test sets.

preprint2015arXiv

Machine learning for many-body physics: efficient solution of dynamical mean-field theory

Machine learning methods for solving the equations of dynamical mean-field theory are developed. The method is demonstrated on the three dimensional Hubbard model. The key technical issues are defining a mapping of an input function to an output function, and distinguishing metallic from insulating solutions. Both metallic and Mott insulator solutions can be predicted. The validity of the machine learning scheme is assessed by comparing predictions of full correlation functions, of quasi-particle weight and particle density to values directly computed. The results indicate that, with modest further development, the machine learning approach may be an attractive, computationally efficient option for real materials predictions for strongly correlated systems.

preprint2015arXiv

Machine Learning for Quantum Mechanical Properties of Atoms in Molecules

We introduce machine learning models of quantum mechanical observables of atoms in molecules. Instant out-of-sample predictions for proton and carbon nuclear chemical shifts, atomic core level excitations, and forces on atoms reach accuracies on par with density functional theory reference. Locality is exploited within non-linear regression via local atom-centered coordinate systems. The approach is validated on a diverse set of 9k small organic molecules. Linear scaling of computational cost in system size is demonstrated for saturated polymers with up to sub-mesoscale lengths.

preprint2015arXiv

Many Molecular Properties from One Kernel in Chemical Space

We introduce property-independent kernels for machine learning modeling of arbitrarily many molecular properties. The kernels encode molecular structures for training sets of varying size, as well as similarity measures sufficiently diffuse in chemical space to sample over all training molecules. Provided the corresponding molecular reference properties are available, they enable the instantaneous generation of ML models which can systematically be improved through the addition of more data. This idea is exemplified for single kernel based modeling of internal energy, enthalpy, free energy, heat capacity, polarizability, electronic spread, zero-point vibrational energy, energies of frontier orbitals, HOMO-LUMO gap, and the highest fundamental vibrational wavenumber. Models of these properties are trained and tested using 112k organic molecules of similar size. Resulting models are discussed as well as the kernels' use for generating and using other property models.
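
One way to read "property-independent" is that the kernel matrix depends only on the molecules, so a single linear solve can serve every property at once. The sketch below assumes kernel ridge regression with a Gaussian kernel; the kernel choice, the names, the regularization value, and the random toy data are illustrative assumptions rather than details from the abstract.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||_2^2 / (2 sigma^2)) for all row pairs of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))   # toy molecular representations (assumption)
Y = rng.normal(size=(40, 3))   # three toy properties, one per column (assumption)

# The regularized kernel matrix is built once and is the same for every property.
K = gaussian_kernel(X, X, sigma=2.0) + 1e-6 * np.eye(len(X))

# One solve with a multi-column right-hand side yields one weight vector per property.
alphas = np.linalg.solve(K, Y)   # shape (40, 3)

# Predictions for new inputs reuse the same kernel row for all properties.
X_new = rng.normal(size=(4, 5))
Y_pred = gaussian_kernel(X_new, X, sigma=2.0) @ alphas   # shape (4, 3)
```

Adding a further property then only requires another column in `Y`, not a new kernel.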

preprint2015arXiv

Quantum Mechanical Treatment of Variable Molecular Composition: From "Alchemical" Changes of State Functions to Rational Compound Design

"Alchemical" interpolation paths, i.e. coupling systems along fictitious paths without realistic correspondence, are frequently used within materials and molecular modeling and simulation protocols for the estimation of relative changes in state functions such as free energies. We discuss alchemical changes in the context of quantum chemistry, and present illustrative numerical results for the changes of HOMO eigenvalues of the He atom due to a linear alchemical teleportation: the simultaneous annihilation and creation of nuclear charges at different locations. To demonstrate the predictive power of alchemical first order derivatives (Hellmann-Feynman), the covalent bond potential of hydrogen fluoride and hydrogen chloride is investigated, as well as the van-der-Waals binding in the water-water and water-hydrogen fluoride dimer, respectively. Based on converged electron densities for one configuration, the versatility of alchemical derivatives is exemplified for the screening of entire binding potentials with reasonable accuracy. Finally, we discuss constraints for the identification of non-linear coupling potentials for which the energy's Hellmann-Feynman derivative will yield accurate predictions.

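The first-order alchemical estimate referred to in this abstract follows from the Hellmann-Feynman theorem. As a sketch in assumed notation (the symbols $\lambda$, $\hat{H}_A$, $\hat{H}_B$, $v$ and $\rho$ are not in the abstract), consider a linear coupling between a reference system $A$ and a target system $B$ with the same number of electrons:

```latex
\begin{align}
\hat{H}(\lambda) &= \hat{H}_A + \lambda\,(\hat{H}_B - \hat{H}_A), \qquad 0 \le \lambda \le 1, \\
\frac{\partial E}{\partial \lambda} &= \langle \Psi_\lambda | \hat{H}_B - \hat{H}_A | \Psi_\lambda \rangle
  = \int \mathrm{d}\mathbf{r}\; \rho_\lambda(\mathbf{r})\,\bigl[ v_B(\mathbf{r}) - v_A(\mathbf{r}) \bigr], \\
E_B &\approx E_A + \int \mathrm{d}\mathbf{r}\; \rho_A(\mathbf{r})\,\bigl[ v_B(\mathbf{r}) - v_A(\mathbf{r}) \bigr].
\end{align}
```

Here $E$ is the electronic energy and $v$ the external (nuclear) potential; the nuclear repulsion is a constant for each system and is added separately. The last line evaluates the derivative at $\lambda = 0$, so it needs only the converged electron density of the reference system.
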
preprint2015arXiv

Water on hexagonal boron nitride from diffusion Monte Carlo

Despite a recent flurry of experimental and simulation studies, an accurate estimate of the interaction strength of water molecules with hexagonal boron nitride is lacking. Here we report quantum Monte Carlo results for the adsorption of a water monomer on a periodic hexagonal boron nitride sheet, which yield a water monomer interaction energy of $-84 \pm 5$ meV. We use the results to evaluate the performance of several widely used density functional theory (DFT) exchange-correlation functionals, and find that they all deviate substantially. Differences in interaction energies between different adsorption sites are, however, better reproduced by DFT.

preprint2014arXiv

Machine learning for many-body physics: The case of the Anderson impurity model

Machine learning methods are applied to finding the Green's function of the Anderson impurity model, a basic model system of quantum many-body condensed-matter physics. Different methods of parametrizing the Green's function are investigated; a representation in terms of Legendre polynomials is found to be superior due to its limited number of coefficients and its applicability to state-of-the-art methods of solution. The dependence of the errors on the size of the training set is determined. The results indicate that a machine learning approach to dynamical mean-field theory may be feasible.
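
A compact Legendre parametrization can be sketched as follows. The example assumes an imaginary-time Green's function $G(\tau)$ on $[0, \beta]$ and uses the non-interacting single-level form as a toy input; the model function, the values of `beta` and `eps`, and the truncation order are illustrative assumptions, not taken from the paper. The coefficients are obtained here by a least-squares fit with `numpy.polynomial.legendre`.

```python
import numpy as np
from numpy.polynomial import legendre

beta = 10.0   # inverse temperature (assumed value)
eps = 0.5     # energy of a single non-interacting level (assumed value)

# Imaginary-time grid and the toy Green's function of one fermionic level.
tau = np.linspace(0.0, beta, 201)
G = -np.exp(-eps * tau) / (1.0 + np.exp(-eps * beta))

# Legendre polynomials live on [-1, 1], so map tau in [0, beta] onto x in [-1, 1].
x = 2.0 * tau / beta - 1.0

# Thirteen coefficients (degree 12) stand in for the 201 grid values.
coeffs = legendre.legfit(x, G, deg=12)

# Reconstruct G on the grid from the coefficients alone.
G_rec = legendre.legval(x, coeffs)
max_err = np.max(np.abs(G_rec - G))
```

A regression model would then learn the short coefficient vector instead of the full function on the grid.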

preprint2014arXiv

Modeling Electronic Quantum Transport with Machine Learning

We present a Machine Learning approach to solve electronic quantum transport equations of one-dimensional nanostructures. The transmission coefficients of disordered systems were computed to provide training and test datasets to the machine. The system's representation encodes energetic as well as geometrical information to characterize similarities between disordered configurations, while the Euclidean norm is used as a measure of similarity. Errors for out-of-sample predictions systematically decrease with training set size, enabling the accurate and fast prediction of new transmission coefficients. The remarkable performance of our model to capture the complexity of interference phenomena lends further support to its viability in dealing with transport problems of undulatory nature.

preprint2014arXiv

Toward transferable interatomic van der Waals interactions without electrons: The role of multipole electrostatics and many-body dispersion

We estimate polarizabilities of atoms in molecules without electron density, using a Voronoi tessellation approach instead of conventional density partitioning schemes. The resulting atomic dispersion coefficients are calculated, as well as many-body dispersion effects on intermolecular potential energies. We also estimate contributions from multipole electrostatics and compare them to dispersion. We assess the performance of the resulting intermolecular interaction model from dispersion and electrostatics for more than 1,300 neutral and charged, small organic molecular dimers. Applications to water clusters, the benzene crystal, the anti-cancer drug ellipticine intercalated between two Watson-Crick DNA base pairs, and six macro-molecular host-guest complexes highlight the potential of this method and help to identify points of future improvement. The mean absolute error made by the combination of static electrostatics with many-body dispersion reduces at larger distances, while it plateaus for two-body dispersion, in conflict with the common assumption that the simple $1/R^6$ correction will yield proper dissociative tails. Overall, the method achieves an accuracy well within conventional molecular force fields while exhibiting a simple parametrization protocol.

preprint2013arXiv

Force correcting atom centered potentials for generalized gradient approximated density functional theory: Approaching hybrid functional accuracy for geometries and harmonic frequencies in small chlorofluorocarbons

Generalized gradient approximated (GGA) density functional theory (DFT) typically overestimates polarizability and bond-lengths, and underestimates force constants of covalent bonds. To overcome this problem we show that one can use empirical force correcting atom centered potentials (FCACPs), parameterized for every nuclear species. Parameters are obtained through minimization of a penalty functional that explicitly encodes hybrid DFT forces and static polarizabilities of reference molecules. For hydrogen, fluorine, chlorine, and carbon the respective reference molecules consist of H$_2$, F$_2$, Cl$_2$, and CH$_4$. The transferability of this approach is assessed for harmonic frequencies in a small set of chlorofluorocarbon molecules. Numerical evidence, gathered for CF$_4$, CCl$_4$, CCl$_3$F, CCl$_2$F$_2$, CClF$_3$, ClF, HF, HCl, CFH$_3$, CF$_2$H$_2$, CF$_3$H, CHCl$_3$, CH$_2$Cl$_2$, CH$_3$Cl indicates that the GGA+FCACP level of theory yields harmonic frequencies that are significantly more consistent with hybrid DFT values, as well as slightly reduced molecular polarizability.

preprint2013arXiv

Machine Learning of Molecular Electronic Properties in Chemical Compound Space

The combination of modern scientific computing with electronic structure theory can lead to an unprecedented amount of data amenable to intelligent data analysis for the identification of meaningful, novel, and predictive structure-property relationships. Such relationships enable high-throughput screening for relevant properties in an exponentially growing pool of virtual compounds that are synthetically accessible. Here, we present a machine learning (ML) model, trained on a database of ab initio calculation results for thousands of organic molecules, that simultaneously predicts multiple electronic ground- and excited-state properties. The properties include atomization energy, polarizability, frontier orbital eigenvalues, ionization potential, electron affinity, and excitation energies. The ML model is based on a deep multi-task artificial neural network, exploiting underlying correlations between various molecular properties. The input is identical to ab initio methods, i.e. nuclear charges and Cartesian coordinates of all atoms. For small organic molecules the accuracy of such a "Quantum Machine" is similar, and sometimes superior, to modern quantum-chemical methods, at negligible computational cost.

preprint2011arXiv

Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning

We introduce a machine learning model to predict atomization energies of a diverse set of organic molecules, based on nuclear charges and atomic positions only. The problem of solving the molecular Schrödinger equation is mapped onto a non-linear statistical regression problem of reduced complexity. Regression models are trained on and compared to atomization energies computed with hybrid density-functional theory. Cross-validation over more than seven thousand small organic molecules yields a mean absolute error of ~10 kcal/mol. Applicability is demonstrated for the prediction of molecular atomization potential energy curves.
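
A representation built from nuclear charges and atomic positions only can be sketched as below. This abstract does not name the representation; the sketch assumes the Coulomb matrix mentioned in other abstracts on this page, with the commonly quoted diagonal term $0.5\,Z^{2.4}$ and off-diagonal Coulomb repulsion $Z_i Z_j / |R_i - R_j|$. The function name and the water-like toy geometry (in atomic units) are illustrative assumptions.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Build an N x N matrix from nuclear charges Z (N,) and positions R (N, 3)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    # Pairwise interatomic distances.
    D = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=2)
    # Placeholder on the diagonal so the division below is well defined.
    np.fill_diagonal(D, 1.0)
    # Off-diagonal elements: Coulomb repulsion between nuclei i and j.
    M = np.outer(Z, Z) / D
    # Diagonal elements: 0.5 * Z^2.4 (assumed convention).
    np.fill_diagonal(M, 0.5 * Z ** 2.4)
    return M

# Toy geometry: one oxygen and two hydrogens (illustrative coordinates).
Z = [8, 1, 1]
R = [[0.0, 0.0, 0.0], [0.0, 1.43, 1.11], [0.0, -1.43, 1.11]]
M = coulomb_matrix(Z, R)
```

The matrix is symmetric and invariant to translation and rotation, but it depends on atom ordering, so a regression model would typically use a sorted or otherwise ordering-invariant form.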

39 published item(s)

Ab initio machine learning of phase space averages

Alchemical geometry relaxation

An orbital-based representation for accurate Quantum Machine Learning

Elucidating atmospheric brown carbon -- Supplanting chemical intuition with exhaustive enumeration and machine learning

Simplifying inverse material design problems for fixed lattices with alchemical chirality

Data Enhanced Reaction Predictions in Chemical Space With Hammett's Equation

Dictionary of 140k GDB and ZINC derived AMONs

FCHL revisited: faster and more accurate quantum machine learning

Large yet bounded: Spin gap ranges in carbenes

On the role of gradients for machine learning of molecular energies and forces

Quantum machine learning using atom-in-molecule-based fragments selected on-the-fly

Quantum-chemistry-aided identification, synthesis and experimental validation of model systems for conformationally controlled reaction studies: Separation of the conformers of 2,3-dibromobuta-1,3-diene in the gas phase

Thousands of reactants and transition states for competing E2 and S$_\text{N}$2 reactions

Alchemical perturbation density functional theory (APDFT)

Atoms in molecules from alchemical perturbation density functional theory

Machine learning the computational cost of quantum chemistry

Machine learning prediction errors better than DFT accuracy

Fast and accurate predictions of covalent bonds in chemical space

Genetic optimization of training sets for improved machine learning models of molecular properties

Machine Learning Energies of 2 M Elpasolite (ABC$_2$D$_6$) Crystals

Machine Learning, Quantum Mechanics, and Chemical Compound Space

Rapid yet accurate first principle based predictions of alkali halide crystal phases using alchemical perturbation

Tuning dissociation using isoelectronically doped graphene and hexagonal boron nitride: water and other small molecules

Understanding molecular representations in machine learning: The role of uniqueness and target similarity

Water on BN doped benzene: A hard test for exchange-correlation functionals and the impact of exact exchange on weak binding

Crystal Structure Representations for Machine Learning Models of Formation Energies

Electronic Spectra from TDDFT and Machine Learning in Chemical Space

Fourier series of atomic radial distribution functions: A molecular fingerprint for machine learning models of quantum chemical properties

Machine learning for many-body physics: efficient solution of dynamical mean-field theory

Machine Learning for Quantum Mechanical Properties of Atoms in Molecules

Many Molecular Properties from One Kernel in Chemical Space

Quantum Mechanical Treatment of Variable Molecular Composition: From "Alchemical" Changes of State Functions to Rational Compound Design

Water on hexagonal boron nitride from diffusion Monte Carlo

Machine learning for many-body physics: The case of the Anderson impurity model

Modeling Electronic Quantum Transport with Machine Learning

Toward transferable interatomic van der Waals interactions without electrons: The role of multipole electrostatics and many-body dispersion

Force correcting atom centered potentials for generalized gradient approximated density functional theory: Approaching hybrid functional accuracy for geometries and harmonic frequencies in small chlorofluorocarbons

Machine Learning of Molecular Electronic Properties in Chemical Compound Space

Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning