Source author record

Simona Cocco

Simona Cocco appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

cond-mat.stat-mech Biomolecules Biological Physics Machine Learning Quantitative Methods cond-mat.dis-nn Neurons and Cognition cond-mat.soft Methodology Molecular Networks physics.data-an

Catalog footprint

What is connected

18works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

Energy-based models (EBMs) are flexible generative architectures inspired by statistical physics, but their learning and generative properties remain poorly understood. Here, we analyze a solvable EBM in the high-dimensional limit: the spherical Boltzmann machine (SBM). Combining tools from random matrix theory and dynamical mean-field theory, we: solve exact equations describing the training dynamics of the SBM; compute the Bayesian evidence, which acts as a partition function in parameter space and encodes global properties of the trained model; and uncover cascades of phase transitions that occur both during training and as a function of hyperparameters, related to successive alignment and condensation of the top modes of the coupling matrix to the data. We connect these transitions to sampling-time generative phenomena in a teacher-student scenario, including: sampling temperature tuning, double descent as a function of regularization strength, tempered posterior effects, and out-of-equilibrium effects during training that induce biases in the trained model. We provide numerical evidence demonstrating that all these phenomena appear in standard generative architectures, beyond the SBM.

preprint2022arXiv

Statistical-physics approaches to RNA molecules, families and networks

This contribution focuses on the fascinating RNA molecule, its sequence-dependent folding driven by base-pairing interactions, the interplay between these interactions and natural evolution, and its multiple regulatory roles. The four of us have dug into these topics using the tools and the spirit of the statistical physics of disordered systems, and in particular the concept of a disordered (energy/fitness) landscape. After an introduction to RNA molecules and the perspectives they open not only in evolutionary and synthetic biology but also in medicine, we will introduce the important notions of energy and fitness landscapes for these molecules. In Section III we will review some models and algorithms for RNA sequence-to-secondary-structure mapping. Section IV discusses how the secondary-structure energy landscape can be derived from unzipping data. Section V deals with the inference of RNA structure from evolutionary sequence data sampled in different organisms. This will shift the focus from the `sequence-to-structure' mapping described in Section III to a `sequence-to-function' landscape that can be inferred from laboratory evolutionary data on DNA aptamers. Finally, in Section VI, we shall discuss the rich theoretical picture linking networks of interacting RNA molecules to the organization of robust, systemic regulatory programs. Along this path, we will therefore explore phenomena across multiple scales in space, number of molecules and time, showing how the biological complexity of the RNA world can be captured by the unifying concepts of statistical physics.

preprint2020arXiv

Gaussian Closure Scheme in the Quasi-Linkage Equilibrium Regime of Evolving Genome Populations

Describing the evolution of a population of genomes evolving in a complex fitness landscape is generally very hard. We here introduce an approximate Gaussian closure scheme to characterize analytically the statistics of a genomic population in the so-called Quasi--Linkage Equilibrium (QLE) regime, applicable to generic values of the rates of mutation or recombination and fitness functions. The Gaussian approximation is illustrated on a short-range fitness landscape with two far away and competing maxima. It unveils the existence of a phase transition from a broad to a polarized distribution of genomes as the strength of epistatic couplings is increased, characterized by slow coarsening dynamics of competing allele domains. Results of the closure scheme are corroborated by numerical simulations.

preprint2020arXiv

Inference of compressed Potts graphical models

We consider the problem of inferring a graphical Potts model on a population of variables, with a non-uniform number of Potts colors (symbols) across variables. This inverse Potts problem generally involves the inference of a large number of parameters, often larger than the number of available data, and, hence, requires the introduction of regularization. We study here a double regularization scheme, in which the number of colors available to each variable is reduced, and interaction networks are made sparse. To achieve this color compression scheme, only Potts states with large empirical frequency (exceeding some threshold) are explicitly modeled on each site, while the others are grouped into a single state. We benchmark the performances of this mixed regularization approach, with two inference algorithms, the Adaptive Cluster Expansion (ACE) and the PseudoLikelihood Maximization (PLM) on synthetic data obtained by sampling disordered Potts models on an Erdos-Renyi random graphs. We show in particular that color compression does not affect the quality of reconstruction of the parameters corresponding to high-frequency symbols, while drastically reducing the number of the other parameters and thus the computational time. Our procedure is also applied to multi-sequence alignments of protein families, with similar results.

preprint2019arXiv

'Place-cell' emergence and learning of invariant data with restricted Boltzmann machines: breaking and dynamical restoration of continuous symmetries in the weight space

Distributions of data or sensory stimuli often enjoy underlying invariances. How and to what extent those symmetries are captured by unsupervised learning methods is a relevant question in machine learning and in computational neuroscience. We study here, through a combination of numerical and analytical tools, the learning dynamics of Restricted Boltzmann Machines (RBM), a neural network paradigm for representation learning. As learning proceeds from a random configuration of the network weights, we show the existence of, and characterize a symmetry-breaking phenomenon, in which the latent variables acquire receptive fields focusing on limited parts of the invariant manifold supporting the data. The symmetry is restored at large learning times through the diffusion of the receptive field over the invariant manifold; hence, the RBM effectively spans a continuous attractor in the space of network weights. This symmetry-breaking phenomenon takes place only if the amount of data available for training exceeds some critical value, depending on the network size and the intensity of symmetry-induced correlations in the data; below this 'retarded-learning' threshold, the network weights are essentially noisy and overfit the data.

preprint2016arXiv

Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models

Inverse statistical approaches to determine protein structure and function from Multiple Sequence Alignments (MSA) are emerging as powerful tools in computational biology. However the underlying assumptions of the relationship between the inferred effective Potts Hamiltonian and real protein structure and energetics remain untested so far. Here we use lattice protein model (LP) to benchmark those inverse statistical approaches. We build MSA of highly stable sequences in target LP structures, and infer the effective pairwise Potts Hamiltonians from those MSA. We find that inferred Potts Hamiltonians reproduce many important aspects of 'true' LP structures and energetics. Careful analysis reveals that effective pairwise couplings in inferred Potts Hamiltonians depend not only on the energetics of the native structure but also on competing folds; in particular, the coupling values reflect both positive design (stabilization of native conformation) and negative design (destabilization of competing folds). In addition to providing detailed structural information, the inferred Potts models used as protein Hamiltonian for design of new sequences are able to generate with high probability completely new sequences with the desired folds, which is not possible using independent-site models. Those are remarkable results as the effective LP Hamiltonians used to generate MSA are not simple pairwise models due to the competition between the folds. Our findings elucidate the reasons for the success of inverse approaches to the modelling of proteins from sequence data, and their limitations.

preprint2015arXiv

Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction

Despite the biological importance of non-coding RNA, their structural characterization remains challenging. Making use of the rapidly growing sequence databases, we analyze nucleotide coevolution across homologous sequences via Direct-Coupling Analysis to detect nucleotide-nucleotide contacts. For a representative set of riboswitches, we show that the results of Direct-Coupling Analysis in combination with a generalized Nussinov algorithm systematically improve the results of RNA secondary structure prediction beyond traditional covariance approaches based on mutual information. Even more importantly, we show that the results of Direct-Coupling Analysis are enriched in tertiary structure contacts. By integrating these predictions into molecular modeling tools, systematically improved tertiary structure predictions can be obtained, as compared to using secondary structure information alone.

preprint2015arXiv

Learning probabilities from random observables in high dimensions: the maximum entropy distribution and others

We consider the problem of learning a target probability distribution over a set of $N$ binary variables from the knowledge of the expectation values (with this target distribution) of $M$ observables, drawn uniformly at random. The space of all probability distributions compatible with these $M$ expectation values within some fixed accuracy, called version space, is studied. We introduce a biased measure over the version space, which gives a boost increasing exponentially with the entropy of the distributions and with an arbitrary inverse `temperature' $Γ$. The choice of $Γ$ allows us to interpolate smoothly between the unbiased measure over all distributions in the version space ($Γ=0$) and the pointwise measure concentrated at the maximum entropy distribution ($Γ\to \infty$). Using the replica method we compute the volume of the version space and other quantities of interest, such as the distance $R$ between the target distribution and the center-of-mass distribution over the version space, as functions of $α=(\log M)/N$ and $Γ$ for large $N$. Phase transitions at critical values of $α$ are found, corresponding to qualitative improvements in the learning of the target distribution and to the decrease of the distance $R$. However, for fixed $α$, the distance $R$ does not vary with $Γ$, which means that the maximum entropy distribution is not closer to the target distribution than any other distribution compatible with the observable values. Our results are confirmed by Monte Carlo sampling of the version space for small system sizes ($N\le 10$).

preprint2015arXiv

On the entropy of protein families

Proteins are essential components of living systems, capable of performing a huge variety of tasks at the molecular level, such as recognition, signalling, copy, transport, ... The protein sequences realizing a given function may largely vary across organisms, giving rise to a protein family. Here, we estimate the entropy of those families based on different approaches, including Hidden Markov Models used for protein databases and inferred statistical models reproducing the low-order (1-and 2-point) statistics of multi-sequence alignments. We also compute the entropic cost, that is, the loss in entropy resulting from a constraint acting on the protein, such as the fixation of one particular amino-acid on a specific site, and relate this notion to the escape probability of the HIV virus. The case of lattice proteins, for which the entropy can be computed exactly, allows us to provide another illustration of the concept of cost, due to the competition of different folds. The relevance of the entropy in relation to directed evolution experiments is stressed.

preprint2014arXiv

Stochastic Ratchet Mechanisms for Replacement of Proteins Bound to DNA

Experiments indicate that unbinding rates of proteins from DNA can depend on the concentration of proteins in nearby solution. Here we present a theory of multi-step replacement of DNA-bound proteins by solution-phase proteins. For four different kinetic scenarios we calculate the depen- dence of protein unbinding and replacement rates on solution protein concentration. We find (1) strong effects of progressive 'rezipping' of the solution-phase protein onto DNA sites liberated by 'unzipping' of the originally bound protein; (2) that a model in which solution-phase proteins bind non-specifically to DNA can describe experiments on exchanges between the non specific DNA- binding proteins Fis-Fis and Fis-HU; (3) that a binding specific model describes experiments on the exchange of CueR proteins on specific binding sites.

preprint2013arXiv

From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction

Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant 'patterns' of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.

preprint2011arXiv

Adaptive Cluster Expansion for Inferring Boltzmann Machines with Noisy Data

We introduce a procedure to infer the interactions among a set of binary variables, based on their sampled frequencies and pairwise correlations. The algorithm builds the clusters of variables contributing most to the entropy of the inferred Ising model, and rejects the small contributions due to the sampling noise. Our procedure successfully recovers benchmark Ising models even at criticality and in the low temperature phase, and is applied to neurobiological data.

preprint2011arXiv

Adaptive cluster expansion for the inverse Ising problem: convergence, algorithm and tests

We present a procedure to solve the inverse Ising problem, that is to find the interactions between a set of binary variables from the measure of their equilibrium correlations. The method consists in constructing and selecting specific clusters of variables, based on their contributions to the cross-entropy of the Ising model. Small contributions are discarded to avoid overfitting and to make the computation tractable. The properties of the cluster expansion and its performances on synthetic data are studied. To make the implementation easier we give the pseudo-code of the algorithm.

preprint2011arXiv

Fast Inference of Interactions in Assemblies of Stochastic Integrate-and-Fire Neurons from Spike Recordings

We present two Bayesian procedures to infer the interactions and external currents in an assembly of stochastic integrate-and-fire neurons from the recording of their spiking activity. The first procedure is based on the exact calculation of the most likely time courses of the neuron membrane potentials conditioned by the recorded spikes, and is exact for a vanishing noise variance and for an instantaneous synaptic integration. The second procedure takes into account the presence of fluctuations around the most likely time courses of the potentials, and can deal with moderate noise levels. The running time of both procedures is proportional to the number S of spikes multiplied by the squared number N of neurons. The algorithms are validated on synthetic data generated by networks with known couplings and currents. We also reanalyze previously published recordings of the activity of the salamander retina (including from 32 to 40 neurons, and from 65,000 to 170,000 spikes). We study the dependence of the inferred interactions on the membrane leaking time; the differences and similarities with the classical cross-correlation analysis are discussed.

preprint2011arXiv

High-Dimensional Inference with the generalized Hopfield Model: Principal Component Analysis and Corrections

We consider the problem of inferring the interactions between a set of N binary variables from the knowledge of their frequencies and pairwise correlations. The inference framework is based on the Hopfield model, a special case of the Ising model where the interaction matrix is defined through a set of patterns in the variable space, and is of rank much smaller than N. We show that Maximum Lik elihood inference is deeply related to Principal Component Analysis when the amp litude of the pattern components, xi, is negligible compared to N^1/2. Using techniques from statistical mechanics, we calculate the corrections to the patterns to the first order in xi/N^1/2. We stress that it is important to generalize the Hopfield model and include both attractive and repulsive patterns, to correctly infer networks with sparse and strong interactions. We present a simple geometrical criterion to decide how many attractive and repulsive patterns should be considered as a function of the sampling noise. We moreover discuss how many sampled configurations are required for a good inference, as a function of the system size, N and of the amplitude, xi. The inference approach is illustrated on synthetic and biological data.

preprint2011arXiv

On the trajectories and performance of Infotaxis, an information-based greedy search algorithm

We present a continuous-space version of Infotaxis, a search algorithm where a searcher greedily moves to maximize the gain in information about the position of the target to be found. Using a combination of analytical and numerical tools we study the nature of the trajectories in two and three dimensions. The probability that the search is successful and the running time of the search are estimated. A possible extension to non-greedy search is suggested.

preprint2008arXiv

Dynamical modelling of molecular constructions and setups for DNA unzipping

We present a dynamical model of DNA mechanical unzipping under the action of a force. The model includes the motion of the fork in the sequence-dependent landscape, the trap(s) acting on the bead(s), and the polymeric components of the molecular construction (unzipped single strands of DNA, and linkers). Different setups are considered to test the model, and the outcome of the simulations is compared to simpler dynamical models existing in the literature where polymers are assumed to be at equilibrium.

preprint2007arXiv

Inferring DNA sequences from mechanical unzipping data: the large-bandwidth case

The complementary strands of DNA molecules can be separated when stretched apart by a force; the unzipping signal is correlated to the base content of the sequence but is affected by thermal and instrumental noise. We consider here the ideal case where opening events are known to a very good time resolution (very large bandwidth), and study how the sequence can be reconstructed from the unzipping data. Our approach relies on the use of statistical Bayesian inference and of Viterbi decoding algorithm. Performances are studied numerically on Monte Carlo generated data, and analytically. We show how multiple unzippings of the same molecule may be exploited to improve the quality of the prediction, and calculate analytically the number of required unzippings as a function of the bandwidth, the sequence content, the elasticity parameters of the unzipped strands.

Simona Cocco

What is connected

Connect this record

See the researcher in context

Building this map preview

18 published item(s)

Spherical Boltzmann machines: a solvable theory of learning and generation in energy-based models

Statistical-physics approaches to RNA molecules, families and networks

Gaussian Closure Scheme in the Quasi-Linkage Equilibrium Regime of Evolving Genome Populations

Inference of compressed Potts graphical models

'Place-cell' emergence and learning of invariant data with restricted Boltzmann machines: breaking and dynamical restoration of continuous symmetries in the weight space

Benchmarking inverse statistical approaches for protein structure and design with exactly solvable models

Direct-Coupling Analysis of nucleotide coevolution facilitates RNA secondary and tertiary structure prediction

Learning probabilities from random observables in high dimensions: the maximum entropy distribution and others

On the entropy of protein families

Stochastic Ratchet Mechanisms for Replacement of Proteins Bound to DNA

From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction

Adaptive Cluster Expansion for Inferring Boltzmann Machines with Noisy Data

Adaptive cluster expansion for the inverse Ising problem: convergence, algorithm and tests

Fast Inference of Interactions in Assemblies of Stochastic Integrate-and-Fire Neurons from Spike Recordings

High-Dimensional Inference with the generalized Hopfield Model: Principal Component Analysis and Corrections

On the trajectories and performance of Infotaxis, an information-based greedy search algorithm

Dynamical modelling of molecular constructions and setups for DNA unzipping

Inferring DNA sequences from mechanical unzipping data: the large-bandwidth case