Source author record

Hong-Li Zeng

Hong-Li Zeng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: cond-mat.dis-nn, physics.data-an, Quantitative Methods, Biological Physics, Computation, Genomics, Machine Learning, Methodology, physics.app-ph, physics.comp-ph, physics.soc-ph, Populations and Evolution, stat.OT

Catalog footprint

9 works

13 topics

4 close collaborators

preprint, 2022, arXiv

Temporal epistasis inference from more than 3,500,000 SARS-CoV-2 Genomic Sequences

We use Direct Coupling Analysis (DCA) to determine epistatic interactions between loci of variability of the SARS-CoV-2 virus, segmenting genomes by month of sampling. We use full-length, high-quality genomes from the GISAID repository up to October 2021, in total over 3,500,000 genomes. We find that DCA terms are more stable over time than correlations, but nevertheless change over time as mutations disappear from the global population or reach fixation. Correlations are enriched for phylogenetic effects, and in particular for statistical dependencies at short genomic distances, while DCA brings out links at longer genomic distances. We discuss the validity of a DCA analysis under these conditions in terms of a transient Quasi-Linkage Equilibrium state. We identify putative epistatically interacting mutations involving loci in Spike.

preprint, 2020, arXiv

Global analysis of more than 50,000 SARS-Cov-2 genomes reveals epistasis between 8 viral genes

Genome-wide epistasis analysis is a powerful tool to infer gene interactions, which can guide drug and vaccine development and lead to a deeper understanding of microbial pathogenesis. We have considered all complete SARS-CoV-2 genomes deposited in the GISAID repository until four different cut-off dates, and used Direct Coupling Analysis together with an assumption of Quasi-Linkage Equilibrium to infer epistatic contributions to fitness from polymorphic loci. We find eight interactions, of which three are between pairs where one locus lies in gene ORF3a, both loci holding non-synonymous mutations. We also find interactions between two loci in gene nsp13, both holding non-synonymous mutations, and four interactions involving one locus holding a synonymous mutation. Altogether we infer interactions between loci in viral genes ORF3a and nsp2, nsp12 and nsp6, between ORF8 and nsp4, and between loci in genes nsp2, nsp13 and nsp14. The paper opens the prospect of using prominent epistatically linked pairs as a starting point to search for combinatorial weaknesses of recombinant viral pathogens.

preprint, 2020, arXiv

Inferring genetic fitness from genomic data

The genetic composition of a naturally developing population is considered to result from mutation, selection, genetic drift and recombination. Selection is modeled as single-locus terms (additive fitness) and two-loci terms (pairwise epistatic fitness). The problem posed is to infer epistatic fitness from population-wide whole-genome data from a time series of a developing population. We generate such data in silico, and show that in the Quasi-Linkage Equilibrium (QLE) phase of Kimura, Neher and Shraiman, which pertains at high enough recombination rates and low enough mutation rates, epistatic fitness can be inferred quantitatively correctly using inverse Ising/Potts methods.

preprint, 2020, arXiv

Inverse Ising techniques to infer underlying mechanisms from data

As a problem in data science, the inverse Ising (or Potts) problem is to infer the parameters of a Gibbs-Boltzmann distribution of an Ising (or Potts) model from samples drawn from that distribution. The algorithmic and computational interest stems from the fact that this inference task cannot be done efficiently by the maximum likelihood criterion, since the normalizing constant of the distribution (the partition function) cannot be calculated exactly and efficiently. The practical interest, on the other hand, flows from several outstanding applications, of which the best known has been predicting spatial contacts in protein structures from tables of homologous protein sequences. Most applications to date have been to data produced by a dynamical process which, as far as is known, cannot be expected to satisfy detailed balance. There is therefore no a priori reason to expect the distribution to be of the Gibbs-Boltzmann type, and no a priori reason to expect that inverse Ising (or Potts) techniques should yield useful information. In this review we discuss two types of problems where progress can nevertheless be made. We find that, depending on model parameters, there are phases where the distribution is in fact close to a Gibbs-Boltzmann distribution, the non-equilibrium nature of the underlying dynamics notwithstanding. We also discuss the relation between inferred Ising model parameters and parameters of the underlying dynamics.
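The inverse Ising problem described in this abstract can be illustrated with a minimal sketch. It assumes a small equilibrium Ising model without external fields, sampled by exact enumeration, and uses the standard naive mean-field estimate J_ij = -(C^{-1})_ij for i ≠ j, where C is the connected correlation matrix. The function names, system size and coupling scale are illustrative choices, not taken from the review.

```python
import itertools
import numpy as np

def sample_ising_exact(J, n_samples, rng):
    """Sample P(s) ∝ exp(sum_{i<j} J_ij s_i s_j) by enumerating all 2^N states (small N only)."""
    n = J.shape[0]
    states = np.array(list(itertools.product([-1, 1], repeat=n)), dtype=float)
    # For symmetric J with zero diagonal, 0.5 * s^T J s equals sum_{i<j} J_ij s_i s_j.
    log_w = 0.5 * np.einsum("ki,ij,kj->k", states, J, states)
    probs = np.exp(log_w - log_w.max())
    probs /= probs.sum()
    idx = rng.choice(len(states), size=n_samples, p=probs)
    return states[idx]

def nmf_couplings(samples):
    """Naive mean-field inverse Ising: off-diagonal couplings from the inverse covariance."""
    c = np.cov(samples, rowvar=False)  # connected correlations C_ij
    j_est = -np.linalg.inv(c)
    np.fill_diagonal(j_est, 0.0)       # diagonal entries are not couplings
    return j_est

# Hypothetical test system: 5 spins with weak symmetric Gaussian couplings.
rng = np.random.default_rng(0)
n = 5
J_true = rng.normal(0.0, 0.2, size=(n, n))
J_true = (J_true + J_true.T) / 2.0
np.fill_diagonal(J_true, 0.0)

samples = sample_ising_exact(J_true, 20000, rng)
J_nmf = nmf_couplings(samples)

off = ~np.eye(n, dtype=bool)
agreement = np.corrcoef(J_true[off], J_nmf[off])[0, 1]
print(f"correlation between true and inferred couplings: {agreement:.3f}")
```

Exact enumeration sidesteps the partition-function problem only because N is tiny; for realistic sizes one would need approximate inference of the kind the review surveys.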

preprint, 2020, arXiv

Longitudinal Support Vector Machines for High Dimensional Time Series

We consider the problem of learning a classifier from observed functional data. Here, each data point takes the form of a single time series and contains numerous features. Assuming that each such series comes with a binary label, we consider the problem of learning to predict the label of a new time series. To this end, the notion of margin underlying the classical support vector machine is extended to a continuous version for such data. The longitudinal support vector machine is also a convex optimization problem, and its dual form is derived as well. Empirical results for specified cases with significance tests indicate the efficacy of this algorithm for analyzing such long-term multivariate data.

preprint, 2020, arXiv

Network reconstruction from asynchronously updated evolutionary game

The interactions between players of the prisoner's dilemma (PD) game are reconstructed from evolutionary game data. All participants play the game with their counterparts and gain corresponding rewards during each round of the game. However, their strategies are updated asynchronously during the evolutionary PD game. Two methods for inferring the interactions between players are derived, using the naive mean-field (nMF) approximation and maximum log-likelihood estimation (MLE) respectively. The two methods are also tested numerically for fully connected asymmetric Sherrington-Kirkpatrick (SK) models, varying the data length, degree of asymmetry, payoff and system noise (coupling strength). We find that the reconstruction mean square error (MSE) of the MLE method is proportional to the inverse of the data length and is typically half that of nMF, a benefit of the extra information in the update times. Both methods are robust to the degree of asymmetry but work better for large payoffs. Compared with MLE, nMF is more sensitive to the coupling strength and performs best for weak couplings.

preprint, 2013, arXiv

Maximum likelihood reconstruction for Ising models with asynchronous updates

We describe how the couplings in an asynchronous kinetic Ising model can be inferred. We consider two cases, one in which we know both the spin history and the update times and one in which we only know the spin history. For the first case, we show that one can average over all possible choices of update times to obtain a learning rule that depends only on spin correlations and can also be derived from the equations of motion for the correlations. For the second case, the same rule can be derived within a further decoupling approximation. We study all methods numerically for fully asymmetric Sherrington-Kirkpatrick models, varying the data length, system size, temperature, and external field. Good convergence is observed in accordance with the theoretical expectations.

preprint, 2012, arXiv

L$_1$ Regularization for Reconstruction of a non-equilibrium Ising Model

The couplings in a sparse asymmetric, asynchronous Ising network are reconstructed using an exact learning algorithm. L$_1$ regularization is used to remove the spurious weak connections that would otherwise be found by simply minimizing the minus log likelihood of a finite data set. In order to see how L$_1$ regularization works in detail, we perform the calculation in several ways, including (1) iterative minimization of a cost function equal to minus the log likelihood of the data plus an L$_1$ penalty term, and (2) an approximate scheme based on a quadratic expansion of the cost function around its minimum. In these schemes, we track how connections are pruned as the strength of the L$_1$ penalty is increased from zero to large values. The performance of the methods for various coupling strengths is quantified using ROC curves.
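The pruning effect of an L$_1$ penalty can be sketched as follows. This is not the paper's algorithm: for simplicity it assumes a synchronously (parallel) updated kinetic Ising model rather than the asynchronous one in the abstract, and it minimizes minus the log likelihood plus an L$_1$ term by proximal gradient descent (ISTA). All names, sizes and penalty values are hypothetical.

```python
import numpy as np

def simulate_sync_ising(J, n_steps, rng):
    """Parallel Glauber dynamics: P(s_i(t+1) = +1) = 1 / (1 + exp(-2 h_i)), h = J s(t)."""
    n = J.shape[0]
    s = np.empty((n_steps + 1, n))
    s[0] = rng.choice([-1.0, 1.0], size=n)
    for t in range(n_steps):
        h = J @ s[t]
        p_up = 1.0 / (1.0 + np.exp(-2.0 * h))
        s[t + 1] = np.where(rng.random(n) < p_up, 1.0, -1.0)
    return s

def soft_threshold(x, tau):
    """Proximal operator of tau * |x|: shrinks toward zero and sets small entries exactly to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fit_l1(s, lam, n_iter=500):
    """Minimize the mean minus log likelihood plus lam * sum_ij |J_ij| by ISTA."""
    s_prev, s_next = s[:-1], s[1:]
    t_len, n = s_prev.shape
    # The Hessian is bounded by s_prev^T s_prev / T, so 1 / lambda_max is a safe step size.
    step = 1.0 / np.linalg.eigvalsh(s_prev.T @ s_prev / t_len).max()
    J = np.zeros((n, n))
    for _ in range(n_iter):
        h = s_prev @ J.T                                   # h[t, i] = sum_j J_ij s_j(t)
        grad = -(s_next - np.tanh(h)).T @ s_prev / t_len   # gradient of minus log likelihood
        J = soft_threshold(J - step * grad, step * lam)
    return J

# Hypothetical sparse, asymmetric network of 8 spins.
rng = np.random.default_rng(1)
n = 8
mask = rng.random((n, n)) < 0.2
J_true = 0.8 * mask * rng.choice([-1.0, 1.0], size=(n, n))
spins = simulate_sync_ising(J_true, 5000, rng)

# Track how many couplings survive as the penalty grows.
nonzero = {lam: int(np.count_nonzero(fit_l1(spins, lam))) for lam in (0.0, 0.05, 1.0)}
for lam, count in nonzero.items():
    print(f"lambda = {lam}: {count} nonzero couplings")
```

Sweeping `lam` over a finer grid and comparing the surviving couplings with `J_true` would give the kind of pruning path and ROC analysis the abstract mentions.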

preprint, 2010, arXiv

Network inference using asynchronously updated kinetic Ising Model

Network structures are reconstructed from dynamical data using the naive mean-field (nMF) and Thouless-Anderson-Palmer (TAP) approximations. For the TAP approximation, we use two methods to reconstruct the network: (a) an iteration method; (b) casting the inference formula as a set of cubic equations and solving it directly. We investigate inference of the asymmetric Sherrington-Kirkpatrick (S-K) model with asynchronous updates. The solutions of the sets of cubic equations depend on the temperature T in the S-K model, and a critical temperature Tc is found around 2.1. For T < Tc, the solutions of the cubic equation sets consist of one real root and two complex-conjugate roots, while for T > Tc there are three real roots. The iteration method converges only if the cubic equations have three real solutions. The two methods give the same results when the iteration method converges. Compared to nMF, TAP is somewhat better at low temperatures, but approaches the same performance as the temperature increases. Both methods perform better for longer data lengths, but the improvement is more pronounced for TAP.
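The distinction between the two root structures can be checked from the sign of the cubic discriminant. The sketch below is generic: the abstract does not give the TAP cubic itself, so the coefficients used here are hypothetical examples, not the paper's equations.

```python
def cubic_root_structure(a, b, c, d):
    """Classify the roots of a*x^3 + b*x^2 + c*x + d = 0 (real coefficients) via the discriminant."""
    if a == 0:
        raise ValueError("not a cubic: leading coefficient is zero")
    disc = (18 * a * b * c * d - 4 * b**3 * d + b**2 * c**2
            - 4 * a * c**3 - 27 * a**2 * d**2)
    if disc > 0:
        return "three distinct real roots"
    if disc < 0:
        return "one real root and two complex-conjugate roots"
    return "repeated real root"

# Illustrative cubics only: x^3 - x has roots -1, 0, 1; x^3 + x has roots 0, +i, -i.
print(cubic_root_structure(1, 0, -1, 0))
print(cubic_root_structure(1, 0, 1, 0))
```

In the setting of the abstract, the discriminant would change sign as T crosses Tc, which is where the iteration method stops converging.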
