Source author record

Andrei Zinovyev

Andrei Zinovyev appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Molecular Networks; Quantitative Methods; Machine Learning; Genomics; Computational Engineering, Finance, and Science; Applications; Artificial Intelligence; Biomolecules; Computation; cs.CY; Graphics; Information Theory; Logic in Computer Science; math.IT; Mathematical Software; physics.chem-ph

Catalog footprint

What is connected

19 works

16 topics

4 close collaborators


preprint, 2022, arXiv

Computational challenges of cell cycle analysis using single cell transcriptomics

The cell cycle is one of the most fundamental biological processes important for understanding normal physiology and various pathologies such as cancer. Single cell RNA sequencing technologies give an opportunity to analyse the cell cycle transcriptome dynamics in an unprecedented range of conditions (cell types and perturbations), with thousands of publicly available datasets. Here we review the main computational tasks in such analysis: 1) identification of cell cycle phases, 2) pseudotime inference, 3) identification and profiling of cell cycle-related genes, 4) removing cell cycle effect, 5) identification and analysis of the G0 (quiescent) cells. We review seventeen software packages that are available today for the cell cycle analysis using scRNA-seq data. Despite huge progress achieved, none of the packages can produce complete and reliable results with respect to all aforementioned tasks. One of the major difficulties for existing packages is distinguishing between two patterns of cell cycle transcriptomic dynamics: normal and characteristic for embryonic stem cells (ESC), with the latter one shared by many cancer cell lines. Moreover, some cell lines are characterized by a mixture of two subpopulations, one following the standard and one ESC-like cell cycle, which makes the analysis even more challenging. In conclusion, we discuss the difficulties of the analysis of cell cycle-related single cell transcriptome and provide certain guidelines for the use of the existing methods.

preprint, 2022, arXiv

Quasi-orthogonality and intrinsic dimensions as measures of learning and generalisation

Finding the best architectures of learning machines, such as deep neural networks, is a well-known technical and theoretical challenge. Recent work by Mellor et al. (2021) showed that there may exist correlations between the accuracies of trained networks and the values of some easily computable measures defined on randomly initialised networks, which may make it possible to search tens of thousands of neural architectures without training. Mellor et al. used the Hamming distance evaluated over all ReLU neurons as such a measure. Motivated by these findings, we ask whether there exist other, perhaps more principled, measures that could be used as determinants of the success of a given neural architecture. In particular, we examine whether the dimensionality and quasi-orthogonality of a neural network's feature space could be correlated with the network's performance after training. We show, using the same setup as Mellor et al., that dimensionality and quasi-orthogonality may jointly serve as discriminants of a network's performance. In addition to offering new opportunities to accelerate neural architecture search, our findings suggest important relationships between the networks' final performance and properties of their randomly initialised feature spaces: data dimension and quasi-orthogonality.

preprint, 2020, arXiv

Local intrinsic dimensionality estimators based on concentration of measure

Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds. Knowing ID is crucial to choose the appropriate machine learning approach as well as to understand its behavior and validate it. ID can be computed globally for the whole data point distribution, or computed locally in different regions of the data space. In this paper, we introduce new local estimators of ID based on linear separability of multi-dimensional data point clouds, which is one of the manifestations of concentration of measure. We empirically study the properties of these estimators and compare them with other recently introduced ID estimators exploiting various effects of measure concentration. Observed differences between estimators can be used to anticipate their behaviour in practical applications.

preprint, 2020, arXiv

Synthesis of Boolean Networks from Biological Dynamical Constraints using Answer-Set Programming

Boolean networks model finite discrete dynamical systems with complex behaviours. The state of each component is determined by a Boolean function of the state of (a subset of) the components of the network. This paper addresses the synthesis of these Boolean functions from constraints on their domain and emerging dynamical properties of the resulting network. The dynamical properties relate to the existence and absence of trajectories between partially observed configurations, and to the stable behaviours (fixpoints and cyclic attractors). The synthesis is expressed as a Boolean satisfiability problem relying on Answer-Set Programming with a parametrized complexity, and leads to a complete non-redundant characterization of the set of solutions. Considered constraints are particularly suited to address the synthesis of models of cellular differentiation processes, as illustrated on a case study. The scalability of the approach is demonstrated on random networks with scale-free structures up to 100 to 1,000 nodes depending on the type of constraints.
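The notions of Boolean update functions, fixpoints and cyclic attractors used in this abstract can be illustrated with a minimal sketch. It is not the paper's Answer-Set Programming synthesis method: it simply enumerates all configurations of a hypothetical three-component network by brute force. The rules in `RULES` and the synchronous update scheme are assumptions made for illustration only.

```python
from itertools import product

# Hypothetical three-component network; these update rules are illustrative
# and are not taken from the paper.
RULES = {
    "a": lambda s: s["a"] and not s["c"],
    "b": lambda s: s["a"] or s["b"],
    "c": lambda s: not s["a"],
}


def step(state):
    """Synchronous update (an assumption): all components update at once."""
    return {name: bool(rule(state)) for name, rule in RULES.items()}


def fixpoints():
    """Enumerate all 2^n configurations and keep those mapped to themselves."""
    names = sorted(RULES)
    found = []
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        if step(state) == state:
            found.append(state)
    return found


def attractor_from(state):
    """Iterate from `state` until a configuration repeats; return the cycle.

    A cycle of length 1 is a fixpoint; a longer cycle is a cyclic attractor.
    """
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]
```

Brute-force enumeration is only practical for a handful of components, since the number of configurations doubles with each one added.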

preprint, 2020, arXiv

Trajectories, bifurcations and pseudotime in large clinical datasets: applications to myocardial infarction and diabetes data

Large observational clinical datasets are becoming increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete pathology develops through a number of stereotypical routes, characterized by 'points of no return' and 'final states' (such as lethal or recovery states). Extracting this information directly from the data remains challenging, especially in the case of synchronic (with a short-term follow-up) observations. Here we suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values, through modeling the geometrical data structure as a bouquet of bifurcating clinical trajectories. The methodology is based on the application of elastic principal graphs, which can simultaneously address the tasks of dimensionality reduction, data visualization, clustering, feature selection and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations. The methodology allows positioning a patient on a particular clinical trajectory (pathological scenario) and characterizing the degree of progression along it with a qualitative estimate of the uncertainty of the prognosis. Overall, our pseudotime quantification-based approach makes it possible to apply the methods developed for dynamical disease phenotyping and illness trajectory analysis (diachronic data analysis) to synchronic observational data. We developed a tool, ClinTrajan, for clinical trajectory analysis, implemented in the Python programming language. We test the methodology on two large publicly available datasets: myocardial infarction complications and readmission of diabetic patients.

preprint, 2019, arXiv

Basic, simple and extendable kinetic model of protein synthesis

Protein synthesis is one of the most fundamental biological processes, and it consumes a significant amount of cellular resources. Despite the existence of multiple mathematical models of translation, varying in the level of mechanistic detail, there is, surprisingly, no basic and simple chemical kinetic model of this process derived directly from the detailed kinetic model. One of the reasons for this is that the translation process is characterized by an indefinite number of states, owing to the existence of polysomes. We bypass this difficulty by lumping multiple states of translated mRNA into a few dynamical variables and by introducing a variable describing the pool of translating ribosomes. The simplest model can be solved analytically under some assumptions. The basic and simple model can be extended, if necessary, to take into account various phenomena such as the interaction between translating ribosomes, a limited amount of ribosomal units or regulation of translation by microRNA. The model can be used as a building block (translation module) for more complex models of cellular processes. We demonstrate the utility of the model in two examples. First, we determine the critical parameters of single protein synthesis for the case when the ribosomal units are abundant. Second, we demonstrate intrinsic bi-stability in the dynamics of the ribosomal protein turnover and predict that a minimal number of ribosomes should pre-exist in a living cell to sustain its protein synthesis machinery, even in the absence of proliferation.

preprint, 2018, arXiv

Robust and scalable learning of complex dataset topologies via ElPiGraph

Large datasets represented by multidimensional data point clouds often possess non-trivial distributions with branching trajectories and excluded regions, with the recent single-cell transcriptomic studies of developing embryo being notable examples. Reducing the complexity and producing compact and interpretable representations of such data remains a challenging task. Most of the existing computational methods are based on exploring the local data point neighbourhood relations, a step that can perform poorly in the case of multidimensional and noisy data. Here we present ElPiGraph, a scalable and robust method for approximation of datasets with complex structures which does not require computing the complete data distance matrix or the data point neighbourhood graph. This method is able to withstand high levels of noise and is capable of approximating complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently with large and complex datasets in various fields from biology, where it can be used to infer gene dynamics from single-cell RNA-Seq, to astronomy, where it can be used to explore complex structures in the distribution of galaxies.

preprint, 2015, arXiv

DeDaL: Cytoscape 3.0 app for producing and morphing data-driven and structure-driven network layouts

Visualization and analysis of molecular profiling data together with biological networks can provide new mechanistic insights into biological functions. Currently, high-throughput data are usually visualized on top of predefined network layouts, which are not always adapted to a given data analysis task. We developed a Cytoscape app that allows constructing biological network layouts based on the data from molecular profiles imported as values of node attributes. DeDaL is a Cytoscape 3.0 app which uses linear and non-linear algorithms of dimension reduction to produce data-driven network layouts based on multidimensional data (typically gene expression). DeDaL implements several data pre-processing and layout post-processing steps, such as continuous morphing between two arbitrary network layouts and aligning one network layout with respect to another by rotating and mirroring. Combining these possibilities facilitates creating insightful network layouts representing both structural network features and the correlation patterns in multivariate data. DeDaL is the first method that allows constructing biological network layouts from high-throughput data. DeDaL is freely available for download, together with a step-by-step tutorial, at http://bioinfo-out.curie.fr/projects/dedal/.
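Continuous morphing between two layouts, as mentioned in this abstract, can be pictured with a short sketch. The abstract does not say how DeDaL interpolates, so plain linear interpolation of node coordinates is assumed here; the function name `morph` and the node names are hypothetical and not part of DeDaL or Cytoscape.

```python
def morph(layout_a, layout_b, t):
    """Linearly interpolate node positions between two layouts.

    layout_a, layout_b: dicts mapping node name -> (x, y); same node set.
    t: morphing parameter, 0.0 gives layout_a and 1.0 gives layout_b.
    Linear interpolation is an assumption, not DeDaL's documented method.
    """
    if layout_a.keys() != layout_b.keys():
        raise ValueError("both layouts must contain the same nodes")
    return {
        node: (
            (1.0 - t) * layout_a[node][0] + t * layout_b[node][0],
            (1.0 - t) * layout_a[node][1] + t * layout_b[node][1],
        )
        for node in layout_a
    }


# Hypothetical layouts: one structure-driven, one data-driven.
structure_layout = {"geneA": (0.0, 0.0), "geneB": (2.0, 0.0)}
data_layout = {"geneA": (2.0, 4.0), "geneB": (0.0, 2.0)}
halfway = morph(structure_layout, data_layout, 0.5)
```

Intermediate values of `t` give layouts that mix the two extremes, for example structural network features and correlation patterns in the data.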

preprint, 2015, arXiv

Predicting genetic interactions from Boolean models of biological networks

Genetic interaction can be defined as a deviation of the phenotypic quantitative effect of a double gene mutation from the effect predicted from single mutations using a simple (e.g., multiplicative or linear additive) statistical model. Experimentally characterized genetic interaction networks in model organisms provide important insights into relationships between different biological functions. We describe a computational methodology that allows a Boolean mathematical model of a biological network to be systematically and quantitatively characterized in terms of genetic interactions between all loss-of-function and gain-of-function mutations with respect to all model phenotypes or outputs. We use the probabilistic framework defined in the MaBoSS software, based on continuous-time Markov chains and stochastic simulations. In addition, we suggest several computational tools for studying the distribution of double mutants in the space of model phenotype probabilities. We demonstrate this methodology on three published models, for each of which we derive the genetic interaction networks and analyze their properties. We classify the obtained interactions according to their class of epistasis and their dependence on the chosen initial conditions and phenotype. The use of this methodology for validating mathematical models from experimental data and designing new experiments is discussed.
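The definition of genetic interaction in the first sentence of this abstract can be written as a small calculation. The sketch below is not the MaBoSS-based methodology of the paper: it computes the deviation of a double-mutant effect from a multiplicative or additive null model. The function name `epistasis`, the convention that effects are expressed relative to a wild-type value of 1.0, and the numbers used are all assumptions for illustration.

```python
def epistasis(w_a, w_b, w_ab, model="multiplicative"):
    """Deviation of the observed double-mutant effect from a null model.

    w_a, w_b: phenotypic effects of the single mutants (wild type = 1.0).
    w_ab: observed effect of the double mutant.
    Returns observed minus expected; zero means no interaction.
    """
    if model == "multiplicative":
        expected = w_a * w_b
    elif model == "additive":
        # Single-mutant deviations from wild type are assumed to add up.
        expected = w_a + w_b - 1.0
    else:
        raise ValueError("model must be 'multiplicative' or 'additive'")
    return w_ab - expected


# Hypothetical values: the double mutant is worse than either null model predicts.
eps_mult = epistasis(0.8, 0.5, 0.1)
eps_add = epistasis(0.8, 0.5, 0.1, model="additive")
```

A negative value indicates that the double mutant has a stronger effect than expected from the single mutants, and a positive value indicates a weaker one.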

preprint, 2014, arXiv

Dealing with complexity of biological systems: from data to models

Four chapters of the synthesis represent four major areas of my research interests: 1) data analysis in molecular biology, 2) mathematical modeling of biological networks, 3) genome evolution, and 4) cancer systems biology. The first chapter is devoted to my work in developing non-linear methods of dimension reduction (methods of elastic maps and principal trees) which extends the classical method of principal components. Also I present application of matrix factorization techniques to analysis of cancer data. The second chapter is devoted to the complexity of mathematical models in molecular biology. I describe the basic ideas of asymptotology of chemical reaction networks aiming at dissecting and simplifying complex chemical kinetics models. Two applications of this approach are presented: to modeling NFkB and apoptosis pathways, and to modeling mechanisms of miRNA action on protein translation. The third chapter briefly describes my investigations of the genome structure in different organisms (from microbes to human cancer genomes). Unsupervised data analysis approaches are used to investigate the patterns in genomic sequences shaped by genome evolution and influenced by the basic properties of the environment. The fourth chapter summarizes my experience in studying cancer by computational methods (through combining integrative data analysis and mathematical modeling approaches). In particular, I describe the on-going research projects such as mathematical modeling of cell fate decisions and synthetic lethal interactions in DNA repair network. The synthesis is concluded by listing major challenges in computational systems biology, connected to the topics of this text, i.e. dealing with complexity of biological systems.

preprint, 2014, arXiv

ViDaExpert: user-friendly tool for nonlinear visualization and analysis of multidimensional vectorial data

ViDaExpert is a tool for visualization and analysis of multidimensional vectorial data. ViDaExpert is able to work with data tables of "object-feature" type that might contain numerical feature values as well as textual labels for rows (objects) and columns (features). ViDaExpert implements several statistical methods, such as standard and weighted Principal Component Analysis (PCA), the method of elastic maps (a non-linear version of PCA), Linear Discriminant Analysis (LDA), multilinear regression, K-Means clustering, and a variant of the decision tree construction algorithm. Equipped with several user-friendly dialogs for configuring data point representations (size, shape, color) and a fast 3D viewer, ViDaExpert is a handy tool for constructing an interactive 3D scene representing a table of data in multidimensional space and for performing quick and insightful statistical analysis of it, from basic to advanced methods.

preprint, 2013, arXiv

Blind source separation methods for deconvolution of complex signals in cancer biology

Two blind source separation methods (Independent Component Analysis and Non-negative Matrix Factorization), developed initially for signal processing in engineering, have recently found a number of applications in the analysis of large-scale data in molecular biology. In this short review, we present the common idea behind these methods, describe ways of implementing and applying them, and point out their advantages compared to more traditional statistical approaches. We focus more specifically on the analysis of gene expression in cancer. The review concludes by listing available software implementations of the methods described.

preprint, 2013, arXiv

Cell death and life in cancer: mathematical modeling of cell fate decisions

Tumor development is characterized by a compromised balance between cell life and death decision mechanisms, which are tightly regulated in normal cells. Understanding this process provides insights for developing new cancer treatments. We present a study of a mathematical model describing cellular choice between survival and two alternative cell death modalities: apoptosis and necrosis. The model is implemented in a discrete modeling formalism and allows predicting the probabilities of having a particular cellular phenotype in response to engagement of cell death receptors. Using an original parameter sensitivity analysis developed for discrete dynamic systems, we identify the parameters that critically affect the cell fate decision variables and discuss how they are exploited by existing cancer therapies.

preprint, 2013, arXiv

Data complexity measured by principal graphs

How to measure the complexity of a finite set of vectors embedded in a multidimensional space? This is a non-trivial question which can be approached in many different ways. Here we suggest a set of data complexity measures using universal approximators, principal cubic complexes. Principal cubic complexes generalise the notion of principal manifolds for datasets with non-trivial topologies. The type of the principal cubic complex is determined by its dimension and a grammar of elementary graph transformations. The simplest grammar produces principal trees. We introduce three natural types of data complexity: 1) geometric (deviation of the data's approximator from some "idealized" configuration, such as deviation from harmonicity); 2) structural (how many elements of a principal graph are needed to approximate the data), and 3) construction complexity (how many applications of elementary graph transformations are needed to construct the principal object starting from the simplest one). We compute these measures for several simulated and real-life data distributions and show them in the "accuracy-complexity" plots, helping to optimize the accuracy/complexity ratio. We discuss various issues connected with measuring data complexity. Software for computing data complexity measures from principal cubic complexes is provided as well.

preprint, 2013, arXiv

Model composition through model reduction: a combined model of CD95 and NF-κB signaling pathways

We propose a new approach to model composition, based on reducing several models to the same level of complexity and subsequently combining them. Firstly, we suggest a set of model reduction tools that can be systematically applied to a given model. Secondly, we suggest a notion of a minimal complexity model. This model is the simplest one that can be obtained from the original model using these tools and is still able to approximate experimental data. Thirdly, we propose a strategy for composing the reduced models together. The connection with the detailed model is preserved, which can be advantageous in some applications. A toolbox for model reduction and composition has been implemented as part of the BioUML software and tested on the example of integrating two previously published models of the CD95 (APO-1/Fas) signaling pathways. We show that the reduced models lead to the same dynamical behavior of observable species and the same predictions as the precursor models. The composite model is able to recapitulate several experimental datasets which were used by the authors of the original models to calibrate them separately, but it also has new dynamical properties.

preprint, 2013, arXiv

NaviCell: a web-based environment for navigation, curation and maintenance of large molecular interaction maps

Molecular biology knowledge can be systematically represented in a computer-readable form as a comprehensive map of molecular interactions. There exist a number of maps of molecular interactions containing detailed descriptions of various cell mechanisms. It is difficult to explore these large maps, to comment on their content and to maintain them. Though there exist several tools addressing these problems individually, the scientific community still lacks an environment that combines these three capabilities. NaviCell is a web-based environment for exploiting large maps of molecular interactions, created in CellDesigner, allowing their easy exploration, curation and maintenance. NaviCell combines three features: (1) efficient map browsing based on the Google Maps engine; (2) semantic zooming for viewing different levels of detail or abstraction of the map; and (3) an integrated web-based blog for collecting community feedback. NaviCell can be easily used by experts in the field of molecular biology for studying molecular entities of interest in the context of signaling pathways and cross-talk between pathways within a global signaling network. NaviCell allows both exploration of detailed molecular mechanisms represented on the map and a more abstract view of the map, up to a top-level modular representation. NaviCell facilitates curation, maintenance and updating of comprehensive maps of molecular interactions in an interactive fashion, thanks to an embedded blogging system. NaviCell provides an easy way to explore large-scale maps of molecular interactions, thanks to the Google Maps and WordPress interfaces, already familiar to many users. Semantic zooming, used for navigating geographical maps, is adopted for molecular maps in NaviCell, making any level of visualization meaningful to the user. In addition, NaviCell provides a framework for community-based map curation.

preprint, 2012, arXiv

Reduction of dynamical biochemical reaction networks in computational biology

Biochemical networks are used in computational biology to model the static and dynamical details of systems involved in cell signaling, metabolism, and regulation of gene expression. Parametric and structural uncertainty, as well as combinatorial explosion, are strong obstacles to analyzing the dynamics of large models of this type. Multi-scaleness is another property of these networks that can be used to get past some of these obstacles. Networks with many well-separated time scales can be reduced to simpler networks in a way that depends only on the orders of magnitude and not on the exact values of the kinetic parameters. The main idea used for such robust simplifications of networks is the concept of dominance among model elements, allowing hierarchical organization of these elements according to their effects on the network dynamics. This concept finds a natural formulation in tropical geometry. We revisit, in the light of these new ideas, the main approaches to model reduction of reaction networks, such as quasi-steady state and quasi-equilibrium approximations, and provide practical recipes for model reduction of linear and nonlinear networks. We also discuss the application of model reduction to backward pruning machine learning techniques.

preprint, 2010, arXiv

Data visualization in political and social sciences

The basic objective of data visualization is to provide an efficient graphical display for summarizing and reasoning about quantitative information. During the last decades, political science has accumulated a large corpus of various kinds of data, such as comprehensive factbooks and atlases, characterizing all or most existing states by multiple, objectively assessed numerical indicators within a certain time span. As a consequence, there is a continuing trend for political science to gradually become a more quantitative scientific field and to use quantitative information in analysis and reasoning. It is believed that any objective analysis in political science must be multidimensional and combine various sources of quantitative information; however, human capabilities for perceiving large arrays of numerical information are limited. Hence, methods and approaches for visualization of quantitative and qualitative data (and especially multivariate data) are an extremely important topic. Data visualization approaches can be classified into several groups, starting from creating informative charts and diagrams (statistical graphics and infographics) and ending with advanced statistical methods for visualizing multidimensional tables containing both quantitative and qualitative information. In this article we provide a short review of existing data visualization methods with applications in political and social science.

preprint, 2009, arXiv

Dynamical modeling of microRNA action on the protein translation process

Protein translation is a multistep process which can be represented as a cascade of biochemical reactions (initiation, ribosome assembly, elongation, etc.), the rate of which can be regulated by small non-coding microRNAs through multiple mechanisms. It remains unclear which mechanisms of microRNA action are dominant; moreover, many experimental reports deliver controversial messages on which concrete mechanism is actually observed in the experiment. Parker and Nissan (Parker and Nissan, RNA, 2008) demonstrated that it is impossible to distinguish alternative biological hypotheses using steady-state data on the rate of protein synthesis. For their analysis they used two simple kinetic models of protein translation. In contrast, we show that dynamical data allow some of the mechanisms of microRNA action to be discriminated. We demonstrate this using the same models as in (Parker and Nissan, RNA, 2008) for the sake of comparison, but the methods developed (asymptotology of biochemical networks) can be used for other models. As one of the results of our analysis, we formulate a hypothesis that the effect of microRNA action is measurable and observable only if it affects the dominant system (a generalization of the limiting step notion for complex networks) of the protein translation machinery. The dominant system can vary under different experimental conditions, which can partially explain the existing controversy in some of the experimental data.
