Researcher profile

Andrei Zinovyev

Andrei Zinovyev contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
7works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

7 published item(s)

preprint2022arXiv

Computational challenges of cell cycle analysis using single cell transcriptomics

The cell cycle is one of the most fundamental biological processes important for understanding normal physiology and various pathologies such as cancer. Single cell RNA sequencing technologies give an opportunity to analyse the cell cycle transcriptome dynamics in an unprecedented range of conditions (cell types and perturbations), with thousands of publicly available datasets. Here we review the main computational tasks in such analysis: 1) identification of cell cycle phases, 2) pseudotime inference, 3) identification and profiling of cell cycle-related genes, 4) removing cell cycle effect, 5) identification and analysis of the G0 (quiescent) cells. We review seventeen software packages that are available today for the cell cycle analysis using scRNA-seq data. Despite huge progress achieved, none of the packages can produce complete and reliable results with respect to all aforementioned tasks. One of the major difficulties for existing packages is distinguishing between two patterns of cell cycle transcriptomic dynamics: normal and characteristic for embryonic stem cells (ESC), with the latter one shared by many cancer cell lines. Moreover, some cell lines are characterized by a mixture of two subpopulations, one following the standard and one ESC-like cell cycle, which makes the analysis even more challenging. In conclusion, we discuss the difficulties of the analysis of cell cycle-related single cell transcriptome and provide certain guidelines for the use of the existing methods.

preprint2022arXiv

Quasi-orthogonality and intrinsic dimensions as measures of learning and generalisation

Finding best architectures of learning machines, such as deep neural networks, is a well-known technical and theoretical challenge. Recent work by Mellor et al (2021) showed that there may exist correlations between the accuracies of trained networks and the values of some easily computable measures defined on randomly initialised networks which may enable to search tens of thousands of neural architectures without training. Mellor et al used the Hamming distance evaluated over all ReLU neurons as such a measure. Motivated by these findings, in our work, we ask the question of the existence of other and perhaps more principled measures which could be used as determinants of success of a given neural architecture. In particular, we examine, if the dimensionality and quasi-orthogonality of neural networks' feature space could be correlated with the network's performance after training. We showed, using the setup as in Mellor et al, that dimensionality and quasi-orthogonality may jointly serve as network's performance discriminants. In addition to offering new opportunities to accelerate neural architecture search, our findings suggest important relationships between the networks' final performance and properties of their randomly initialised feature spaces: data dimension and quasi-orthogonality.

preprint2020arXiv

Local intrinsic dimensionality estimators based on concentration of measure

Intrinsic dimensionality (ID) is one of the most fundamental characteristics of multi-dimensional data point clouds. Knowing ID is crucial to choose the appropriate machine learning approach as well as to understand its behavior and validate it. ID can be computed globally for the whole data point distribution, or computed locally in different regions of the data space. In this paper, we introduce new local estimators of ID based on linear separability of multi-dimensional data point clouds, which is one of the manifestations of concentration of measure. We empirically study the properties of these estimators and compare them with other recently introduced ID estimators exploiting various effects of measure concentration. Observed differences between estimators can be used to anticipate their behaviour in practical applications.

preprint2020arXiv

Synthesis of Boolean Networks from Biological Dynamical Constraints using Answer-Set Programming

Boolean networks model finite discrete dynamical systems with complex behaviours. The state of each component is determined by a Boolean function of the state of (a subset of) the components of the network. This paper addresses the synthesis of these Boolean functions from constraints on their domain and emerging dynamical properties of the resulting network. The dynamical properties relate to the existence and absence of trajectories between partially observed configurations, and to the stable behaviours (fixpoints and cyclic attractors). The synthesis is expressed as a Boolean satisfiability problem relying on Answer-Set Programming with a parametrized complexity, and leads to a complete non-redundant characterization of the set of solutions. Considered constraints are particularly suited to address the synthesis of models of cellular differentiation processes, as illustrated on a case study. The scalability of the approach is demonstrated on random networks with scale-free structures up to 100 to 1,000 nodes depending on the type of constraints.

preprint2020arXiv

Trajectories, bifurcations and pseudotime in large clinical datasets: applications to myocardial infarction and diabetes data

Large observational clinical datasets become increasingly available for mining associations between various disease traits and administered therapy. These datasets can be considered as representations of the landscape of all possible disease conditions, in which a concrete pathology develops through a number of stereotypical routes, characterized by `points of no return' and `final states' (such as lethal or recovery states). Extracting this information directly from the data remains challenging, especially in the case of synchronic (with a short-term follow up) observations. Here we suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values, through modeling the geometrical data structure as a bouquet of bifurcating clinical trajectories. The methodology is based on application of elastic principal graphs which can address simultaneously the tasks of dimensionality reduction, data visualization, clustering, feature selection and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations. The methodology allows positioning a patient on a particular clinical trajectory (pathological scenario) and characterizing the degree of progression along it with a qualitative estimate of the uncertainty of the prognosis. Overall, our pseudo-time quantification-based approach gives a possibility to apply the methods developed for dynamical disease phenotyping and illness trajectory analysis (diachronic data analysis) to synchronic observational data. We developed a tool $ClinTrajan$ for clinical trajectory analysis implemented in Python programming language. We test the methodology in two large publicly available datasets: myocardial infarction complications and readmission of diabetic patients data.

preprint2019arXiv

Basic, simple and extendable kinetic model of protein synthesis

Protein synthesis is one of the most fundamental biological processes, which consumes a significant amount of cellular resources. Despite existence of multiple mathematical models of translation, varying in the level of mechanistical details, surprisingly, there is no basic and simple chemical kinetic model of this process, derived directly from the detailed kinetic model. One of the reasons for this is that the translation process is characterized by indefinite number of states, thanks to existence of polysomes. We bypass this difficulty by applying a trick consisting in lumping multiple states of translated mRNA into few dynamical variables and by introducing a variable describing the pool of translating ribosomes. The simplest model can be solved analytically under some assumptions. The basic and simple model can be extended, if necessary, to take into account various phenomena such as the interaction between translating ribosomes, limited amount of ribosomal units or regulation of translation by microRNA. The model can be used as a building block (translation module) for more complex models of cellular processes. We demonstrate the utility of the model in two examples. First, we determine the critical parameters of the single protein synthesis for the case when the ribosomal units are abundant. Second, we demonstrate intrinsic bi-stability in the dynamics of the ribosomal protein turnover and predict that a minimal number of ribosomes should pre-exists in a living cell to sustain its protein synthesis machinery, even in the absence of proliferation.

preprint2018arXiv

Robust And Scalable Learning Of Complex Dataset Topologies Via Elpigraph

Large datasets represented by multidimensional data point clouds often possess non-trivial distributions with branching trajectories and excluded regions, with the recent single-cell transcriptomic studies of developing embryo being notable examples. Reducing the complexity and producing compact and interpretable representations of such data remains a challenging task. Most of the existing computational methods are based on exploring the local data point neighbourhood relations, a step that can perform poorly in the case of multidimensional and noisy data. Here we present ElPiGraph, a scalable and robust method for approximation of datasets with complex structures which does not require computing the complete data distance matrix or the data point neighbourhood graph. This method is able to withstand high levels of noise and is capable of approximating complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently with large and complex datasets in various fields from biology, where it can be used to infer gene dynamics from single-cell RNA-Seq, to astronomy, where it can be used to explore complex structures in the distribution of galaxies.