Source author record

Gregory Nuel

Gregory Nuel appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Applications, Computation, Genomics, Information Theory, Machine Learning, math.IT, math.PR, Methodology

Catalog footprint

What is connected

9 works

8 topics

4 close collaborators


Preprint, 2016, arXiv

Semi-Parametric Survival Estimation for pedigrees

Mendelian diseases are determined by a single mutation in a given gene. However, for diseases with late onset, the age at onset is variable, and onset may not be observed at all within a lifetime. Estimating the survival function of mutation carriers and the effect of modifying factors such as sex, mutation, origin, etc., is an important task, both for the management of mutation carriers and for prevention. In this work, we present a semi-parametric method based on a proportional hazards model to estimate the survival function using pedigrees ascertained through affected individuals (probands). Not all members of the pedigree need to be genotyped. The ascertainment bias is corrected by using only the phenotypic information from the relatives of the proband, and not from the proband himself. The method manages ungenotyped individuals through belief propagation in Bayesian networks and uses an EM algorithm to compute a Kaplan-Meier estimator of the survival function. The method is illustrated on simulated data and on a sample of families with transthyretin-related hereditary amyloidosis, a rare autosomal dominant disease with highly variable age of onset.
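The Kaplan-Meier estimator named in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `kaplan_meier` and the optional `weights` argument are hypothetical, with the weights standing in for something like posterior carrier probabilities that an E-step might supply.

```python
def kaplan_meier(times, events, weights=None):
    """Return [(t, S(t))] at each distinct time where an event occurs.

    times   : follow-up time (age at onset, or age at censoring)
    events  : 1 if onset was observed at that time, 0 if censored
    weights : optional per-individual weights (hypothetical: e.g. a
              posterior probability of carrying the mutation)
    """
    if weights is None:
        weights = [1.0] * len(times)
    data = sorted(zip(times, events, weights))
    at_risk = sum(w for _, _, w in data)  # weighted size of the risk set
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = 0.0        # weighted number of events at time t
        removed = 0.0  # weighted number leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            _, e, w = data[i]
            d += w * e
            removed += w
            i += 1
        if d > 0 and at_risk > 0:
            surv *= 1.0 - d / at_risk  # product-limit update
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Toy data: five individuals, two of them censored (at ages 3 and 8).
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print([(t, round(s, 3)) for t, s in curve])
# prints [(2, 0.8), (3, 0.6), (5, 0.3)]
```

With all weights equal to 1 this reduces to the ordinary product-limit estimator. Fractional weights let an individual of uncertain genotype contribute only partially to the risk set.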

Preprint, 2015, arXiv

An adaptive Ridge procedure for L0 regularization

Penalized selection criteria like AIC or BIC are among the most popular methods for variable selection. Their theoretical properties have been studied intensively and are well understood, but making use of them in the case of high-dimensional data is difficult due to the non-convex optimization problem induced by L0 penalties. An elegant solution to this problem is provided by the multi-step adaptive lasso, where iteratively weighted lasso problems are solved, whose weights are updated in such a way that the procedure converges towards selection with L0 penalties. In this paper we introduce an adaptive ridge procedure (AR) which mimics the adaptive lasso, but is based on weighted Ridge problems. After introducing AR, its theoretical properties are studied in the particular case of orthogonal linear regression. For the non-orthogonal case, extensive simulations are performed to assess the performance of AR. In the case of Poisson regression and logistic regression, it is illustrated how the iterative procedure of AR can be combined with iterative maximization procedures. The paper ends with an efficient implementation of AR in the context of least-squares segmentation.
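The iteratively reweighted ridge idea can be illustrated in the orthogonal case, where each weighted ridge problem has a closed-form solution. The sketch below rests on assumptions not stated in the abstract: an orthonormal design (X^T X = I), the weight update w_j = 1 / (beta_j^2 + delta^2), and the hypothetical names `adaptive_ridge_orthogonal`, `lam` and `delta`.

```python
def adaptive_ridge_orthogonal(z, lam, delta=1e-5, n_iter=200):
    """Iterate weighted ridge updates for an orthonormal design.

    z     : least-squares coefficients X^T y (assumes X^T X = I)
    lam   : penalty level
    delta : small constant that keeps the weights finite (assumed form)
    """
    beta = list(z)  # start from the unpenalized solution
    for _ in range(n_iter):
        # A coefficient that shrinks receives a larger weight, which
        # pushes it further towards zero at the next iteration.
        w = [1.0 / (b * b + delta * delta) for b in beta]
        # Minimizer of (z_j - b)^2 + lam * w_j * b^2, coordinate by coordinate.
        beta = [zj / (1.0 + lam * wj) for zj, wj in zip(z, w)]
    return beta


beta = adaptive_ridge_orthogonal([3.0, 0.5, -3.0], lam=1.0)
print([round(b, 3) for b in beta])
```

In this toy run the two large coefficients settle at nonzero values while the small one is driven to numerically zero, which is the hard-thresholding behaviour one expects from an L0-type penalty.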

Preprint, 2014, arXiv

Non-subjective power analysis to detect G*E interactions in Genome-Wide Association Studies in presence of confounding factor

It is generally acknowledged that most complex diseases are affected in part by interactions between genes and/or between genes and environmental factors. Taking into account environmental exposures and their interactions with genetic factors in genome-wide association studies (GWAS) can help to identify high-risk subgroups in the population and provide a better understanding of the disease. For this reason, many methods have been developed to detect gene-environment (G*E) interactions. Despite this, few loci that interact with environmental exposures have been identified so far. Indeed, the modest effect of G*E interactions as well as confounding factors entail low statistical power to detect such interactions. In this work, we provide a simulated dataset in order to study methods for detecting G*E interactions in GWAS in the presence of confounding factors and population structure. Our work applies a recently introduced non-subjective method for H1 simulations called waffect and exploits the publicly available HapMap project to build a dataset with real genotypes and population structures. We use this dataset to study the impact of confounding factors and compare the relative performance of popular methods such as PLINK, random forests and linear mixed models to detect G*E interactions. The presence of confounding factors is an obstacle to detecting G*E interactions in GWAS, and the approaches considered in our power study all have insufficient power to detect the strong simulated interaction. Our simulated dataset could help to develop new methods which account for confounding factors through latent exposures in order to improve power.

Preprint, 2013, arXiv

Fast estimation of posterior probabilities in change-point models through a constrained hidden Markov model

The detection of change-points in heterogeneous sequences is a statistical challenge with applications across a wide variety of fields. In bioinformatics, a vast amount of methodology exists to identify an ideal set of change-points for detecting Copy Number Variation (CNV). While many efficient algorithms are currently available for finding the best segmentation of the data in CNV, relatively few approaches consider the important problem of assessing the uncertainty of the change-point location. Asymptotic and stochastic approaches exist but often require additional model assumptions to speed up the computations, while exact methods have quadratic complexity, which is usually intractable for large datasets of tens of thousands of points or more. In this paper, we suggest an exact method for obtaining the posterior distribution of change-points with linear complexity, based on a constrained hidden Markov model. The methods are implemented in the R package postCP, which uses the results of a given change-point detection algorithm to estimate the probability that each observation is a change-point. We present the results of the package on a publicly available CNV data set (n=120). Due to its frequentist framework, postCP obtains less conservative confidence intervals than previously published Bayesian methods, but with linear complexity instead of quadratic. Simulations showed that postCP provided comparable loss to a Bayesian MCMC method when estimating posterior means, specifically when assessing larger-scale changes, while being more computationally efficient. On another high-resolution CNV data set (n=14,241), the implementation processed information in less than one second on a mid-range laptop computer.
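The constrained hidden Markov model idea can be illustrated with a small forward-backward recursion. This is a sketch under assumptions, not the postCP code: it assumes Gaussian emissions with segment means fixed in advance (for example from a prior segmentation) and a common standard deviation, a chain that starts in the first segment, ends in the last, and can only stay or move to the next segment. The name `changepoint_posterior` is hypothetical.

```python
import math

NEG_INF = -math.inf


def _logaddexp(a, b):
    """log(exp(a) + exp(b)), safe when either argument is -inf."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def changepoint_posterior(x, means, sd=1.0):
    """post[k][i] = P(S_i = k, S_{i+1} = k + 1 | x) for a left-to-right chain.

    Every admissible path has exactly K - 1 jumps, so constant transition
    weights cancel and are omitted. The Gaussian normalizing constant is
    dropped too, since sd is shared by all segments.
    """
    n, K = len(x), len(means)
    loge = [[-0.5 * ((xi - m) / sd) ** 2 for m in means] for xi in x]

    # Forward pass: F[i][k] = log-weight of x[0..i] with S_i = k.
    F = [[NEG_INF] * K for _ in range(n)]
    F[0][0] = loge[0][0]  # the chain must start in segment 0
    for i in range(1, n):
        for k in range(K):
            jump = F[i - 1][k - 1] if k > 0 else NEG_INF
            F[i][k] = _logaddexp(F[i - 1][k], jump) + loge[i][k]

    # Backward pass: B[i][k] = log-weight of x[i+1..n-1] given S_i = k.
    B = [[NEG_INF] * K for _ in range(n)]
    B[n - 1][K - 1] = 0.0  # the chain must end in the last segment
    for i in range(n - 2, -1, -1):
        for k in range(K):
            stay = B[i + 1][k] + loge[i + 1][k]
            jump = B[i + 1][k + 1] + loge[i + 1][k + 1] if k + 1 < K else NEG_INF
            B[i][k] = _logaddexp(stay, jump)

    log_z = F[n - 1][K - 1]
    return [
        [math.exp(F[i][k] + loge[i + 1][k + 1] + B[i + 1][k + 1] - log_z)
         for i in range(n - 1)]
        for k in range(K - 1)
    ]


post = changepoint_posterior([0.1, -0.2, 0.0, 5.1, 4.8, 5.2], means=[0.0, 5.0])
print([round(p, 4) for p in post[0]])
```

Each row of the result is a probability distribution over the location of one change-point, and the whole computation takes a number of operations proportional to n times K.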

Preprint, 2013, arXiv

Fast estimation of the ICL criterion for change-point detection problems with applications to Next-Generation Sequencing data

In this paper, we consider the Integrated Completed Likelihood (ICL) as a useful criterion for estimating the number of changes in the underlying distribution of data in problems where detecting the precise location of these changes is the main goal. The exact computation of the ICL requires O(Kn^2) operations (with K the number of segments and n the number of data-points) which is prohibitive in many practical situations with large sequences of data. We describe a framework to estimate the ICL with O(Kn) complexity. Our approach is general in the sense that it can accommodate any given model distribution. We checked the run-time and validity of our approach on simulated data and demonstrate its good performance when analyzing real Next-Generation Sequencing (NGS) data using a negative binomial model.

Preprint, 2013, arXiv

From GWAS to transcriptomics in prospective cancer design - new statistical challenges

Background. With the increasing interest in post-GWAS research which represents a transition from genome-wide association discovery to analysis of functional mechanisms, attention has been lately focused on the potential of including various biological material in epidemiological studies. In particular, exploration of the carcinogenic process through transcriptional analysis at the epidemiological level opens up new horizons in functional analysis and causal inference, and requires a new design together with adequate analysis procedures. Results. In this article, we present the post-genome design implemented in the NOWAC cohort as an example of a prospective nested case-control study built for transcriptomics use, and discuss analytical strategies to explore the changes occurring in transcriptomics during the carcinogenic process in association with questionnaire information. We emphasize the inadequacy of survival analysis models usually considered in GWAS for post-genome design, and propose instead to parameterize the gene trajectories during the carcinogenic process. Conclusions. This novel approach, in which transcriptomics are considered as potential intermediate biomarkers of cancer and exposures, offers a flexible framework which can include various biological assumptions.

Preprint, 2012, arXiv

Alternative Methods for H1 Simulations in Genome Wide Association Studies

Assessing the statistical power to detect susceptibility variants plays a critical role in GWA studies both from the prospective and retrospective points of view. Power is empirically estimated by simulating phenotypes under a disease model H1. For this purpose, the "gold" standard consists of simulating genotypes given the phenotypes (e.g. Hapgen). We introduce here an alternative approach for simulating phenotypes under H1 that does not require generating new genotypes for each simulation. In order to simulate phenotypes with a fixed total number of cases and under a given disease model, we suggest three algorithms: i) a simple rejection algorithm; ii) a numerical Markov Chain Monte-Carlo (MCMC) approach; and iii) an exact and efficient backward sampling algorithm. In our study, we validated the three algorithms both on a toy-dataset and by comparing them with Hapgen on a more realistic dataset. As an application, we then conducted a simulation study on a 1000 Genomes Project dataset consisting of 629 individuals (314 cases) and 8,048 SNPs from Chromosome X. We arbitrarily defined an additive disease model with two susceptibility SNPs and an epistatic effect. The three algorithms are consistent, but backward sampling is dramatically faster than the other two. Our approach also gives consistent results with Hapgen. Using our application data, we showed that our limited design requires a biological a priori to limit the investigated region. We also proved that epistatic effects can play a significant role even when simple marker statistics (e.g. trend) are used. We finally showed that the overall performance of a GWA study strongly depends on the prevalence of the disease: the larger the prevalence, the better the power.
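Drawing binary phenotypes with a fixed total number of cases can be sketched with a backward recursion followed by a forward draw. This is an illustration in the spirit of the backward sampling idea, not the authors' algorithm or the waffect package: it assumes each individual has an independent disease probability given their genotype, and the names `sample_fixed_cases` and `probs` are hypothetical.

```python
import random


def sample_fixed_cases(probs, n_cases, rng=random):
    """Draw y in {0,1}^n with y_i ~ Bernoulli(probs[i]), given sum(y) == n_cases.

    probs   : per-individual disease probabilities under the model
              (hypothetical input, e.g. from a logistic disease model)
    n_cases : required total number of cases
    rng     : object with a random() method returning a float in [0, 1)
    """
    n = len(probs)
    # Backward pass: B[i][m] = P(y_i + ... + y_{n-1} == m).
    B = [[0.0] * (n_cases + 1) for _ in range(n + 1)]
    B[n][0] = 1.0
    for i in range(n - 1, -1, -1):
        p = probs[i]
        for m in range(n_cases + 1):
            B[i][m] = (1.0 - p) * B[i + 1][m]
            if m > 0:
                B[i][m] += p * B[i + 1][m - 1]
    if B[0][n_cases] == 0.0:
        raise ValueError("the requested number of cases has probability zero")

    # Forward pass: draw each y_i given how many cases remain to be placed.
    y, remaining = [], n_cases
    for i in range(n):
        if remaining > 0:
            p_one = probs[i] * B[i + 1][remaining - 1] / B[i][remaining]
        else:
            p_one = 0.0
        yi = 1 if rng.random() < p_one else 0
        y.append(yi)
        remaining -= yi
    return y


random.seed(1)
y = sample_fixed_cases([0.1, 0.5, 0.9, 0.3, 0.7], n_cases=2)
print(y, sum(y))
```

Unlike a rejection scheme, which redraws the whole vector until the case count matches, every draw here is accepted. The cost is a table of size n times (n_cases + 1).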

Preprint, 2012, arXiv

Hidden Markov Model Applications in Change-Point Analysis

The detection of change-points in heterogeneous sequences is a statistical challenge with many applications in fields such as finance, signal analysis and biology. A wide variety of literature exists for finding an ideal set of change-points for characterizing the data. In this tutorial we elaborate on the Hidden Markov Model (HMM) and present two different frameworks for applying HMM to change-point models. Then we provide a summary of two procedures for inference in change-point analysis, which are particular cases of the forward-backward algorithm for HMMs, and discuss common implementation problems. Lastly, we provide two examples of the HMM methods on available data sets and briefly discuss applications to current genomics studies. The R code used in the examples is provided in the appendix.

Preprint, 2012, arXiv

Measuring the Influence of Observations in HMMs through the Kullback-Leibler Distance

We measure the influence of individual observations on the sequence of the hidden states of the Hidden Markov Model (HMM) by means of the Kullback-Leibler distance (KLD). Namely, we consider the KLD between the conditional distribution of the hidden states' chain given the complete sequence of observations and the conditional distribution of the hidden chain given all the observations but the one under consideration. We introduce a linear complexity algorithm for computing the influence of all the observations. As an illustration, we investigate the application of our algorithm to the problem of detecting outliers in HMM data series.

