Source author record

Céline Lévy-Leduc

Céline Lévy-Leduc appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

math.ST, Statistics Theory, Methodology, Applications, Computation, Networking and Internet Architecture

Catalog footprint

What is connected

15 works

6 topics

4 close collaborators

preprint 2022 arXiv

Identification of prognostic and predictive biomarkers in high-dimensional data with PPLasso

In clinical trials, identification of prognostic and predictive biomarkers is essential to precision medicine. Prognostic biomarkers can be useful for the prevention of the occurrence of the disease, and predictive biomarkers can be used to identify patients with potential benefit from the treatment. Previous research has mainly focused on clinical characteristics, and the use of genomic data in this area has hardly been studied. A new method is required to simultaneously select prognostic and predictive biomarkers in high-dimensional genomic data where biomarkers are highly correlated. We propose a novel approach called PPLasso (Prognostic Predictive Lasso) integrating prognostic and predictive effects into one statistical model. PPLasso also takes into account the correlations between biomarkers that can alter the biomarker selection accuracy. Our method consists in transforming the design matrix to remove the correlations between the biomarkers before applying the generalized Lasso. In a comprehensive numerical evaluation, we show that PPLasso outperforms the traditional Lasso approach on both prognostic and predictive biomarker identification in various scenarios. Finally, our method is applied to publicly available transcriptomic data from the RV144 clinical trial. Our method is implemented in the PPLasso R package available from the Comprehensive R Archive Network (CRAN).

preprint 2022 arXiv

Variable selection in high-dimensional logistic regression models using a whitening approach

In bioinformatics, the rapid development of sequencing technology has enabled us to collect an increasing amount of omics data. Classification based on omics data is one of the central problems in biomedical research. However, omics data usually has a limited sample size but high feature dimensions, and it is assumed that only a few features (biomarkers) are active, i.e. informative to discriminate between different categories (cancer subtypes, responder/non-responder to treatment, for example). Identifying active biomarkers for classification has therefore become fundamental for omics data analysis. Focusing on binary classification, we propose an innovative feature selection method aiming at dealing with the high correlations between the biomarkers. Various studies have shown the notorious influence of correlated biomarkers and the difficulty of accurately identifying active ones. Our method, WLogit, consists in whitening the design matrix to remove the correlations between biomarkers, then using a penalized criterion adapted to the logistic regression model to select features. The performance of WLogit is assessed using synthetic data in several scenarios and compared with other approaches. The results suggest that WLogit can identify almost all active biomarkers even in the cases where the biomarkers are highly correlated, while the other methods fail, which consequently leads to higher classification accuracy. The performance is also evaluated on the classification of two lymphoma subtypes, and the obtained classifier also outperformed other methods. Our method is implemented in the WLogit R package available from the Comprehensive R Archive Network (CRAN).
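
The whitening step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the covariance matrix of the biomarkers is known, which is a simplification: the abstract does not say how the covariance is obtained in practice. The names `inverse_sqrt`, `sigma` and `X_whitened` are hypothetical and are not taken from the WLogit package.

```python
import numpy as np

def inverse_sqrt(sigma):
    """Symmetric inverse square root of a positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(sigma)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Hypothetical setting: p correlated biomarkers with covariance 0.9^|i-j|.
p, n = 5, 200
idx = np.arange(p)
sigma = 0.9 ** np.abs(idx[:, None] - idx[None, :])

rng = np.random.default_rng(0)
X = rng.multivariate_normal(np.zeros(p), sigma, size=n)

# Whitening: the columns of X_whitened are uncorrelated in population,
# because sigma^{-1/2} sigma sigma^{-1/2} is the identity matrix.
sigma_inv_half = inverse_sqrt(sigma)
X_whitened = X @ sigma_inv_half
```

A penalized logistic regression would then be fitted on the whitened design; that step is left out here because the abstract does not specify the criterion.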

preprint 2022 arXiv

Variable selection in sparse GLARMA models

In this paper, we propose a novel and efficient two-stage variable selection approach for sparse GLARMA models, which are pervasive for modeling discrete-valued time series. Our approach consists in iteratively combining the estimation of the autoregressive moving average (ARMA) coefficients of GLARMA models with regularized methods designed for performing variable selection in regression coefficients of Generalized Linear Models (GLM). We first establish the consistency of the ARMA part coefficient estimators in a specific case. Then, we explain how to efficiently implement our approach. Finally, we assess the performance of our methodology using synthetic data, compare it with alternative methods and illustrate it on an example of real-world application. Our approach, which is implemented in the GlarmaVarSel R package and available on CRAN, is very attractive since it benefits from a low computational load and is able to outperform the other methods in terms of coefficient estimation, particularly in recovering the non-null regression coefficients.

preprint 2020 arXiv

A variable selection approach for highly correlated predictors in high-dimensional genomic data

In genomic studies, identifying biomarkers associated with a variable of interest is a major concern in biomedical research. Regularized approaches are classically used to perform variable selection in high-dimensional linear models. However, these methods can fail in highly correlated settings. We propose a novel variable selection approach called WLasso, taking these correlations into account. It consists in rewriting the initial high-dimensional linear model to remove the correlation between the biomarkers (predictors) and in applying the generalized Lasso criterion. The performance of WLasso is assessed using synthetic data in several scenarios and compared with recent alternative approaches. The results show that when the biomarkers are highly correlated, WLasso outperforms the other approaches in sparse high-dimensional frameworks. The method is also successfully illustrated on publicly available gene expression data in breast cancer. Our method is implemented in the WLasso R package which is available from the Comprehensive R Archive Network.
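
The rewriting described in this abstract can be sketched as follows. This is one plausible reading, assuming a linear model whose predictors have covariance matrix $\Sigma$; the symbols $\widetilde{X}$, $\widetilde{\beta}$ and $\lambda$ are notation introduced here for illustration, and any further steps of WLasso are not described in the abstract.

```latex
% Initial model, with the rows of X having covariance matrix \Sigma (assumption):
y = X\beta + \varepsilon .
% Rewriting with decorrelated predictors:
\widetilde{X} = X\,\Sigma^{-1/2}, \qquad
\widetilde{\beta} = \Sigma^{1/2}\beta, \qquad
y = \widetilde{X}\widetilde{\beta} + \varepsilon .
% Generalized Lasso criterion: the \ell_1 penalty acts on a linear
% transform of \widetilde{\beta}, so that sparsity is still imposed on \beta:
\min_{\widetilde{\beta}} \;
\bigl\| y - \widetilde{X}\widetilde{\beta} \bigr\|_2^2
+ \lambda \bigl\| \Sigma^{-1/2}\widetilde{\beta} \bigr\|_1 .
```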

preprint 2016 arXiv

Fast Detection of Block Boundaries in Block Wise Constant Matrices: An Application to HiC data

We propose a novel approach for estimating the location of block boundaries (change-points) in a random matrix consisting of a blockwise constant matrix observed in white noise. Our method consists in rephrasing this task as a variable selection issue. We use a penalized least-squares criterion with an $\ell_1$-type penalty for dealing with this issue. We first provide some theoretical results ensuring the consistency of our change-point estimators. Then, we explain how to implement our method in a very efficient way. Finally, we provide some empirical evidence to support our claims and apply our approach to HiC data which are used in molecular biology for better understanding the influence of the chromosomal conformation on cell functioning.
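
To see how change-point estimation can be rephrased as variable selection, consider a one-dimensional analogue. This is an illustrative simplification: the abstract concerns matrices, and the notation $T$, $\beta$ and $\lambda$ is introduced here rather than taken from the paper.

```latex
% Piecewise-constant mean observed in noise:
y_i = \mu_i + \varepsilon_i, \qquad i = 1, \dots, n .
% Write the mean vector through its increments,
% \beta_1 = \mu_1 and \beta_j = \mu_j - \mu_{j-1} for j \geq 2:
\mu = T\beta, \qquad T_{ij} = \mathbf{1}_{\{i \geq j\}} .
% A change-point at position j corresponds to \beta_j \neq 0 (j \geq 2),
% so locating change-points amounts to selecting the nonzero entries of \beta:
\min_{\beta \in \mathbb{R}^n} \;
\| y - T\beta \|_2^2 + \lambda \| \beta \|_1 .
```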

preprint 2016 arXiv

Improving heritability estimation by a variable selection approach in sparse high dimensional linear mixed models

Motivated by applications in neuroanatomy, we propose a novel methodology for estimating the heritability which corresponds to the proportion of phenotypic variance which can be explained by genetic factors. Estimating this quantity for neuroanatomical features is a fundamental challenge in psychiatric disease research. Since the phenotypic variations may only be due to a small fraction of the available genetic information, we propose an estimator of the heritability that can be used in high dimensional sparse linear mixed models. Our method consists of three steps. Firstly, a variable selection stage is performed in order to recover the support of the genetic effects -- also called causal variants -- that is, to find the genetic effects which really explain the phenotypic variations. Secondly, we propose a maximum likelihood strategy for estimating the heritability which only takes into account the causal genetic effects found in the first step. Thirdly, we compute the standard error and the 95% confidence interval associated with our heritability estimator thanks to a nonparametric bootstrap approach. Our main contribution consists in providing an estimation of the heritability with standard errors substantially smaller than methods without variable selection when the genetic effects are very sparse. Since the real genetic architecture is in general unknown in practice, we also propose an empirical criterion which allows the user to decide whether it is relevant to apply a variable selection based approach or not. We illustrate the performance of our methodology on synthetic and real neuroanatomic data coming from the Imagen project. We also show that our approach has a very low computational burden and is very efficient from a statistical point of view.

preprint 2016 arXiv

Nonparametric homogeneity tests and multiple change-point estimation for analyzing large Hi-C data matrices

We propose a novel nonparametric approach for estimating the location of block boundaries (change-points) of non-overlapping blocks in a random symmetric matrix which consists of random variables having their distribution changing from one block to the other. Our method is based on a nonparametric two-sample homogeneity test for matrices that we extend to the more general case of several groups. We first provide some theoretical results for the two associated test statistics and we explain how to derive change-point location estimators. Then, some numerical experiments are given in order to support our claims. Finally, our approach is applied to Hi-C data which are used in molecular biology for better understanding the influence of the chromosomal conformation on cell functioning.

preprint 2015 arXiv

A robust approach for estimating change-points in the mean of an AR(1) process

We consider the problem of multiple change-point estimation in the mean of a Gaussian AR(1) process. Taking into account the dependence structure does not allow us to use the dynamic programming algorithm, which is the only algorithm giving the optimal solution in the independent case. We propose a robust estimator of the autocorrelation parameter, which is consistent and satisfies a central limit theorem. Then, we propose to follow the classical inference approach, by plugging this estimator in the criteria used for change-points estimation. We show that the asymptotic properties of these estimators are the same as those of the classical estimators in the independent framework. The same plug-in approach is then used to approximate the modified BIC and choose the number of segments. This method is implemented in the R package AR1seg and is available from the Comprehensive R Archive Network (CRAN). This package is used in the simulation section in which we show that for finite sample sizes taking into account the dependence structure improves the statistical performance of the change-point estimators and of the selection criterion.
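
The plug-in idea in this abstract can be sketched in a few lines. The estimator below is one median-based choice that is insensitive to a small number of mean shifts; it relies on the fact that, for a stationary AR(1) process, the variance of lag-2 differences divided by the variance of lag-1 differences equals 1 + rho. Whether it coincides with the estimator implemented in AR1seg is an assumption, and all function names here are hypothetical.

```python
import numpy as np

def robust_autocorrelation(y):
    """Median-based estimate of the AR(1) coefficient.

    Mean shifts only affect a few differences, so the medians
    are barely changed by them.
    """
    y = np.asarray(y, dtype=float)
    lag1 = np.median(np.abs(y[1:] - y[:-1]))
    lag2 = np.median(np.abs(y[2:] - y[:-2]))
    return (lag2 / lag1) ** 2 - 1.0

def decorrelate(y, rho):
    """Return y[t] - rho * y[t-1], whose noise is approximately independent."""
    y = np.asarray(y, dtype=float)
    return y[1:] - rho * y[:-1]

def simulate_ar1_with_shifts(n, rho, means, rng):
    """Piecewise-constant mean (equal-length segments) plus AR(1) noise."""
    eps = rng.normal(size=n)
    noise = np.empty(n)
    noise[0] = eps[0] / np.sqrt(1.0 - rho ** 2)  # stationary start
    for t in range(1, n):
        noise[t] = rho * noise[t - 1] + eps[t]
    mean = np.repeat(means, n // len(means))
    return mean + noise

rng = np.random.default_rng(1)
y = simulate_ar1_with_shifts(50_000, 0.5, [0.0, 2.0, -1.0, 1.0], rng)
rho_hat = robust_autocorrelation(y)
z = decorrelate(y, rho_hat)  # input for a standard segmentation method
```

The decorrelated series `z` can then be handed to a segmentation procedure designed for independent observations, which is the sense in which the estimator is "plugged in".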

preprint 2015 arXiv

Heritability estimation in high dimensional linear mixed models

Motivated by applications in genetic fields, we propose to estimate the heritability in high dimensional sparse linear mixed models. The heritability determines how the variance is shared between the different random components of a linear mixed model. The main novelty of our approach is to consider that the random effects can be sparse, that is, may contain null components, but we know neither their proportion nor their positions. The estimator that we consider is strongly inspired by the one proposed by Pirinen et al. (2013), and is based on a maximum likelihood approach. We also study the theoretical properties of our estimator, namely we establish that our estimator of the heritability is $\sqrt{n}$-consistent when both the number of observations $n$ and the number of random effects $N$ tend to infinity under mild assumptions. We also prove that our estimator of the heritability satisfies a central limit theorem which gives as a byproduct a confidence interval for the heritability. Some Monte-Carlo experiments are also conducted in order to show the finite-sample performance of our estimator.

preprint 2012 arXiv

Homogeneity and change-point detection tests for multivariate data using rank statistics

Detecting and locating changes in highly multivariate data is a major concern in several current statistical applications. In this context, the first contribution of the paper is a novel non-parametric two-sample homogeneity test for multivariate data based on the well-known Wilcoxon rank statistic. The proposed two-sample homogeneity test statistic can be extended to deal with ordinal or censored data as well as to test for the homogeneity of more than two samples. The second contribution of the paper concerns the use of the proposed test statistic to perform retrospective change-point analysis. It is first shown that the approach is computationally feasible even when looking for a large number of change-points thanks to the use of dynamic programming. Computable asymptotic $p$-values for the test are then provided in the case where a single potential change-point is to be detected. Compared to available alternatives, the proposed approach appears to be very reliable and robust. This is particularly true in situations where the data is contaminated by outliers or corrupted by noise and where the potential changes only affect subsets of the coordinates of the data.

preprint 2011 arXiv

Distributed detection/localization of change-points in high-dimensional network traffic data

We propose a novel approach for distributed statistical detection of change-points in high-volume network traffic. We consider more specifically the task of detecting and identifying the targets of Distributed Denial of Service (DDoS) attacks. The proposed algorithm, called DTopRank, performs distributed network anomaly detection by aggregating the partial information gathered in a set of network monitors. In order to address massive data while limiting the communication overhead within the network, the approach combines record filtering at the monitor level and a nonparametric rank test for doubly censored time series at the central decision site. The performance of the DTopRank algorithm is illustrated both on synthetic data and on a traffic trace provided by a major Internet service provider.

preprint 2011 arXiv

OMP-type Algorithm with Structured Sparsity Patterns for Multipath Radar Signals

A transmitted, unknown radar signal is observed at the receiver through more than one path in additive noise. The aim is to recover the waveform of the intercepted signal and to simultaneously estimate the direction of arrival (DOA). We propose an approach exploiting the parsimonious time-frequency representation of the signal by applying a new OMP-type algorithm for structured sparsity patterns. An important issue is the scalability of the proposed algorithm since high-dimensional models shall be used for radar signals. Monte-Carlo simulations for modulated signals illustrate the good performance of the method even for low signal-to-noise ratios and a gain of 20 dB for the DOA estimation compared with an elementary method.

preprint 2011 arXiv

Robust Retrospective Multiple Change-point Estimation for Multivariate Data

We propose a non-parametric statistical procedure for detecting multiple change-points in multidimensional signals. The method is based on a test statistic that generalizes the well-known Kruskal-Wallis procedure to the multivariate setting. The proposed approach does not require any knowledge about the distribution of the observations and is parameter-free. It is computationally efficient thanks to the use of dynamic programming and can also be applied when the number of change-points is unknown. The method is shown through simulations to be more robust than alternatives, particularly when faced with atypical distributions (e.g., with outliers), high noise levels and/or high-dimensional data.
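
The role of dynamic programming in multiple change-point estimation can be illustrated with a small sketch. Note the substitution: the abstract uses a rank-based statistic generalizing Kruskal-Wallis, whereas the code below uses a plain least-squares segment cost on a univariate signal, chosen only because it is short to write. The function name `segment_least_squares` is hypothetical.

```python
import numpy as np

def segment_least_squares(y, n_segments):
    """Best split of y into n_segments contiguous pieces (least-squares cost).

    Returns the sorted start indices of segments 2, ..., n_segments.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    cs = np.concatenate(([0.0], np.cumsum(y)))
    cs2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Sum of squared deviations from the mean on y[i:j], with j > i.
        s = cs[j] - cs[i]
        return cs2[j] - cs2[i] - s * s / (j - i)

    # best[k, j]: minimal cost of splitting y[:j] into k segments.
    best = np.full((n_segments + 1, n + 1), np.inf)
    arg = np.zeros((n_segments + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1, i] + cost(i, j)
                if c < best[k, j]:
                    best[k, j] = c
                    arg[k, j] = i

    # Backtrack from the last segment to recover the boundaries.
    change_points = []
    j = n
    for k in range(n_segments, 1, -1):
        j = arg[k, j]
        change_points.append(j)
    return sorted(change_points)

signal = [0.0] * 10 + [5.0] * 10 + [0.0] * 10
boundaries = segment_least_squares(signal, 3)
```

The recursion fills a table of size (number of segments) by (signal length), so the search over all segmentations costs a quadratic number of segment evaluations per segment count instead of an exponential one.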

preprint 2010 arXiv

Asymptotic properties of U-processes under long-range dependence

Let $(X_i)_{i\geq 1}$ be a stationary mean-zero Gaussian process with covariances $\rho(k)=\mathbb{E}(X_{1}X_{k+1})$ satisfying $\rho(0)=1$ and $\rho(k)=k^{-D} L(k)$, where $D$ is in $(0,1)$ and $L$ is slowly varying at infinity. Consider the $U$-process $\{U_n(r),\; r\in I\}$ defined as $$ U_n(r)=\frac{1}{n(n-1)}\sum_{1\leq i\neq j\leq n}\mathbf{1}_{\{G(X_i,X_j)\leq r\}}, $$ where $I$ is an interval included in $\mathbb{R}$ and $G$ is a symmetric function. In this paper, we provide central and non-central limit theorems for $U_n$. They are used to derive the asymptotic behavior of the Hodges-Lehmann estimator, the Wilcoxon signed-rank statistic, the sample correlation integral and an associated scale estimator. The limiting distributions are expressed through multiple Wiener-Itô integrals.

preprint 2010 arXiv

Central limit theorem for the robust log-regression wavelet estimation of the memory parameter in the Gaussian semi-parametric context

In this paper, we study robust estimators of the memory parameter d of a (possibly) non-stationary Gaussian time series with generalized spectral density f. This generalized spectral density is characterized by the memory parameter d and by a function f* which specifies the short-range dependence structure of the process. Our setting is semi-parametric since both f* and d are unknown and d is the only parameter of interest. The memory parameter d is estimated by regressing the logarithm of the estimated variance of the wavelet coefficients at different scales. The two estimators of d that we consider are based on robust estimators of the variance of the wavelet coefficients, namely the square of the scale estimator proposed by Rousseeuw and Croux (1993) and the median of the square of the wavelet coefficients. We establish a Central Limit Theorem for these robust estimators as well as for the estimator of d based on the classical estimator of the variance proposed by Moulines, Roueff and Taqqu (2007). Some Monte-Carlo experiments are presented to illustrate our claims and compare the performance of the different estimators. The properties of the three estimators are also compared on the Nile River data and the Internet traffic packet counts data. The theoretical results and the empirical evidence strongly suggest using the robust estimators as an alternative to estimate the memory parameter d of Gaussian time series.
