Source author record

Yanyuan Ma

Yanyuan Ma appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.ST Statistics Theory Methodology Machine Learning Applications Computation

Catalog footprint

What is connected

12works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

On High dimensional Poisson models with measurement error: hypothesis testing for nonlinear nonconvex optimization

We study estimation and testing in the Poisson regression model with noisy high dimensional covariates, which has wide applications in analyzing noisy big data. Correcting for the estimation bias due to the covariate noise leads to a non-convex target function to minimize. Treating the high dimensional issue further leads us to augment an amenable penalty term to the target function. We propose to estimate the regression parameter through minimizing the penalized target function. We derive the L1 and L2 convergence rates of the estimator and prove the variable selection consistency. We further establish the asymptotic normality of any subset of the parameters, where the subset can have infinitely many components as long as its cardinality grows sufficiently slow. We develop Wald and score tests based on the asymptotic normality of the estimator, which permits testing of linear functions of the members if the subset. We examine the finite sample performance of the proposed tests by extensive simulation. Finally, the proposed method is successfully applied to the Alzheimer's Disease Neuroimaging Initiative study, which motivated this work initially.

preprint2022arXiv

Semiparametric Approach to Estimation of Marginal and Quantile Effects

We consider a semiparametric generalized linear model and study estimation of both marginal and quantile effects in this model. We propose an approximate maximum likelihood estimator, and rigorously establish the consistency, the asymptotic normality, and the semiparametric efficiency of our method in both the marginal effect and the quantile effect estimation. Simulation studies are conducted to illustrate the finite sample performance, and we apply the new tool to analyze a Swiss non-labor income data and discover a new interesting predictor.

preprint2021arXiv

Efficient computational algorithms for approximate optimal designs

In this paper, we propose two simple yet efficient computational algorithms to obtain approximate optimal designs for multi-dimensional linear regression on a large variety of design spaces. We focus on the two commonly used optimal criteria, $D$- and $A$-optimal criteria. For $D$-optimality, we provide an alternative proof for the monotonic convergence for $D$-optimal criterion and propose an efficient computational algorithm to obtain the approximate $D$-optimal design. We further show that the proposed algorithm converges to the $D$-optimal design, and then prove that the approximate $D$-optimal design converges to the continuous $D$-optimal design under certain conditions. For $A$-optimality, we provide an efficient algorithm to obtain approximate $A$-optimal design and conjecture the monotonicity of the proposed algorithm. Numerical comparisons suggest that the proposed algorithms perform well and they are comparable or superior to some existing algorithms.

preprint2020arXiv

Inference in High-Dimensional Linear Measurement Error Models

For a high-dimensional linear model with a finite number of covariates measured with error, we study statistical inference on the parameters associated with the error-prone covariates, and propose a new corrected decorrelated score test and the corresponding one-step estimator. We further establish asymptotic properties of the newly proposed test statistic and the one-step estimator. Under local alternatives, we show that the limiting distribution of our corrected decorrelated score test statistic is non-central normal. The finite-sample performance of the proposed inference procedure is examined through simulation studies. We further illustrate the proposed procedure via an empirical analysis of a real data example.

preprint2020arXiv

Optimal subsampling for quantile regression in big data

We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling probabilities. One version minimizes the trace of the asymptotic variance-covariance matrix for a linearly transformed parameter estimator and the other minimizes that of the original parameter estimator. The former does not depend on the densities of the responses given covariates and is easy to implement. Algorithms based on optimal subsampling probabilities are proposed and asymptotic distributions and asymptotic optimality of the resulting estimators are established. Furthermore, we propose an iterative subsampling procedure based on the optimal subsampling probabilities in the linearly transformed parameter estimation which has great scalability to utilize available computational resources. In addition, this procedure yields standard errors for parameter estimators without estimating the densities of the responses given the covariates. We provide numerical examples based on both simulated and real data to illustrate the proposed method.

preprint2015arXiv

A Paradigmatic Regression Algorithm for Gene Selection Problems

Motivation: Gene selection has become a common task in most gene expression studies. The objective of such research is often to identify the smallest possible set of genes that can still achieve good predictive performance. The problem of assigning tumours to a known class is a particularly important example that has received considerable attention in the last ten years. Many of the classification methods proposed recently require some form of dimension-reduction of the problem. These methods provide a single model as an output and, in most cases, rely on the likelihood function in order to achieve variable selection. Results: We propose a prediction-based objective function that can be tailored to the requirements of practitioners and can be used to assess and interpret a given problem. The direct optimization of such a function can be very difficult because the problem is potentially discontinuous and nonconvex. We therefore propose a general procedure for variable selection that resembles importance sampling to explore the feature space. Our proposal compares favorably with competing alternatives when applied to two cancer data sets in that smaller models are obtained for better or at least comparable classification errors. Furthermore by providing a set of selected models instead of a single one, we construct a network of possible models for a target prediction accuracy level.

preprint2015arXiv

Functional Inverse Regression in an Enlarged Dimension Reduction Space

We consider an enlarged dimension reduction space in functional inverse regression. Our operator and functional analysis based approach facilitates a compact and rigorous formulation of the functional inverse regression problem. It also enables us to expand the possible space where the dimension reduction functions belong. Our formulation provides a unified framework so that the classical notions, such as covariance standardization, Mahalanobis distance, SIR and linear discriminant analysis, can be naturally and smoothly carried out in our enlarged space. This enlarged dimension reduction space also links to the linear discriminant space of Gaussian measures on a separable Hilbert space.

preprint2015arXiv

Fused kernel-spline smoothing for repeatedly measured outcomes in a generalized partially linear model with functional single index

We propose a generalized partially linear functional single index risk score model for repeatedly measured outcomes where the index itself is a function of time. We fuse the nonparametric kernel method and regression spline method, and modify the generalized estimating equation to facilitate estimation and inference. We use local smoothing kernel to estimate the unspecified coefficient functions of time, and use B-splines to estimate the unspecified function of the single index component. The covariance structure is taken into account via a working model, which provides valid estimation and inference procedure whether or not it captures the true covariance. The estimation method is applicable to both continuous and discrete outcomes. We derive large sample properties of the estimation procedure and show a different convergence rate for each component of the model. The asymptotic properties when the kernel and regression spline methods are combined in a nested fashion has not been studied prior to this work, even in the independent data case.

preprint2014arXiv

Combining isotonic regression and EM algorithm to predict genetic risk under monotonicity constraint

In certain genetic studies, clinicians and genetic counselors are interested in estimating the cumulative risk of a disease for individuals with and without a rare deleterious mutation. Estimating the cumulative risk is difficult, however, when the estimates are based on family history data. Often, the genetic mutation status in many family members is unknown; instead, only estimated probabilities of a patient having a certain mutation status are available. Also, ages of disease-onset are subject to right censoring. Existing methods to estimate the cumulative risk using such family-based data only provide estimation at individual time points, and are not guaranteed to be monotonic or nonnegative. In this paper, we develop a novel method that combines Expectation-Maximization and isotonic regression to estimate the cumulative risk across the entire support. Our estimator is monotonic, satisfies self-consistent estimating equations and has high power in detecting differences between the cumulative risks of different populations. Application of our estimator to a Parkinson's disease (PD) study provides the age-at-onset distribution of PD in PARK2 mutation carriers and noncarriers, and reveals a significant difference between the distribution in compound heterozygous carriers compared to noncarriers, but not between heterozygous carriers and noncarriers.

preprint2013arXiv

Optimal variance estimation without estimating the mean function

We study the least squares estimator in the residual variance estimation context. We show that the mean squared differences of paired observations are asymptotically normally distributed. We further establish that, by regressing the mean squared differences of these paired observations on the squared distances between paired covariates via a simple least squares procedure, the resulting variance estimator is not only asymptotically normal and root-$n$ consistent, but also reaches the optimal bound in terms of estimation variance. We also demonstrate the advantage of the least squares estimator in comparison with existing methods in terms of the second order asymptotic properties.

preprint2010arXiv

A semiparametric efficient estimator in case-control studies

We construct a semiparametric estimator in case-control studies where the gene and the environment are assumed to be independent. A discrete or continuous parametric distribution of the genes is assumed in the model. A discrete distribution of the genes can be used to model the mutation or presence of certain group of genes. A continuous distribution allows the distribution of the gene effects to be in a finite-dimensional parametric family and can hence be used to model the gene expression levels. We leave the distribution of the environment totally unspecified. The estimator is derived through calculating the efficiency score function in a hypothetical setting where a close approximation to the samples is random. The resulting estimator is proved to be efficient in the hypothetical situation. The efficiency of the estimator is further demonstrated to hold in the case-control setting as well.

preprint2010arXiv

Variable selection in measurement error models

Measurement error data or errors-in-variable data have been collected in many studies. Natural criterion functions are often unavailable for general functional measurement error models due to the lack of information on the distribution of the unobservable covariates. Typically, the parameter estimation is via solving estimating equations. In addition, the construction of such estimating equations routinely requires solving integral equations, hence the computation is often much more intensive compared with ordinary regression models. Because of these difficulties, traditional best subset variable selection procedures are not applicable, and in the measurement error model context, variable selection remains an unsolved issue. In this paper, we develop a framework for variable selection in measurement error models via penalized estimating equations. We first propose a class of selection procedures for general parametric measurement error models and for general semi-parametric measurement error models, and study the asymptotic properties of the proposed procedures. Then, under certain regularity conditions and with a properly chosen regularization parameter, we demonstrate that the proposed procedure performs as well as an oracle procedure. We assess the finite sample performance via Monte Carlo simulation studies and illustrate the proposed methodology through the empirical analysis of a familiar data set.

Yanyuan Ma

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

On High dimensional Poisson models with measurement error: hypothesis testing for nonlinear nonconvex optimization

Semiparametric Approach to Estimation of Marginal and Quantile Effects

Efficient computational algorithms for approximate optimal designs

Inference in High-Dimensional Linear Measurement Error Models

Optimal subsampling for quantile regression in big data

A Paradigmatic Regression Algorithm for Gene Selection Problems

Functional Inverse Regression in an Enlarged Dimension Reduction Space

Fused kernel-spline smoothing for repeatedly measured outcomes in a generalized partially linear model with functional single index

Combining isotonic regression and EM algorithm to predict genetic risk under monotonicity constraint

Optimal variance estimation without estimating the mean function

A semiparametric efficient estimator in case-control studies

Variable selection in measurement error models