Source author record

Runze Li

Runze Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.ST Statistics Theory Methodology Applications Machine Learning Computer Vision cond-mat.mtrl-sci econ.EM eess.IV physics.acc-ph physics.comp-ph physics.flu-dyn physics.optics

Catalog footprint

What is connected

29works

13topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Dynamic Mortality Forecasting via Mixed-Frequency State-Space Models

High-frequency death counts are now widely available and contain timely information about intra-year mortality dynamics, but most stochastic mortality models are still estimated on annual data and therefore update only when annual totals are released. We propose a mixed-frequency state-space (MF--SS) extension of the Lee--Carter framework that jointly uses annual mortality rates and monthly death counts. The two series are linked through a shared latent monthly mortality factor, with the annual period factor defined as the intra-year average of the monthly factors. The latent monthly factor follows a seasonal ARIMA process, and parameters are estimated by maximum likelihood using an EM algorithm with Kalman filtering and smoothing. This setup enables real-time intra-year updates of the latent state and forecasts as new monthly observations arrive without re-estimating model parameters. Using U.S. data for ages 20--90 over 1999--2019, we evaluate intra-year annual nowcasts and one- to five-year-ahead forecasts. The MF--SS model produces both a direct annual forecast and an annual forecast implied by aggregating monthly projections. In our application, the aggregated monthly forecast is typically more accurate. Incorporating monthly information substantially improves intra-year annual nowcasts, especially after the first few months of the year. As a benchmark, we also fit separate annual and monthly Lee--Carter models and combine their forecasts using temporal reconciliation. Reconciliation improves these independent forecasts but adds little to MF--SS forecasts, consistent with MF--SS pooling information across frequencies during estimation. The MF--SS aggregated monthly forecasts generally outperform both unreconciled and temporally reconciled Lee--Carter forecasts and produce more cautious predictive intervals than the reconciled Lee--Carter approach.

preprint2026arXiv

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

Multimodal visual object tracking can be divided into to several kinds of tasks (e.g. RGB and RGB+X tracking), based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models to adapt to new modalities, which limits efficiency, scalability, and usability. Thus, we introduce OneTrackerV2, a unified multi-modal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multi-modal information into a unified space, allowing flexible modality fusion and robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multi-modal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks, while maintaining high inference efficiency. Notably, even after model compression, OneTrackerV2 retains strong performance. Moreover, OneTrackerV2 demonstrates remarkable robustness under modality-missing scenarios.

preprint2022arXiv

Causal Structural Learning on MPHIA Individual Dataset

The Population-based HIV Impact Assessment (PHIA) is an ongoing project that conducts nationally representative HIV-focused surveys for measuring national and regional progress toward UNAIDS' 90-90-90 targets, the primary strategy to end the HIV epidemic. We believe the PHIA survey offers a unique opportunity to better understand the key factors that drive the HIV epidemics in the most affected countries in sub-Saharan Africa. In this article, we propose a novel causal structural learning algorithm to discover important covariates and potential causal pathways for 90-90-90 targets. Existing constrained-based causal structural learning algorithms are quite aggressive in edge removal. The proposed algorithm preserves more information about important features and potential causal pathways. It is applied to the Malawi PHIA (MPHIA) data set and leads to interesting results. For example, it discovers age and condom usage to be important for female HIV awareness; the number of sexual partners to be important for male HIV awareness; and knowing the travel time to HIV care facilities leads to a higher chance of being treated for both females and males. We further compare and validate the proposed algorithm using BIC and using Monte Carlo simulations, and show that the proposed algorithm achieves improvement in true positive rates in important feature discovery over existing algorithms.

preprint2022arXiv

Invariance principle and CLT for the spiked eigenvalues of large-dimensional Fisher matrices and applications

This paper aims to derive asymptotical distributions of the spiked eigenvalues of the large-dimensional spiked Fisher matrices without Gaussian assumption and the restrictive assumptions on covariance matrices. We first establish invariance principle for the spiked eigenvalues of the Fisher matrix. That is, we show that the limiting distributions of the spiked eigenvalues are invariant over a large class of population distributions satisfying certain conditions. Using the invariance principle, we further established a central limit theorem (CLT) for the spiked eigenvalues. As some interesting applications, we use the CLT to derive the power functions of Roy Maximum root test for linear hypothesis in linear models and the test in signal detection. We conduct some Monte Carlo simulation to compare the proposed test with existing ones.

preprint2022arXiv

Model-Free Statistical Inference on High-Dimensional Data

This paper aims to develop an effective model-free inference procedure for high-dimensional data. We first reformulate the hypothesis testing problem via sufficient dimension reduction framework. With the aid of new reformulation, we propose a new test statistic and show that its asymptotic distribution is $χ^2$ distribution whose degree of freedom does not depend on the unknown population distribution. We further conduct power analysis under local alternative hypotheses. In addition, we study how to control the false discovery rate of the proposed $χ^2$ tests, which are correlated, to identify important predictors under a model-free framework. To this end, we propose a multiple testing procedure and establish its theoretical guarantees. Monte Carlo simulation studies are conducted to assess the performance of the proposed tests and an empirical analysis of a real-world data set is used to illustrate the proposed methodology.

preprint2022arXiv

MonoIndoor++:Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments. However, depth prediction results are not satisfying in indoor scenes where most of the existing data are captured with hand-held devices. As compared to outdoor environments, estimating depth of monocular videos for indoor environments, using self-supervised methods, results in two additional challenges: (i) the depth range of indoor video sequences varies a lot across different frames, making it difficult for the depth network to induce consistent depth cues for training; (ii) the indoor sequences recorded with handheld devices often contain much more rotational motions, which cause difficulties for the pose network to predict accurate relative camera poses. In this work, we propose a novel framework-MonoIndoor++ by giving special considerations to those challenges and consolidating a set of good practices for improving the performance of self-supervised monocular depth estimation for indoor environments. First, a depth factorization module with transformer-based scale regression network is proposed to estimate a global depth scale factor explicitly, and the predicted scale factor can indicate the maximum depth values. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose to utilize a residual pose estimation module to estimate relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinates guidance for our residual pose estimation module, we propose to perform coordinate convolutional encoding directly over the inputs to pose networks. The proposed method is validated on a variety of benchmark indoor datasets, i.e., EuRoC MAV, NYUv2, ScanNet and 7-Scenes, demonstrating the state-of-the-art performance.

preprint2022arXiv

Multiple-Splitting Projection Test for High-Dimensional Mean Vectors

We propose a multiple-splitting projection test (MPT) for one-sample mean vectors in high-dimensional settings. The idea of projection test is to project high-dimensional samples to a 1-dimensional space using an optimal projection direction such that traditional tests can be carried out with projected samples. However, estimation of the optimal projection direction has not been systematically studied in the literature. In this work, we bridge the gap by proposing a consistent estimation via regularized quadratic optimization. To retain type I error rate, we adopt a data-splitting strategy when constructing test statistics. To mitigate the power loss due to data-splitting, we further propose a test via multiple splits to enhance the testing power. We show that the $p$-values resulted from multiple splits are exchangeable. Unlike existing methods which tend to conservatively combine dependent $p$-values, we develop an exact level $α$ test that explicitly utilizes the exchangeability structure to achieve better power. Numerical studies show that the proposed test well retains the type I error rate and is more powerful than state-of-the-art tests.

preprint2022arXiv

Physically Interpretable Feature Learning and Inverse Design of Supercritical Airfoils

Machine-learning models have demonstrated a great ability to learn complex patterns and make predictions. In high-dimensional nonlinear problems of fluid dynamics, data representation often greatly affects the performance and interpretability of machine learning algorithms. With the increasing application of machine learning in fluid dynamics studies, the need for physically explainable models continues to grow. This paper proposes a feature learning algorithm based on variational autoencoders, which is able to assign physical features to some latent variables of the variational autoencoder. In addition, it is theoretically proved that the remaining latent variables are independent of the physical features. The proposed algorithm is trained to include shock wave features in its latent variables for the reconstruction of supercritical pressure distributions. The reconstruction accuracy and physical interpretability are also compared with those of other variational autoencoders. Then, the proposed algorithm is used for the inverse design of supercritical airfoils, which enables the generation of airfoil geometries based on physical features rather than the complete pressure distributions. It also demonstrates the ability to manipulate certain pressure distribution features of the airfoil without changing the others.

preprint2021arXiv

Model-free Feature Screening and FDR Control with Knockoff Features

This paper proposes a model-free and data-adaptive feature screening method for ultra-high dimensional datasets. The proposed method is based on the projection correlation which measures the dependence between two random vectors. This projection correlation based method does not require specifying a regression model and applies to the data in the presence of heavy-tailed errors and multivariate response. It enjoys both sure screening and rank consistency properties under weak assumptions. Further, a two-step approach is proposed to control the false discovery rate (FDR) in feature screening with the help of knockoff features. It can be shown that the proposed two-step approach enjoys both sure screening and FDR control if the pre-specified FDR level $α$ is greater or equal to $1/s$, where $s$ is the number of active features. The superior empirical performance of the proposed methods is justified by various numerical experiments and real data applications.

preprint2021arXiv

Space Charge Driven Emittance Growth and the Effect of Octupoles in IOTA

The Integrable Optics Test Accelerator (IOTA) at Fermilab is a small machine dedicated to a broad frontier accelerator physics program. An important aspect of this program is to investigate the potential benefits of the resonance free tune spread achievable with integrable optics to store and accelerate high intensity proton beams for which space charge is significant. In this context, a good understanding of proton beam emittance growth and particle loss mechanisms is essential. Assuming nominal design parameters, simulations show that for a bunched beam, the bulk of emittance growth takes place immediately following injection, typically within tens of turns. We attempt to account for this growth using a simplified RMS mismatch theory; some of its limitations and possible improvements are briefly discussed. We then compare theoretical predictions to simulations performed using the PIC code pyORBIT. Further exploring ways to mitigate emittance growth and reduce particle loss, we compare two beam matching strategies: (1) matching at the injection point (2) matching at the center of the nonlinear (octupole) insertion region where $β_x = β_y$. To observe how nonlinearity affects emittance growth and whether it dominates growth due to mismatch, we track two different distributions. Finally, we explore the potential of using octupoles in a quasi-integrable configuration to mitigate growth using a variety of initial distributions both at reduced and full intensities.

preprint2020arXiv

A New Procedure for Controlling False Discovery Rate in Large-Scale t-tests

This paper is concerned with false discovery rate (FDR) control in large-scale multiple testing problems. We first propose a new data-driven testing procedure for controlling the FDR in large-scale t-tests for one-sample mean problem. The proposed procedure achieves exact FDR control in finite sample settings when the populations are symmetric no matter the number of tests or sample sizes. Comparing with the existing bootstrap method for FDR control, the proposed procedure is computationally efficient. We show that the proposed method can control the FDR asymptotically for asymmetric populations even when the test statistics are not independent. We further show that the proposed procedure with a simple correction is as accurate as the bootstrap method to the second-order degree, and could be much more effective than the existing normal calibration. We extend the proposed procedure to two-sample mean problem. Empirical results show that the proposed procedures have better FDR control than existing ones when the proportion of true alternative hypotheses is not too low, while maintaining reasonably good detection ability.

preprint2020arXiv

A Robust Consistent Information Criterion for Model Selection based on Empirical Likelihood

Conventional likelihood-based information criteria for model selection rely on the distribution assumption of data. However, for complex data that are increasingly available in many scientific fields, the specification of their underlying distribution turns out to be challenging, and the existing criteria may be limited and are not general enough to handle a variety of model selection problems. Here, we propose a robust and consistent model selection criterion based upon the empirical likelihood function which is data-driven. In particular, this framework adopts plug-in estimators that can be achieved by solving external estimating equations, not limited to the empirical likelihood, which avoids potential computational convergence issues and allows versatile applications, such as generalized linear models, generalized estimating equations, penalized regressions and so on. The formulation of our proposed criterion is initially derived from the asymptotic expansion of the marginal likelihood under variable selection framework, but more importantly, the consistent model selection property is established under a general context. Extensive simulation studies confirm the out-performance of the proposal compared to traditional model selection criteria. Finally, an application to the Atherosclerosis Risk in Communities Study illustrates the practical value of this proposed framework.

preprint2020arXiv

Estimation and Inference for the Mediation Effect in a Time-varying Mediation Model

Traditional mediation analysis typically examines the relations among an intervention, a time-invariant mediator, and a time-invariant outcome variable. Although there may be a direct effect of the intervention on the outcome, there is a need to understand the process by which the intervention affects the outcome (i.e. the indirect effect through the mediator). This indirect effect is frequently assumed to be time-invariant. With improvements in data collection technology, it is possible to obtain repeated assessments over time resulting in intensive longitudinal data. This calls for an extension of traditional mediation analysis to incorporate time-varying variables as well as time-varying effects. In this paper, we focus on estimation and inference for the time-varying mediation model, which allows mediation effects to vary as a function of time. We propose a two-step approach to estimate the time-varying mediation effect. Moreover, we use a simulation based approach to derive the corresponding point-wise confidence band for the time-varying mediation effect. Simulation studies show that the proposed procedures perform well when comparing the confidence band and the true underlying model. We further apply the proposed model and the statistical inference procedure to real-world data collected from a smoking cessation study.

preprint2020arXiv

Inference in High-Dimensional Linear Measurement Error Models

For a high-dimensional linear model with a finite number of covariates measured with error, we study statistical inference on the parameters associated with the error-prone covariates, and propose a new corrected decorrelated score test and the corresponding one-step estimator. We further establish asymptotic properties of the newly proposed test statistic and the one-step estimator. Under local alternatives, we show that the limiting distribution of our corrected decorrelated score test statistic is non-central normal. The finite-sample performance of the proposed inference procedure is examined through simulation studies. We further illustrate the proposed procedure via an empirical analysis of a real data example.

preprint2020arXiv

Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection

We consider the stochastic contextual bandit problem under the high dimensional linear model. We focus on the case where the action space is finite and random, with each action associated with a randomly generated contextual covariate. This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine. However, it is very challenging as we need to balance exploration and exploitation. We propose doubly growing epochs and estimating the parameter using the best subset selection method, which is easy to implement in practice. This approach achieves $ \tilde{\mathcal{O}}(s\sqrt{T})$ regret with high probability, which is nearly independent in the ``ambient'' regression model dimension $d$. We further attain a sharper $\tilde{\mathcal{O}}(\sqrt{sT})$ regret by using the \textsc{SupLinUCB} framework and match the minimax lower bound of low-dimensional linear stochastic bandit problems. Finally, we conduct extensive numerical experiments to demonstrate the applicability and robustness of our algorithms empirically.

preprint2020arXiv

Towards Visually Explaining Variational Autoencoders

Recent advances in Convolutional Neural Network (CNN) model interpretability have led to impressive progress in visualizing and understanding model predictions. In particular, gradient-based visual attention methods have driven much recent effort in using visual attention maps as a means for visual explanations. A key problem, however, is these methods are designed for classification and categorization tasks, and their extension to explaining generative models, e.g. variational autoencoders (VAE) is not trivial. In this work, we take a step towards bridging this crucial gap, proposing the first technique to visually explain VAEs by means of gradient-based attention. We present methods to generate visual attention from the learned latent space, and also demonstrate such attention explanations serve more than just explaining VAE predictions. We show how these attention maps can be used to localize anomalies in images, demonstrating state-of-the-art performance on the MVTec-AD dataset. We also show how they can be infused into model training, helping bootstrap the VAE into learning improved latent space disentanglement, demonstrated on the Dsprites dataset.

preprint2019arXiv

Retrieval of non-sparse object through scattering media beyond the memory effect

Optical imaging through scattering media is a commonly confronted with the problem of reconstruction of complex objects and optical memory effect. To solve the problem, here, we propose a novel configuration based on the combination of ptychography and shower-curtain effect, which enables the retrieval of non-sparse samples through scattering media beyond the memory effect. Furthermore, by virtue of the shower-curtain effect, the proposed imaging system is insensitive to dynamic scattering media. Results from the retrieval of hair follicle section demonstrate the effectiveness and feasibility of the proposed method. The field of view is improved to 2.64mm. This present technique will be a potential approach for imaging through deep biological tissue.

preprint2016arXiv

Carrier emission of n-type Gallium Nitride illuminated by femtosecond laser pulses

The carrier emission efficiency of light emitting diodes is of fundamental importance for many technological applications, including the performance of GaN and other semiconductor photocathodes. We have measured the evolution of the emitted carriers and the associated transient electric field after femtosecond laser excitation of n-type GaN single crystals. These processes were studied using subpicosecond, ultrashort, electron pulses and explained by means of a three-layer analytical model. We find that for pump laser intensities on the order of 10^11 W/cm^2, the electrons that escaped from the crystal surface have a charge of about 2.7 pc and a velocity of about 1.8 um/ps. The associated transient electrical field evolves at intervals ranging from picoseconds to nanoseconds. These results provide a dynamic perspective on the photoemission properties of semiconductor photocathodes.

preprint2016arXiv

Global solutions to folded concave penalized nonconvex learning

This paper is concerned with solving nonconvex learning problems with folded concave penalty. Despite that their global solutions entail desirable statistical properties, they lack optimization techniques that guarantee global optimality in a general setting. In this paper, we show that a class of nonconvex learning problems are equivalent to general quadratic programs. This equivalence facilitates us in developing mixed integer linear programming reformulations, which admit finite algorithms that find a provably global optimal solution. We refer to this reformulation-based technique as the mixed integer programming-based global optimization (MIPGO). To our knowledge, this is the first global optimization scheme with a theoretical guarantee for folded concave penalized nonconvex learning with the SCAD penalty [J. Amer. Statist. Assoc. 96 (2001) 1348-1360] and the MCP penalty [Ann. Statist. 38 (2001) 894-942]. Numerical results indicate a significant outperformance of MIPGO over the state-of-the-art solution scheme, local linear approximation and other alternative solution techniques in literature in terms of solution quality.

preprint2015arXiv

A fast algorithm for detecting gene-gene interactions in genome-wide association studies

With the recent advent of high-throughput genotyping techniques, genetic data for genome-wide association studies (GWAS) have become increasingly available, which entails the development of efficient and effective statistical approaches. Although many such approaches have been developed and used to identify single-nucleotide polymorphisms (SNPs) that are associated with complex traits or diseases, few are able to detect gene-gene interactions among different SNPs. Genetic interactions, also known as epistasis, have been recognized to play a pivotal role in contributing to the genetic variation of phenotypic traits. However, because of an extremely large number of SNP-SNP combinations in GWAS, the model dimensionality can quickly become so overwhelming that no prevailing variable selection methods are capable of handling this problem. In this paper, we present a statistical framework for characterizing main genetic effects and epistatic interactions in a GWAS study. Specifically, we first propose a two-stage sure independence screening (TS-SIS) procedure and generate a pool of candidate SNPs and interactions, which serve as predictors to explain and predict the phenotypes of a complex trait. We also propose a rates adjusted thresholding estimation (RATE) approach to determine the size of the reduced model selected by an independence screening. Regularization regression methods, such as LASSO or SCAD, are then applied to further identify important genetic effects. Simulation studies show that the TS-SIS procedure is computationally efficient and has an outstanding finite sample performance in selecting potential SNPs as well as gene-gene interactions. We apply the proposed framework to analyze an ultrahigh-dimensional GWAS data set from the Framingham Heart Study, and select 23 active SNPs and 24 active epistatic interactions for the body mass index variation. It shows the capability of our procedure to resolve the complexity of genetic control.

preprint2015arXiv

Bayesian group Lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies

Although genome-wide association studies (GWAS) have proven powerful for comprehending the genetic architecture of complex traits, they are challenged by a high dimension of single-nucleotide polymorphisms (SNPs) as predictors, the presence of complex environmental factors, and longitudinal or functional natures of many complex traits or diseases. To address these challenges, we propose a high-dimensional varying-coefficient model for incorporating functional aspects of phenotypic traits into GWAS to formulate a so-called functional GWAS or fGWAS. The Bayesian group lasso and the associated MCMC algorithms are developed to identify significant SNPs and estimate how they affect longitudinal traits through time-varying genetic actions. The model is generalized to analyze the genetic control of complex traits using subject-specific sparse longitudinal data. The statistical properties of the new model are investigated through simulation studies. We use the new model to analyze a real GWAS data set from the Framingham Heart Study, leading to the identification of several significant SNPs associated with age-specific changes of body mass index. The fGWAS model, equipped with the Bayesian group lasso, will provide a useful tool for genetic and developmental analysis of complex traits or diseases.

preprint2015arXiv

On the Feasibility of Distributed Kernel Regression for Big Data

In modern scientific research, massive datasets with huge numbers of observations are frequently encountered. To facilitate the computational process, a divide-and-conquer scheme is often used for the analysis of big data. In such a strategy, a full dataset is first split into several manageable segments; the final output is then averaged from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown that whether such a distributive strategy provides valid theoretical inferences to the original data. In this paper, we address this fundamental issue for the distributed kernel regression (DKR), where the algorithmic feasibility is measured by the generalization performance of the resulting estimator. To justify DKR, a uniform convergence rate is needed for bounding the generalization error over the individual outputs, which brings new and challenging issues in the big data setup. Under mild conditions, we show that, with a proper number of segments, DKR leads to an estimator that is generalization consistent to the unknown regression function. The obtained results justify the method of DKR and shed light on the feasibility of using other distributed algorithms for processing big data. The promising preference of the method is supported by both simulation and real data examples.

preprint2013arXiv

Calibrating nonconvex penalized regression in ultra-high dimension

We investigate high-dimensional nonconvex penalized regression, where the number of covariates may grow at an exponential rate. Although recent asymptotic theory established that there exists a local minimum possessing the oracle property under general conditions, it is still largely an open problem how to identify the oracle estimator among potentially multiple local minima. There are two main obstacles: (1) due to the presence of multiple minima, the solution path is nonunique and is not guaranteed to contain the oracle estimator; (2) even if a solution path is known to contain the oracle estimator, the optimal tuning parameter depends on many unknown factors and is hard to estimate. To address these two challenging issues, we first prove that an easy-to-calculate calibrated CCCP algorithm produces a consistent solution path which contains the oracle estimator with probability approaching one. Furthermore, we propose a high-dimensional BIC criterion and show that it can be applied to the solution path to select the optimal tuning parameter which asymptotically identifies the oracle estimator. The theory for a general class of nonconvex penalties in the ultra-high dimensional setup is established when the random errors follow the sub-Gaussian distribution. Monte Carlo studies confirm that the calibrated CCCP algorithm combined with the proposed high-dimensional BIC has desirable performance in identifying the underlying sparsity pattern for high-dimensional data analysis.

preprint2013arXiv

Multivariate varying coefficient model for functional responses

Motivated by recent work studying massive imaging data in the neuroimaging literature, we propose multivariate varying coefficient models (MVCM) for modeling the relation between multiple functional responses and a set of covariates. We develop several statistical inference procedures for MVCM and systematically study their theoretical properties. We first establish the weak convergence of the local linear estimate of coefficient functions, as well as its asymptotic bias and variance, and then we derive asymptotic bias and mean integrated squared error of smoothed individual functions and their uniform convergence rate. We establish the uniform convergence rate of the estimated covariance function of the individual functions and its associated eigenvalue and eigenfunctions. We propose a global test for linear hypotheses of varying coefficient functions, and derive its asymptotic distribution under the null hypothesis. We also propose a simultaneous confidence band for each individual effect curve. We conduct Monte Carlo simulation to examine the finite-sample performance of the proposed procedures. We apply MVCM to investigate the development of white matter diffusivities along the genu tract of the corpus callosum in a clinical study of neurodevelopment.

preprint2012arXiv

Estimation and testing for partially linear single-index models

In partially linear single-index models, we obtain the semiparametrically efficient profile least-squares estimators of regression coefficients. We also employ the smoothly clipped absolute deviation penalty (SCAD) approach to simultaneously select variables and estimate regression coefficients. We show that the resulting SCAD estimators are consistent and possess the oracle property. Subsequently, we demonstrate that a proposed tuning parameter selector, BIC, identifies the true model consistently. Finally, we develop a linear hypothesis test for the parametric coefficients and a goodness-of-fit test for the nonparametric component, respectively. Monte Carlo studies are also presented.

preprint2012arXiv

Feature Screening via Distance Correlation Learning

This paper is concerned with screening features in ultrahigh dimensional data analysis, which has become increasingly important in diverse scientific fields. We develop a sure independence screening procedure based on the distance correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the sure independence screening procedure based on the Pearson correlation (SIS, for short) proposed by Fan and Lv (2008). However, the DC-SIS can significantly improve the SIS. Fan and Lv (2008) established the sure screening property for the SIS based on linear models, but the sure screening property is valid for the DC-SIS under more general settings including linear models. Furthermore, the implementation of the DC-SIS does not require model specification (e.g., linear model or generalized linear model) for responses or predictors. This is a very appealing property in ultrahigh dimensional data analysis. Moreover, the DC-SIS can be used directly to screen grouped predictor variables and for multivariate response variables. We establish the sure screening property for the DC-SIS, and conduct simulations to examine its finite sample performance. Numerical comparison indicates that the DC-SIS performs much better than the SIS in various models. We also illustrate the DC-SIS through a real data example.

preprint2012arXiv

Variable selection in linear mixed effects models

This paper is concerned with the selection and estimation of fixed and random effects in linear mixed effects models. We propose a class of nonconcave penalized profile likelihood methods for selecting and estimating important fixed effects. To overcome the difficulty of unknown covariance matrix of random effects, we propose to use a proxy matrix in the penalized profile likelihood. We establish conditions on the choice of the proxy matrix and show that the proposed procedure enjoys the model selection consistency where the number of fixed effects is allowed to grow exponentially with the sample size. We further propose a group variable selection strategy to simultaneously select and estimate important random effects, where the unknown covariance matrix of random effects is replaced with a proxy matrix. We prove that, with the proxy matrix appropriately chosen, the proposed procedure can identify all true random effects with asymptotic probability one, where the dimension of random effects vector is allowed to increase exponentially with the sample size. Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. We further illustrate the proposed procedures via a real data example.

preprint2011arXiv

New efficient estimation and variable selection methods for semiparametric varying-coefficient partially linear models

The complexity of semiparametric models poses new challenges to statistical inference and model selection that frequently arise from real applications. In this work, we propose new estimation and variable selection procedures for the semiparametric varying-coefficient partially linear model. We first study quantile regression estimates for the nonparametric varying-coefficient functions and the parametric regression coefficients. To achieve nice efficiency properties, we further develop a semiparametric composite quantile regression procedure. We establish the asymptotic normality of proposed estimators for both the parametric and nonparametric parts and show that the estimators achieve the best convergence rate. Moreover, we show that the proposed method is much more efficient than the least-squares-based method for many non-normal errors and that it only loses a small amount of efficiency for normal errors. In addition, it is shown that the loss in efficiency is at most 11.1% for estimating varying coefficient functions and is no greater than 13.6% for estimating parametric components. To achieve sparsity with high-dimensional covariates, we propose adaptive penalization methods for variable selection in the semiparametric varying-coefficient partially linear model and prove that the methods possess the oracle property. Extensive Monte Carlo simulation studies are conducted to examine the finite-sample performance of the proposed procedures. Finally, we apply the new methods to analyze the plasma beta-carotene level data.

preprint2010arXiv

Variable selection in measurement error models

Measurement error data or errors-in-variable data have been collected in many studies. Natural criterion functions are often unavailable for general functional measurement error models due to the lack of information on the distribution of the unobservable covariates. Typically, the parameter estimation is via solving estimating equations. In addition, the construction of such estimating equations routinely requires solving integral equations, hence the computation is often much more intensive compared with ordinary regression models. Because of these difficulties, traditional best subset variable selection procedures are not applicable, and in the measurement error model context, variable selection remains an unsolved issue. In this paper, we develop a framework for variable selection in measurement error models via penalized estimating equations. We first propose a class of selection procedures for general parametric measurement error models and for general semi-parametric measurement error models, and study the asymptotic properties of the proposed procedures. Then, under certain regularity conditions and with a properly chosen regularization parameter, we demonstrate that the proposed procedure performs as well as an oracle procedure. We assess the finite sample performance via Monte Carlo simulation studies and illustrate the proposed methodology through the empirical analysis of a familiar data set.

Runze Li

What is connected

Connect this record

See the researcher in context

Building this map preview

29 published item(s)

Dynamic Mortality Forecasting via Mixed-Frequency State-Space Models

Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

Causal Structural Learning on MPHIA Individual Dataset

Invariance principle and CLT for the spiked eigenvalues of large-dimensional Fisher matrices and applications

Model-Free Statistical Inference on High-Dimensional Data

MonoIndoor++:Towards Better Practice of Self-Supervised Monocular Depth Estimation for Indoor Environments

Multiple-Splitting Projection Test for High-Dimensional Mean Vectors

Physically Interpretable Feature Learning and Inverse Design of Supercritical Airfoils

Model-free Feature Screening and FDR Control with Knockoff Features

Space Charge Driven Emittance Growth and the Effect of Octupoles in IOTA

A New Procedure for Controlling False Discovery Rate in Large-Scale t-tests

A Robust Consistent Information Criterion for Model Selection based on Empirical Likelihood

Estimation and Inference for the Mediation Effect in a Time-varying Mediation Model

Inference in High-Dimensional Linear Measurement Error Models

Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection

Towards Visually Explaining Variational Autoencoders

Retrieval of non-sparse object through scattering media beyond the memory effect

Carrier emission of n-type Gallium Nitride illuminated by femtosecond laser pulses

Global solutions to folded concave penalized nonconvex learning

A fast algorithm for detecting gene-gene interactions in genome-wide association studies

Bayesian group Lasso for nonparametric varying-coefficient models with application to functional genome-wide association studies

On the Feasibility of Distributed Kernel Regression for Big Data

Calibrating nonconvex penalized regression in ultra-high dimension

Multivariate varying coefficient model for functional responses

Estimation and testing for partially linear single-index models

Feature Screening via Distance Correlation Learning

Variable selection in linear mixed effects models

New efficient estimation and variable selection methods for semiparametric varying-coefficient partially linear models

Variable selection in measurement error models