Source author record

Hung Hung

Hung Hung appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Methodology, Applications, math.OC, math.ST, Statistics Theory

Catalog footprint

What is connected

12 works

5 topics

4 close collaborators

Preprint, 2022, arXiv

Robust self-tuning semiparametric PCA for contaminated elliptical distribution

Principal component analysis (PCA) is one of the most popular dimension reduction methods. The usual PCA is known to be sensitive to the presence of outliers, and thus many robust PCA methods have been developed. Among them, Tyler's M-estimator is shown to be the most robust scatter estimator under the elliptical distribution. However, when the underlying distribution is contaminated and deviates from ellipticity, Tyler's M-estimator might not work well. In this article, we apply semiparametric theory to propose a robust semiparametric PCA. The merits of our proposal are twofold. First, it is robust to heavy-tailed elliptical distributions as well as to non-elliptical outliers. Second, it pairs well with a data-driven tuning procedure, which is based on the active ratio and can adapt to different degrees of data outlyingness. Theoretical properties are derived, including the influence functions for various statistical functionals and asymptotic normality. Simulation studies and a data analysis demonstrate the superiority of our method.
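
For orientation, Tyler's M-estimator named in the abstract is usually written as the solution of a fixed-point equation. The display below is the standard textbook form for centered observations, not the semiparametric estimator proposed in the preprint. The symbols $x_i$, $n$, $p$ and the trace normalization are assumptions introduced here for illustration.

```latex
% Tyler's M-estimator of scatter for centered x_1, ..., x_n in R^p.
% The trace constraint is one common way to fix the arbitrary scale.
\hat{\Sigma}
  = \frac{p}{n} \sum_{i=1}^{n}
    \frac{x_i x_i^{\top}}{x_i^{\top} \hat{\Sigma}^{-1} x_i},
\qquad
\operatorname{tr}(\hat{\Sigma}) = p .
```

Each observation enters only through its direction $x_i / \lVert x_i \rVert$, which is why the estimator is insensitive to heavy tails within the elliptical family.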

Preprint, 2020, arXiv

A generalized information criterion for high-dimensional PCA rank selection

Principal component analysis (PCA) is the most commonly used statistical procedure for dimension reduction. An important issue in applying PCA is determining the rank, which is the number of dominant eigenvalues of the covariance matrix. The Akaike information criterion (AIC) and Bayesian information criterion (BIC) are among the most widely used rank selection methods. Both use the number of free parameters for assessing model complexity. In this work, we adopt the generalized information criterion (GIC) to propose a new method for PCA rank selection under the high-dimensional framework. The GIC model complexity takes into account the sizes of covariance eigenvalues and adapts better to practical applications. Asymptotic properties of GIC are derived and the selection consistency is established under the generalized spiked covariance model.
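
As a reminder of what "number of free parameters" means here, the two classical criteria can be written in their generic form. This is the textbook definition, not the GIC penalty derived in the preprint. The symbols $\ell_n$, $\hat{\theta}_r$ and $d_r$ are notation assumed for this sketch.

```latex
% Generic AIC and BIC for a candidate rank r, where \ell_n is the
% log-likelihood, \hat{\theta}_r the fitted rank-r model, and d_r its
% number of free parameters (all symbols are illustrative).
\mathrm{AIC}(r) = -2\,\ell_n(\hat{\theta}_r) + 2\,d_r ,
\qquad
\mathrm{BIC}(r) = -2\,\ell_n(\hat{\theta}_r) + d_r \log n ,
\qquad
\hat{r} = \arg\min_{r} \mathrm{IC}(r) .
```

In both cases the complexity term depends on the model only through the count $d_r$. According to the abstract, the GIC complexity additionally reflects the sizes of the covariance eigenvalues.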

Preprint, 2016, arXiv

A low-rank based estimation-testing procedure for matrix-covariate regression

Matrix covariates are now frequently encountered in biomedical research. It is common to fit conventional statistical models by vectorizing the matrix covariate. This strategy, however, results in a large number of parameters, while the available sample size is too small to yield reliable analysis results. To overcome the problem of high dimensionality in hypothesis testing, the variance component test has been proposed with promising detection power, but it does not straightforwardly provide estimates of effect size. In this work, we overcome the problem of high dimensionality by utilizing the inherent structure of the matrix covariate. The advantage is that estimation and hypothesis testing can be conducted simultaneously as in the conventional case, while the estimation efficiency and detection power can be largely improved, due to a parsimonious parameterization of the coefficients of the matrix covariate. Our method is applied to test the significance of gene-gene interactions in the PSQI data and to test whether electroencephalography is associated with alcoholic status in the EEG data, wherein sparse effects and low-rank effects of matrix covariates are identified, respectively.

Preprint, 2014, arXiv

$γ$-SUP: A clustering algorithm for cryo-electron microscopy images of asymmetric particles

Cryo-electron microscopy (cryo-EM) has recently emerged as a powerful tool for obtaining three-dimensional (3D) structures of biological macromolecules in native states. A minimum cryo-EM image data set for deriving a meaningful reconstruction comprises thousands of randomly oriented projections of identical particles photographed with a small number of electrons. The computation of 3D structure from 2D projections requires clustering, which aims to enhance the signal-to-noise ratio in each view by grouping similarly oriented images. Nevertheless, the prevailing clustering techniques are often compromised by three characteristics of cryo-EM data: high noise content, high dimensionality and a large number of clusters. Moreover, since clustering requires registering images of similar orientation into the same pixel coordinates by 2D alignment, it is desirable that the clustering algorithm can label misaligned images as outliers. Herein, we introduce a clustering algorithm, $γ$-SUP, which models the data with a $q$-Gaussian mixture, adopts the minimum $γ$-divergence for estimation, and then uses a self-updating procedure to obtain the numerical solution. We apply $γ$-SUP to the cryo-EM images of two benchmark macromolecules, RNA polymerase II and ribosome. In the former case, simulated images were chosen to decouple clustering from alignment to demonstrate that $γ$-SUP is more robust to misalignment outliers than the existing clustering methods used in the cryo-EM community. In the latter case, the clustering of real cryo-EM data by our $γ$-SUP method eliminates noise in many views to reveal true structural features of the ribosome at the projection level.

Preprint, 2014, arXiv

Recovering rank-one matrices via rank-r matrices relaxation

PhaseLift, proposed by E.J. Candès et al., is a convex relaxation approach for phase retrieval. The relaxation enlarges the solution set from rank-one matrices to positive semidefinite matrices. In this paper, a relaxation is applied to nonconvex alternating minimization methods to recover the rank-one matrices. A generic measurement matrix can be standardized to a matrix consisting of orthonormal columns. To recover the rank-one matrix, the standardized frames are used to select the matrix with the maximal leading eigenvalue among the rank-$r$ matrices. Empirical studies are conducted to validate the effectiveness of this relaxation approach. In the case of Gaussian random matrices with a sufficient number of nearly orthogonal sensing vectors, we show that the singular vector corresponding to the least singular value is close to the unknown signal, and thus it can be a good initialization for the nonconvex minimization algorithm.

Preprint, 2014, arXiv

Sufficient dimension reduction with additional information

Sufficient dimension reduction is widely applied to help model building between the response $Y$ and covariate $X$. While the target of interest is the relationship between $(Y,X)$, in some applications we also collect an additional variable $W$ that is strongly correlated with $Y$. From a statistical point of view, making inference about $(Y,X)$ without using $W$ will lose efficiency. However, it is not trivial to incorporate the information of $W$ to infer $(Y,X)$. In this article, we propose a two-stage dimension reduction method for $(Y,X)$ that is able to utilize the additional information from $W$. The main idea is to confine the search space by constructing an envelope subspace for the target of interest. In the analysis of breast cancer data, the risk score constructed from the two-stage method separates patients with different survival experiences well. In the Pima data, the two-stage method requires fewer components to infer the diabetes status, while achieving higher classification accuracy than the conventional method.

Preprint, 2013, arXiv

Detection of Gene-Gene Interactions by Multistage Sparse and Low-Rank Regression

A daunting challenge faced by modern biological sciences is finding an efficient and computationally feasible approach to deal with the curse of high dimensionality. The problem becomes even more severe when the research focus is on interactions. To improve the performance, we propose a low-rank interaction model, where the interaction effects are modeled using a low-rank matrix. With parsimonious parameterization of interactions, the proposed model increases the stability and efficiency of statistical analysis. Built upon the low-rank model, we further propose an Extended Screen-and-Clean approach, based on the Screen and Clean (SC) method (Wasserman and Roeder, 2009; Wu et al., 2010), to detect gene-gene interactions. In particular, the screening stage utilizes a combination of a low-rank structure and a sparsity constraint in order to achieve higher power and higher selection-consistency probability. We demonstrate the effectiveness of the method using simulations and apply the proposed procedure to the warfarin dosage study. The data analysis identified main and interaction effects that would have been neglected using conventional methods.

Preprint, 2012, arXiv

A Two-Stage Dimension Reduction Method for Induced Responses and Its Applications

Researchers in the biological sciences nowadays often encounter the curse of high dimensionality, which many previously developed statistical models fail to overcome. To tackle this problem, sufficient dimension reduction aims to estimate the central subspace (CS), which contains all the necessary information supplied by the covariates regarding the response of interest. Subsequent statistical analysis can then be made in a lower-dimensional space while preserving relevant information. Oftentimes studies are interested in a certain transformation of the response (the induced response), instead of the original one, whose corresponding CS may vary. When estimating the CS of the induced response, existing dimension reduction methods may, however, suffer from inefficiency. In this article, we propose a more efficient two-stage estimation procedure to estimate the CS of an induced response. This approach is further extended to the case of censored responses. An application for combining multiple biomarkers is also illustrated. Simulation studies and two data examples provide further evidence of the usefulness of the proposed method.

Preprint, 2012, arXiv

Robust Independent Component Analysis via Minimum Divergence Estimation

Independent component analysis (ICA) has been shown to be useful in many applications. However, most ICA methods are sensitive to data contamination and outliers. In this article we introduce a general minimum U-divergence framework for ICA, which covers some standard ICA methods as special cases. Within the U-family we further focus on the gamma-divergence due to its desirable property of super robustness, which yields the proposed method, gamma-ICA. Statistical properties and technical conditions for the consistency of gamma-ICA are rigorously studied. In the limiting case, it leads to a necessary and sufficient condition for the consistency of MLE-ICA. This necessary and sufficient condition is weaker than the condition known in the literature. Since the parameter of interest in ICA is an orthogonal matrix, a geometrical algorithm based on gradient flows on the special orthogonal group is introduced to implement gamma-ICA. Furthermore, a data-driven selection for the gamma value, which is critical to the achievement of gamma-ICA, is developed. The performance, especially the robustness, of gamma-ICA in comparison with standard ICA methods is demonstrated through experimental studies using simulated data and image data.

Preprint, 2011, arXiv

Matrix Variate Logistic Regression Model with Application to EEG Data

Logistic regression has long been widely applied in biomedical research. In some applications, covariates of interest have a natural structure, such as being a matrix, at the time of collection. The rows and columns of the covariate matrix then have certain physical meanings, and they must contain useful information regarding the response. If we simply stack the covariate matrix as a vector and fit the conventional logistic regression model, relevant information can be lost, and the problem of inefficiency will arise. Motivated by these considerations, we propose in this paper the matrix variate logistic (MV-logistic) regression model. Advantages of the MV-logistic regression model include the preservation of the inherent matrix structure of covariates and the parsimony of parameters needed. In the EEG Database Data Set, we successfully extract the structural effects of the covariate matrix, and a high classification accuracy is achieved.
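
The parsimony argument in this abstract can be made concrete with a small parameter count. The sketch below assumes a rank-one bilinear logit of the form gamma + alpha^T X beta for a p x q covariate matrix X. That form, the function names, and the example dimensions are assumptions for illustration, not details confirmed by this record.

```python
from typing import Sequence


def vectorized_param_count(p: int, q: int) -> int:
    """Intercept plus one coefficient per entry of the stacked p x q matrix."""
    return 1 + p * q


def bilinear_param_count(p: int, q: int) -> int:
    """Intercept plus row effects (p) and column effects (q).

    One degree of freedom is removed because (alpha, beta) and
    (c * alpha, beta / c) describe the same model for any nonzero c.
    """
    return 1 + p + q - 1


def bilinear_logit(
    gamma: float,
    alpha: Sequence[float],
    X: Sequence[Sequence[float]],
    beta: Sequence[float],
) -> float:
    """Evaluate gamma + alpha^T X beta (assumed model form)."""
    # X @ beta: one score per row of X.
    row_scores = [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]
    # alpha^T (X @ beta), plus the intercept.
    return gamma + sum(a_i * s_i for a_i, s_i in zip(alpha, row_scores))


if __name__ == "__main__":
    p, q = 256, 64  # illustrative dimensions only
    print("vectorized:", vectorized_param_count(p, q))
    print("bilinear:  ", bilinear_param_count(p, q))
```

Under these assumptions the vectorized model grows with the product p * q while the bilinear form grows with the sum p + q, which is the gap the abstract refers to as parsimony.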

Preprint, 2011, arXiv

Nonparametric Methodology for the Time-Dependent Partial Area under the ROC Curve

To assess the classification accuracy of a continuous diagnostic result, the receiver operating characteristic (ROC) curve is commonly used in applications. The partial area under the ROC curve (pAUC) is one of the widely accepted summary measures due to its generality and ease of probability interpretation. In the field of life science, a direct extension of the pAUC into the time-to-event setting can be used to measure the usefulness of a biomarker for disease detection over time. Without using a trapezoidal rule, we propose nonparametric estimators, which are easily computed and have closed-form expressions, for the time-dependent pAUC. The asymptotic Gaussian processes of the estimators are established and the estimated variance-covariance functions are provided, which are essential in the construction of confidence intervals. The finite sample performance of the proposed inference procedures is investigated through a series of simulations. Our method is further applied to evaluate the classification ability of CD4 cell counts on patients' survival time in the AIDS Clinical Trials Group (ACTG) 175 study. In addition, the inferences can be generalized to compare the time-dependent pAUCs between patients who received prior antiretroviral therapy and those who did not.
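
To illustrate what a closed-form, trapezoid-free pAUC estimate can look like, here is a minimal sketch for the ordinary (not time-dependent) setting with fully observed case and control scores. It is not the estimator proposed in the preprint, which also handles time-to-event data. The function name `empirical_pauc` and the toy scores are hypothetical.

```python
import math
from typing import Sequence


def empirical_pauc(
    cases: Sequence[float], controls: Sequence[float], max_fpr: float
) -> float:
    """Empirical partial AUC over false-positive rates in [0, k / n0].

    k = floor(max_fpr * n0) is the number of controls allowed above the
    threshold. The estimate is the fraction of (case, control) pairs in
    which the case outscores one of the k highest-scoring controls.
    Ties are counted as non-concordant in this sketch.
    """
    if not cases or not controls:
        raise ValueError("both groups must be non-empty")
    if not 0.0 <= max_fpr <= 1.0:
        raise ValueError("max_fpr must lie in [0, 1]")

    n1, n0 = len(cases), len(controls)
    # Small tolerance guards against floating-point error in max_fpr * n0.
    k = math.floor(max_fpr * n0 + 1e-9)
    top_controls = sorted(controls, reverse=True)[:k]

    concordant = sum(1 for y in top_controls for x in cases if x > y)
    return concordant / (n0 * n1)


if __name__ == "__main__":
    case_scores = [3.0, 5.0, 7.0]
    control_scores = [1.0, 2.0, 4.0, 6.0]
    print(empirical_pauc(case_scores, control_scores, max_fpr=0.5))
    print(empirical_pauc(case_scores, control_scores, max_fpr=1.0))
```

With `max_fpr=1.0` the function reduces to the usual pairwise (Mann-Whitney) form of the full AUC, which is a convenient sanity check.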

Preprint, 2011, arXiv

On Multilinear Principal Component Analysis of Order-Two Tensors

Principal Component Analysis (PCA) is a commonly used tool for dimension reduction in analyzing high-dimensional data; Multilinear Principal Component Analysis (MPCA) has the potential to serve a similar function for analyzing tensor-structured data. MPCA and other tensor decomposition methods have proved effective at reducing dimensions in both real data analyses and simulation studies (Ye, 2005; Lu, Plataniotis and Venetsanopoulos, 2008; Kolda and Bader, 2009; Li, Kim and Altman, 2010). In this paper, we investigate MPCA's statistical properties and provide explanations for its advantages. Conventional PCA, which vectorizes the tensor data, may lead to inefficient and unstable prediction due to its extremely large dimensionality. On the other hand, MPCA, which tries to preserve the data structure, searches for low-dimensional multilinear projections and decreases the dimensionality efficiently. The asymptotic theories for order-two MPCA, including asymptotic distributions for principal components, associated projections and the explained variance, are developed. Finally, MPCA is shown to improve on conventional PCA in analyzing the Olivetti Faces data set, by constructing a more module-oriented basis for reconstructing the test faces.
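
The contrast between vectorized PCA and a multilinear projection can be stated compactly for matrix-valued (order-two) observations. The display below is a common formulation of order-two MPCA, offered as a sketch. The symbols $X_i$, $\bar{X}$, $A$, $B$, $p_0$ and $q_0$ are notation assumed here, and the preprint's exact formulation may differ.

```latex
% Order-two MPCA for matrices X_1, ..., X_n in R^{p x q}:
% find column-orthonormal A (p x p_0) and B (q x q_0) that retain
% the most variation after projecting rows and columns separately.
\max_{A,\,B}\ \sum_{i=1}^{n}
  \bigl\lVert A^{\top} (X_i - \bar{X})\, B \bigr\rVert_F^{2}
\quad \text{subject to} \quad
A^{\top} A = I_{p_0}, \qquad B^{\top} B = I_{q_0} .
```

Since $\operatorname{vec}(A^{\top} X B) = (B \otimes A)^{\top} \operatorname{vec}(X)$, the implied basis for the vectorized data is the Kronecker product $B \otimes A$. It is described by $p\,p_0 + q\,q_0$ entries, whereas an unstructured basis of the same size has $pq \cdot p_0 q_0$ entries.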
