Source author record

The Tien Mai

The Tien Mai appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Methodology · Applications · Machine Learning · Computation · math.ST · Statistics Theory · Quantitative Methods

Catalog footprint

What is connected

12 works

7 topics

4 close collaborators

preprint · 2026 · arXiv

Censored Graphical Horseshoe: Bayesian sparse precision matrix estimation with censored and missing data

Gaussian graphical models provide a powerful framework for studying conditional dependencies in multivariate data, with widespread applications spanning biomedical, environmental sciences, and other data-rich scientific domains. While the Graphical Horseshoe (GHS) method has emerged as a state-of-the-art Bayesian method for sparse precision matrix estimation, existing approaches assume fully observed data and thus fail in the presence of censoring or missingness, which are pervasive in real-world studies. In this paper, we develop the Censored Graphical Horseshoe (CGHS), a novel Bayesian framework that extends the GHS to censored and arbitrarily missing Gaussian data. By introducing a latent-variable representation, CGHS accommodates incomplete observations while retaining the adaptive global-local shrinkage properties of the Horseshoe prior. We derive efficient Gibbs samplers for posterior computation and establish new theoretical results on posterior behavior under censoring and missingness, filling a gap not addressed by frequentist Lasso-based methods. Through extensive simulations, we demonstrate that CGHS consistently improves estimation accuracy compared to penalized likelihood approaches. Our methods are implemented in the package GHScenmis, available on GitHub: https://github.com/tienmt/ghscenmis.
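The latent-variable idea behind Gibbs sampling with censored data can be illustrated with a deliberately small univariate sketch. This is not the CGHS sampler and not code from GHScenmis: it assumes a normal model with known unit variance, a flat prior on the mean, and left-censoring at a known detection limit, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for cdf and inverse cdf


def sample_below_limit(mu, limit, rng):
    """Draw z ~ N(mu, 1) conditioned on z < limit (inverse-CDF method)."""
    p_limit = STD_NORMAL.cdf(limit - mu)  # P(z < limit) under N(mu, 1)
    # Uniform draw on (0, p_limit); the guards keep inv_cdf inside (0, 1).
    u = min(max(rng.random() * p_limit, 1e-300), 1.0 - 1e-16)
    return mu + STD_NORMAL.inv_cdf(u)


def gibbs_censored_mean(observed, n_censored, limit,
                        n_iter=1500, burn_in=500, seed=0):
    """Posterior-mean estimate of mu under the toy assumptions stated above."""
    rng = random.Random(seed)
    n = len(observed) + n_censored
    sum_observed = sum(observed)
    mu = sum_observed / len(observed)  # crude starting value
    draws = []
    for it in range(n_iter):
        # Step 1: impute a latent value for each censored entry given mu.
        sum_latent = sum(sample_below_limit(mu, limit, rng)
                         for _ in range(n_censored))
        # Step 2: draw mu given the completed data (flat prior, variance 1).
        mu = rng.gauss((sum_observed + sum_latent) / n, 1.0 / math.sqrt(n))
        if it >= burn_in:
            draws.append(mu)
    return sum(draws) / len(draws)
```

Because every imputed value lies below the detection limit, the completed-data mean falls below the naive estimate that records each censored entry at the limit itself.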

preprint · 2026 · arXiv

Robust low-rank estimation with multiple binary responses using pairwise AUC loss

Multiple binary responses arise in many modern data-analytic problems. Although fitting separate logistic regressions for each response is computationally attractive, it ignores shared structure and can be statistically inefficient, especially in high-dimensional and class-imbalanced regimes. Low-rank models offer a natural way to encode latent dependence across tasks, but existing methods for binary data are largely likelihood-based and focus on pointwise classification rather than ranking performance. In this work, we propose a unified framework for learning with multiple binary responses that directly targets discrimination by minimizing a surrogate loss for the area under the ROC curve (AUC). The method aggregates pairwise AUC surrogate losses across responses while imposing a low-rank constraint on the coefficient matrix to exploit shared structure. We develop a scalable projected gradient descent algorithm based on truncated singular value decomposition. Exploiting the fact that the pairwise loss depends only on differences of linear predictors, we simplify computation and analysis. We establish non-asymptotic convergence guarantees, showing that, under suitable regularity conditions, the algorithm converges linearly up to the minimax-optimal statistical precision. Extensive simulation studies demonstrate that the proposed method is robust in challenging settings such as label switching and data contamination and consistently outperforms likelihood-based approaches.
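The remark that the pairwise loss depends only on differences of linear predictors can be made concrete for a single response. The sketch below is a minimal illustration, not the paper's estimator: the logistic surrogate is an assumption (the abstract does not name the surrogate), and the function names are hypothetical.

```python
import math


def _split_scores(scores, labels):
    """Separate scores into positives (label 1) and negatives (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    return pos, neg


def pairwise_auc_surrogate(scores, labels):
    """Average logistic surrogate over all (positive, negative) pairs."""
    pos, neg = _split_scores(scores, labels)
    total = 0.0
    for sp in pos:
        for sn in neg:
            # Only the difference sp - sn enters the loss; the term is small
            # when the positive example is ranked above the negative one.
            total += math.log1p(math.exp(-(sp - sn)))
    return total / (len(pos) * len(neg))


def empirical_auc(scores, labels):
    """Fraction of correctly ordered pairs, counting ties as one half."""
    pos, neg = _split_scores(scores, labels)
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))
```

Adding the same constant to every score leaves both quantities unchanged, which is why an intercept plays no role in a loss of this form.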

preprint · 2025 · arXiv

Robust reduced rank regression under heavy-tailed noise and missing data via non-convex penalization

Reduced rank regression (RRR) is a fundamental tool for modeling multiple responses through low-dimensional latent structures, offering both interpretability and strong predictive performance in high-dimensional settings. Classical RRR methods, however, typically rely on squared loss and Gaussian noise assumptions, rendering them sensitive to heavy-tailed errors, outliers, and data contamination. Moreover, the presence of missing data, common in modern applications, further complicates reliable low-rank estimation. In this paper, we propose a robust reduced rank regression framework that simultaneously addresses heavy-tailed noise, outliers, and missing data. Our approach combines a robust Huber loss with nonconvex spectral regularization, specifically the minimax concave penalty (MCP) and smoothly clipped absolute deviation (SCAD). Unlike convex nuclear-norm regularization, the proposed nonconvex penalties alleviate excessive shrinkage and enable more accurate recovery of the underlying low-rank structure. The method also accommodates missing data in the response matrix without requiring imputation. We develop an efficient proximal gradient algorithm based on alternating updates and tailored spectral thresholding. Extensive simulation studies demonstrate that the proposed methods substantially outperform nuclear-norm-based and non-robust alternatives under heavy-tailed noise and contamination. An application to a cancer cell line data set further illustrates the practical advantages of the proposed robust RRR framework. Our method is implemented in the R package rrpackrobust available at https://github.com/tienmt/rrpackrobust.
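Why a Huber loss is less sensitive to outliers than squared loss can be seen in a toy location problem. The sketch below is only an illustration under simplifying assumptions (scalar data, a fixed threshold `delta`, plain gradient descent with unit step); it is not the rrpackrobust algorithm, and the function names are hypothetical.

```python
def huber_loss(r, delta=1.0):
    """Quadratic for small residuals, linear for large ones."""
    a = abs(r)
    if a <= delta:
        return 0.5 * r * r
    return delta * (a - 0.5 * delta)


def huber_grad(r, delta=1.0):
    """Derivative of huber_loss in r: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, r))


def huber_location(data, delta=1.0, n_iter=200):
    """Minimize the mean Huber loss of (x - mu) over mu by gradient descent."""
    mu = 0.0
    for _ in range(n_iter):
        # The gradient of the mean loss in mu is -mean(huber_grad(x - mu)),
        # so a unit-step descent update adds the mean clipped residual.
        mu += sum(huber_grad(x - mu, delta) for x in data) / len(data)
    return mu
```

For the data `[0.9, 1.0, 1.0, 1.1, 100.0]` the sample mean is 20.8, whereas `huber_location` settles at 1.25, because the outlier's influence is capped at `delta`.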

preprint · 2025 · arXiv

Sparse classification with positive-confidence data in high dimensions

High-dimensional learning problems, where the number of features exceeds the sample size, often require sparse regularization for effective prediction and variable selection. While established for fully supervised data, these techniques remain underexplored in weak-supervision settings such as Positive-Confidence (Pconf) classification. Pconf learning utilizes only positive samples equipped with confidence scores, thereby avoiding the need for negative data. However, existing Pconf methods are ill-suited for high-dimensional regimes. This paper proposes a novel sparse-penalization framework for high-dimensional Pconf classification. We introduce estimators using convex (Lasso) and non-convex (SCAD, MCP) penalties to address shrinkage bias and improve feature recovery. Theoretically, we establish estimation and prediction error bounds for the L1-regularized Pconf estimator, proving it achieves near minimax-optimal sparse recovery rates under a Restricted Strong Convexity condition. To solve the resulting composite objective, we develop an efficient proximal gradient algorithm. Extensive simulations demonstrate that our proposed methods achieve predictive performance and variable selection accuracy comparable to fully supervised approaches, effectively bridging the gap between weak supervision and high-dimensional statistics.
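The shrinkage bias that motivates the non-convex penalties shows up directly in the scalar thresholding rules a proximal gradient method applies coordinate by coordinate. The sketch below gives the standard unit-step rules for Lasso, MCP and SCAD; the default tuning values `gamma=3.0` and `a=3.7` are conventional choices assumed here, not values taken from the paper, and the function names are hypothetical.

```python
import math


def soft_threshold(z, lam):
    """Lasso (L1) rule: every surviving value is shrunk by lam."""
    if abs(z) <= lam:
        return 0.0
    return math.copysign(abs(z) - lam, z)


def mcp_threshold(z, lam, gamma=3.0):
    """MCP (firm) thresholding for a unit step size; requires gamma > 1."""
    m = abs(z)
    if m <= lam:
        return 0.0
    if m <= gamma * lam:
        return math.copysign((m - lam) / (1.0 - 1.0 / gamma), z)
    return z  # large values are left unshrunk


def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding for a unit step size; requires a > 2."""
    m = abs(z)
    if m <= 2.0 * lam:
        return math.copysign(max(m - lam, 0.0), z)
    if m <= a * lam:
        return ((a - 1.0) * z - math.copysign(a * lam, z)) / (a - 2.0)
    return z  # large values are left unshrunk
```

With `lam = 1.0`, an input of 5.0 is reduced to 4.0 by the Lasso rule but returned unchanged by the MCP and SCAD rules, while all three map 0.5 to zero.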

preprint · 2022 · arXiv

Optimal quasi-Bayesian reduced rank regression with incomplete response

The aim of reduced rank regression is to connect multiple response variables to multiple predictors. This model is very popular, especially in biostatistics where multiple measurements on individuals can be re-used to predict multiple outputs. Unfortunately, there are often missing data in such datasets, making it difficult to use standard estimation tools. In this paper, we study the problem of reduced rank regression where the response matrix is incomplete. We propose a quasi-Bayesian approach to this problem, in the sense that the likelihood is replaced by a quasi-likelihood. We provide a tight oracle inequality, proving that our method is adaptive to the rank of the coefficient matrix. We describe a Langevin Monte Carlo algorithm for the computation of the posterior mean. Numerical comparisons on synthetic and real data show that our method is competitive with the state-of-the-art, where the rank is chosen by cross-validation, and sometimes leads to an improvement.
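Approximating a posterior mean with Langevin Monte Carlo can be sketched in one dimension. The example below uses the unadjusted Langevin update on a toy Gaussian target chosen purely for illustration; it is not the matrix-valued sampler described in the paper, and the function names, step size and target are assumptions.

```python
import math
import random


def unadjusted_langevin(grad_neg_log_density, x0, step, n_iter, burn_in, seed=0):
    """Iterate x <- x - step * grad U(x) + sqrt(2 * step) * N(0, 1)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for it in range(n_iter):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_neg_log_density(x) + math.sqrt(2.0 * step) * noise
        if it >= burn_in:
            samples.append(x)  # keep draws only after the burn-in period
    return samples


def toy_grad(x):
    """Gradient of U(x) = (x - 2)^2 / 2, the negative log-density of N(2, 1)."""
    return x - 2.0


def posterior_mean_estimate(samples):
    """Monte Carlo estimate of the mean: the average of the retained draws."""
    return sum(samples) / len(samples)
```

Without a Metropolis correction the chain targets a slightly perturbed distribution, so the step size trades discretization bias against mixing speed.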

preprint · 2022 · arXiv

PAC-Bayesian Matrix Completion with a Spectral Scaled Student Prior

We study the problem of matrix completion in this paper. A spectral scaled Student prior is exploited to favour the underlying low-rank structure of the data matrix. We provide a thorough theoretical investigation for our approach through PAC-Bayesian bounds. More precisely, our PAC-Bayesian approach enjoys a minimax-optimal oracle inequality which guarantees that our method works well under model misspecification and under a general sampling distribution. Interestingly, we also provide efficient gradient-based sampling implementations for our approach by using Langevin Monte Carlo. More specifically, we show that our algorithms are significantly faster than the Gibbs sampler in this problem. To illustrate the attractive features of our inference strategy, some numerical simulations are conducted and an application to image inpainting is demonstrated.

preprint · 2021 · arXiv

Boosting heritability: estimating the genetic component of phenotypic variation with multiple sample splitting

Background: Heritability is a central measure in genetics quantifying how much of the variability observed in a trait is attributable to genetic differences. Existing methods for estimating heritability are most often based on random-effect models, typically for computational reasons. The alternative of using a fixed-effect model has received much more limited attention in the literature. Results: In this paper, we propose a generic strategy for heritability inference, termed "boosting heritability", by combining the advantageous features of different recent methods to produce an estimate of the heritability with a high-dimensional linear model. Boosting heritability uses in particular a multiple sample splitting strategy which leads in general to a stable and accurate estimate. We use both simulated data and real antibiotic resistance data from a major human pathogen, Streptococcus pneumoniae, to demonstrate the attractive features of our inference strategy. Conclusions: Boosting is shown to offer a reliable and practically useful tool for inference about heritability.

preprint · 2021 · arXiv

Efficient Bayesian reduced rank regression using Langevin Monte Carlo approach

The problem of Bayesian reduced rank regression is considered in this paper. We propose, for the first time, to use the Langevin Monte Carlo method in this problem. A spectral scaled Student prior distribution is used to exploit the underlying low-rank structure of the coefficient matrix. We show that our algorithms are significantly faster than the Gibbs sampler in high-dimensional settings. Simulation results show that our proposed algorithms for Bayesian reduced rank regression are comparable to the state-of-the-art method where the rank is chosen by cross-validation.

preprint · 2021 · arXiv

On regret bounds for continual single-index learning

In this paper, we generalize the single-index model to the context of continual learning, in which a learner is challenged with a sequence of tasks one by one and the dataset of each task is revealed in an online fashion. We propose a randomized strategy that is able to learn a common single-index (meta-parameter) for all tasks and a specific link function for each task. The common single-index allows the information gained from previous tasks to be transferred to a new one. We provide a rigorous theoretical analysis of our proposed strategy by proving regret bounds under different assumptions on the loss function.

preprint · 2021 · arXiv

Understanding the population structure correction regression

Although genome-wide association studies (GWAS) on complex traits have achieved great successes, the current leading GWAS approaches simply test each genotype-phenotype association separately for each genetic variant. Curiously, the statistical properties of these approaches are not known when a joint model for all genetic variants is considered. Here we advance the understanding of the statistical properties of the "population structure correction" (PSC) approach, a standard univariate approach in GWAS. We further propose and analyse a correction to the PSC approach, termed "corrected population correction" (CPC). Together with the theoretical results, numerical simulations show that CPC is always comparable to or better than PSC, with a dramatic improvement in some special cases.

preprint · 2019 · arXiv

Composite local low-rank structure in learning drug sensitivity

The molecular characterization of tumor samples by multiple omics data sets of different types or modalities (e.g. gene expression, mutation, CpG methylation) has become an invaluable source of information for assessing the expected performance of individual drugs and their combinations. Merging the relevant information from the omics data modalities provides the statistical basis for determining suitable therapies for specific cancer patients. Different data modalities may each have their specific structures that need to be taken into account during inference. In this paper, we assume that each omics data modality has a low-rank structure with only a few relevant features that affect the prediction, and we propose to use a composite local nuclear norm penalization for learning drug sensitivity. Numerical results show that the composite low-rank structure can improve the prediction performance compared to using a global low-rank approach or elastic net regression.

preprint · 2015 · arXiv

A Bayesian Approach for Noisy Matrix Completion: Optimal Rate under General Sampling Distribution

Bayesian methods for low-rank matrix completion with noise have been shown to be very efficient computationally. While the behaviour of penalized minimization methods is well understood both from the theoretical and computational points of view in this problem, the theoretical optimality of Bayesian estimators has not been explored yet. In this paper, we propose a Bayesian estimator for matrix completion under a general sampling distribution. We also provide an oracle inequality for this estimator. This inequality proves that, whatever the rank of the matrix to be estimated, our estimator reaches the minimax-optimal rate of convergence (up to a logarithmic factor). We end the paper with a short simulation study.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint