Researcher profile

Shuangge Ma

Shuangge Ma contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
10 works
0 followers
6 topics
4 close collaborators

Published work

10 published items

Preprint, 2026, arXiv

Heterogeneous gene network estimation for single-cell transcriptomic data via a joint regularized deep neural network

Estimation of intracellular gene networks has been a critical component of single-cell transcriptomic data analysis, which can provide crucial insights into the complex interplay between genes, facilitating the discovery of the biological basis of human life at single-cell resolution. Despite notable achievements, existing methodologies often falter in their practicality, primarily due to their narrow focus on simplistic linear relationships and inadequate handling of cellular heterogeneity. To bridge these gaps, we propose a joint regularized deep neural network method incorporating Mahalanobis distance-based K-means clustering (JRDNN-KM) to estimate multiple networks for various cell subgroups simultaneously, accounting for both unknown cellular heterogeneity and zero inflation, and, more importantly, complex nonlinear relationships among genes. We introduce an innovative selection layer for network construction, along with hidden layers that include both shared and subgroup-specific neurons, to capture common patterns and subgroup-specific variations across networks. Applied to real single-cell transcriptomic data from multiple tissues and species, JRDNN-KM demonstrates higher accuracy and biological interpretability in network estimation, and more accurately identifies cell subgroups compared to current state-of-the-art methods. Building on network construction, we further find hub genes with important biological implications and modules with statistical enrichment of biological processes.

Preprint, 2023, arXiv

Locally sparse quantile estimation for a partially functional interaction model

Functional data analysis has been extensively conducted. In this study, we consider a partially functional model, under which some covariates are scalars and have linear effects, while some other variables are functional and have unspecified nonlinear effects. Significantly advancing from the existing literature, we consider a model with interactions between the functional and scalar covariates. To accommodate long-tailed error distributions which are not uncommon in data analysis, we adopt the quantile technique for estimation. To achieve more interpretable estimation, and to accommodate many practical settings, we assume that the functional covariate effects are locally sparse (that is, there exist subregions on which the effects are exactly zero), which naturally leads to a variable/model selection problem. We propose respecting the "main effect, interaction" hierarchy, which postulates that if a subregion has a nonzero effect in an interaction term, then its effect has to be nonzero in the corresponding main functional effect. For estimation, identification of local sparsity, and respect of the hierarchy, we propose a penalization approach. An effective computational algorithm is developed, and the consistency properties are rigorously established under mild regularity conditions. Simulation shows the practical effectiveness of the proposed approach. The analysis of the Tecator data further demonstrates its practical applicability. Overall, this study can deliver a novel and practically useful model and a statistically and numerically satisfactory estimation approach.
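The quantile technique mentioned in this abstract rests on the check (pinball) loss. The following is a minimal sketch of that loss only, assuming NumPy; the function names and toy data are illustrative and not taken from the paper, and the functional covariates, interactions, and sparsity penalty are omitted.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def best_constant(y, tau):
    """Hypothetical helper: the observed value minimizing the total check loss."""
    y = np.asarray(y, dtype=float)
    totals = [check_loss(y - c, tau).sum() for c in y]
    return y[int(np.argmin(totals))]

# Toy data with one extreme value standing in for a long-tailed error.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
median_fit = best_constant(y, tau=0.5)
```

At `tau = 0.5` the minimizer is the sample median, which is far less sensitive to the extreme value than the sample mean; this is the sense in which quantile-based estimation accommodates long-tailed errors.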

Preprint, 2022, arXiv

Composite Expectile Regression with Gene-environment Interaction

If the error distribution is heteroscedastic, it violates the assumptions of linear regression. Expectile regression is a powerful tool for estimating the conditional expectiles of a response variable in this setting. Since expectile regression at multiple levels has been well studied, we propose composite expectile regression, which combines different levels of expectile regression to improve efficacy. In this paper, we study sparse composite expectile regression in the high-dimensional setting. It is realized by implementing a coordinate descent algorithm. We also prove its selection and estimation consistency. Simulations are conducted to demonstrate its performance, which is comparable to or better than the alternatives. We apply the proposed method to analyze a real lung adenocarcinoma (LUAD) data set, investigating G-E interactions.
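As a rough illustration of the loss behind expectile regression, the sketch below implements the asymmetric squared loss and one plausible composite form. It assumes NumPy, and it assumes a shared slope vector with one intercept per expectile level; that composite form, the function names, and the toy data are assumptions for illustration, not the paper's formulation. The sparsity penalty and the coordinate descent algorithm are omitted.

```python
import numpy as np

def expectile_loss(residuals, tau):
    """Asymmetric squared loss: weight tau on r >= 0 and (1 - tau) on r < 0."""
    r = np.asarray(residuals, dtype=float)
    weights = np.where(r >= 0, tau, 1.0 - tau)
    return weights * r ** 2

def composite_expectile_loss(y, X, beta, intercepts, taus):
    """Average expectile loss over several levels (assumed form:
    shared slopes `beta`, one intercept per level)."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    total = 0.0
    for b0, tau in zip(intercepts, taus):
        total += expectile_loss(y - b0 - fitted, tau).mean()
    return total / len(taus)

def sample_expectile(y, tau, n_iter=100):
    """Sample tau-expectile computed by iteratively reweighted means."""
    y = np.asarray(y, dtype=float)
    m = y.mean()
    for _ in range(n_iter):
        w = np.where(y >= m, tau, 1.0 - tau)
        m = np.sum(w * y) / np.sum(w)
    return m

y = np.array([0.0, 1.0, 2.0, 10.0])
mid = sample_expectile(y, 0.5)   # coincides with the sample mean
high = sample_expectile(y, 0.9)  # pulled toward the upper tail
```

At level 0.5 the loss is an ordinary scaled squared loss, so the expectile reduces to the mean; other levels weight the two tails unequally, which is what makes combining several levels informative under heteroscedasticity.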

Preprint, 2022, arXiv

Statistical Methods for Accommodating Immortal Time: A Selective Review and Comparison

Epidemiologic studies and clinical trials with a survival outcome are often challenged by immortal time (IMT), a period of follow-up during which the survival outcome cannot occur because of the observed later treatment initiation. It has been well recognized that failing to properly accommodate IMT leads to biased estimation and misleading inference. Accordingly, a series of statistical methods have been developed, from the simplest by including or excluding IMT to various weightings and the more recent sequential methods. Our literature review suggests that the existing developments are often "scattered", and there is a lack of comprehensive review and direct comparison. To fill this knowledge gap and better introduce this important topic especially to biomedical researchers, we provide this review to comprehensively describe the available methods, discuss their advantages and disadvantages, and equally important, directly compare their performance via simulation and the analysis of the Stanford heart transplant data. The key observation is that the time-varying treatment modeling and sequential trial methods tend to provide unbiased estimation, while the other methods may result in substantial bias. We also provide an in-depth discussion on the interconnections with causal inference.

Preprint, 2021, arXiv

Gene-gene interaction analysis incorporating network information via a structured Bayesian approach

Increasing evidence has shown that gene-gene interactions have important effects on biological processes of human diseases. Due to the high dimensionality of genetic measurements, existing interaction analysis methods usually suffer from a lack of sufficient information and are still unsatisfactory. Biological networks have been massively accumulated, allowing researchers to identify biomarkers from a system perspective by utilizing network selection (consisting of functionally related biomarkers) as well as network structures. In the main-effect analysis, network information has been widely incorporated, leading to biologically more meaningful and more accurate estimates. However, there is still a big gap in the context of interaction analysis. In this study, we develop a novel structured Bayesian interaction analysis approach, effectively incorporating the network information. This study is among the first to identify gene-gene interactions with the assistance of network selection for phenotype prediction, while simultaneously accommodating the underlying network structures. It innovatively respects the multiple hierarchies among main effects, interactions, and networks. A Bayesian method is adopted, which has been shown to have multiple advantages over some other techniques. An efficient variational inference algorithm is developed to explore the posterior distribution. Extensive simulation studies demonstrate the practical superiority of the proposed approach. The analysis of TCGA data on melanoma and lung cancer leads to biologically sensible findings with satisfactory prediction accuracy and selection stability.

Preprint, 2020, arXiv

Gene-Environment Interaction: A Variable Selection Perspective

Gene-environment interactions have important implications for elucidating the genetic basis of complex diseases beyond the joint function of multiple genetic factors and their interactions (or epistasis). In the past, G$\times$E interactions have been mainly conducted within the framework of genetic association studies. The high dimensionality of G$\times$E interactions, due to the complicated form of environmental effects and presence of a large number of genetic factors including gene expressions and SNPs, has motivated the recent development of penalized variable selection methods for dissecting G$\times$E interactions, which has been ignored in the majority of published reviews on genetic interaction studies. In this article, we first survey existing overviews on both gene-environment and gene-gene interactions. Then, after a brief introduction to the variable selection methods, we review penalization and relevant variable selection methods in marginal and joint paradigms respectively under a variety of conceptual models. Discussions on strengths and limitations, as well as computational aspects of the variable selection methods tailored for G$\times$E studies, have also been provided.

Preprint, 2020, arXiv

Histopathological imaging features- versus molecular measurements-based cancer prognosis modeling

For most if not all cancers, prognosis is of significant importance, and extensive modeling research has been conducted. With the genetic nature of cancer, in the past two decades, multiple types of molecular data (such as gene expressions and DNA mutations) have been explored. More recently, histopathological imaging data, which is routinely collected in biopsy, has been shown to be informative for modeling prognosis. In this study, using the TCGA LUAD and LUSC data as a showcase, we examine and compare modeling lung cancer overall survival using gene expressions versus histopathological imaging features. High-dimensional regularization methods are adopted for estimation and selection. Our analysis shows that gene expressions have slightly better prognostic performance. In addition, most of the gene expressions are found to be weakly correlated with imaging features. It is expected that this study can provide some insight into utilizing the two types of important data in cancer prognosis modeling and into lung cancer overall survival.

Preprint, 2020, arXiv

Integrative Sparse Partial Least Squares

Partial least squares, as a dimension reduction method, has become increasingly important for its ability to deal with problems with a large number of variables. Since noisy variables may weaken the performance of the model, the sparse partial least squares (SPLS) technique has been proposed to identify important variables and generate more interpretable results. However, the small sample size of a single dataset limits the performance of conventional methods. An effective solution comes from gathering information from multiple comparable studies. Integrative analysis holds an important status among multi-dataset analyses. The main idea is to improve estimation results by assembling raw datasets and analyzing them jointly. In this paper, we develop an integrative SPLS (iSPLS) method using penalization based on the SPLS technique. The proposed approach consists of two penalties. The first penalty conducts variable selection under the context of integrative analysis; the second penalty, a contrasted one, is imposed to encourage the similarity of estimates across datasets and generate more reasonable and accurate results. Computational algorithms are provided. Simulation experiments are conducted to compare iSPLS with alternative approaches. The practical utility of iSPLS is shown in the analysis of two TCGA gene expression datasets.

Preprint, 2020, arXiv

Robust Bayesian variable selection for gene-environment interactions

Gene-environment (G$\times$E) interactions have important implications to elucidate the etiology of complex diseases beyond the main genetic and environmental effects. Outliers and data contamination in disease phenotypes of G$\times$E studies have been commonly encountered, leading to the development of a broad spectrum of robust regularization methods. Nevertheless, within the Bayesian framework, the issue has not been taken care of in existing studies. We develop a fully Bayesian robust variable selection method for G$\times$E interaction studies. The proposed Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection by accounting for structural sparsity. In particular, for the robust sparse group selection, the spike-and-slab priors have been imposed on both individual and group levels to identify important main and interaction effects robustly. An efficient Gibbs sampler has been developed to facilitate fast computation. Extensive simulation studies and analysis of both the diabetes data with SNP measurements from the Nurses' Health Study and TCGA melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives.

Preprint, 2020, arXiv

Robust Identification of Gene-Environment Interactions under High-Dimensional Accelerated Failure Time Models

For complex diseases, beyond the main effects of genetic (G) and environmental (E) factors, gene-environment (G-E) interactions also play an important role. Many of the existing G-E interaction methods conduct marginal analysis, which may not appropriately describe disease biology. Joint analysis methods have been developed, with most of the existing loss functions constructed based on likelihood. In practice, data contamination is not uncommon. Development of robust methods for interaction analysis that can accommodate data contamination is very limited. In this study, we consider censored survival data and adopt an accelerated failure time (AFT) model. An exponential squared loss is adopted to achieve robustness. A sparse group penalization approach, which respects the "main effects, interactions" hierarchy, is adopted for estimation and identification. Consistency properties are rigorously established. Simulation shows that the proposed method outperforms direct competitors. In data analysis, the proposed method makes biologically sensible findings.
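To make the robustness claim concrete, the sketch below compares the exponential squared loss, in its commonly used form 1 - exp(-r^2 / gamma), with the ordinary squared loss on residuals that include one contaminated value. It assumes NumPy; the tuning value `gamma = 4.0`, the function name, and the toy residuals are illustrative assumptions, and the AFT model, censoring weights, and sparse group penalty from the abstract are not reproduced.

```python
import numpy as np

def exponential_squared_loss(residuals, gamma):
    """Bounded loss 1 - exp(-r^2 / gamma); gamma > 0 controls the robustness."""
    r = np.asarray(residuals, dtype=float)
    return 1.0 - np.exp(-r ** 2 / gamma)

# Toy residuals: four well-behaved values and one contaminated observation.
residuals = np.array([0.1, -0.2, 0.3, -0.1, 50.0])

robust = exponential_squared_loss(residuals, gamma=4.0)
squared = residuals ** 2

# Share of the total loss contributed by the contaminated observation.
robust_share = robust[-1] / robust.sum()
squared_share = squared[-1] / squared.sum()
```

Because each observation contributes at most 1 to the exponential squared loss, the contaminated residual cannot dominate the objective the way it does under squared loss, where its share of the total is close to 1.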