Source author record

George Tseng

George Tseng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Applications, Genomics, Machine Learning, Methodology, Quantitative Methods

Catalog footprint

What is connected

3 works

5 topics

4 close collaborators

Preprint · 2022 · arXiv

On p-value combination of independent and frequent signals: asymptotic efficiency and Fisher ensemble

Combining p-values to integrate multiple effects is of long-standing interest in social science and biomedical research. In this paper, we revisit a classical scenario closely related to meta-analysis, which combines a relatively small (finite and fixed) number of p-values while the sample size for generating each p-value is large (asymptotically goes to infinity). We evaluate a list of traditional and recently developed modified Fisher's methods to investigate their asymptotic efficiencies and finite-sample numerical performance. The results show that the Fisher and adaptively weighted Fisher methods have top performance and complementary advantages across different proportions of true signals. Finally, we propose an ensemble method, namely Fisher ensemble, to combine the two top-performing Fisher-related methods using a robust truncated Cauchy ensemble approach. We show that Fisher ensemble achieves asymptotic Bahadur optimality and integrates the strengths of the Fisher and adaptively weighted Fisher methods in simulations. We subsequently extend Fisher ensemble to a variant with emphasized power for concordant effect size directions. A transcriptomic meta-analysis application confirms the theoretical and simulation conclusions, generates intriguing biomarker and pathway findings, and demonstrates the strengths and strategy of using the proposed Fisher ensemble methods.
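As a rough illustration of the two building blocks named in this abstract, the sketch below implements the classical Fisher combination and a plain equal-weight Cauchy combination for independent p-values. It is not the authors' Fisher ensemble: the adaptively weighted Fisher method and the truncation step are not reproduced, and the function names are hypothetical.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Fisher's method for K independent p-values.

    Under the null, -2 * sum(log p) follows a chi-square distribution
    with 2K degrees of freedom.
    """
    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # statistic divided by 2
    # Closed-form chi-square survival function for even degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{K-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


def cauchy_combined_pvalue(pvalues):
    """Plain (untruncated) Cauchy combination with equal weights.

    Assumption: this is the basic Cauchy combination test, used here only
    to show the idea of merging p-values through a Cauchy transform.
    """
    t = sum(math.tan((0.5 - p) * math.pi) for p in pvalues) / len(pvalues)
    return 0.5 - math.atan(t) / math.pi
```

Calling `fisher_combined_pvalue([0.01, 0.02, 0.03])` combines three study-level p-values into one. An ensemble in the spirit of the abstract would pass the p-values produced by two different combination methods to a Cauchy-type combiner.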

Preprint · 2022 · arXiv

Outcome-guided Sparse K-means for Disease Subtype Discovery via Integrating Phenotypic Data with High-dimensional Transcriptomic Data

The discovery of disease subtypes is an essential step for developing precision medicine, and disease subtyping via omics data has become a popular approach. While promising, subtypes obtained from existing approaches are not necessarily associated with clinical outcomes. With the rich clinical data along with the omics data in modern epidemiology cohorts, it is urgent to develop an outcome-guided clustering algorithm to fully integrate the phenotypic data with the high-dimensional omics data. Hence, we extended a sparse K-means method to an outcome-guided sparse K-means (GuidedSparseKmeans) method. A unified objective function was proposed, comprising (i) weighted K-means to perform sample clustering; (ii) lasso regularization to perform gene selection from the high-dimensional omics data; and (iii) incorporation of a phenotypic variable from the clinical dataset to facilitate biologically meaningful clustering results. By iteratively optimizing the objective function, we simultaneously obtain phenotype-related sample clustering results and gene selection results. We demonstrated the superior performance of GuidedSparseKmeans by comparing it with existing clustering methods in simulations and applications of high-dimensional transcriptomic data of breast cancer and Alzheimer's disease. Our algorithm has been implemented in an R package, which is publicly available on GitHub (https://github.com/LingsongMeng/GuidedSparseKmeans).
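The gene-selection step in sparse K-means, the method this abstract says was extended, is commonly written as a soft-thresholding update of per-gene weights under L1 and L2 constraints. The sketch below shows that generic update only. How GuidedSparseKmeans folds the phenotypic variable into the per-gene scores is not stated here, so `scores` is a hypothetical input and `sparse_weights` is an illustrative name, not part of the R package.

```python
import math


def sparse_weights(scores, l1_bound, tol=1e-10):
    """Maximize sum_j w_j * scores_j subject to ||w||_2 <= 1,
    ||w||_1 <= l1_bound and w_j >= 0.

    Assumptions: scores are non-negative (for example per-gene
    between-cluster sums of squares), l1_bound >= 1, and the largest
    score is unique.
    """
    def normalized(delta):
        # Soft-threshold the scores, then rescale to unit L2 norm.
        w = [max(s - delta, 0.0) for s in scores]
        norm = math.sqrt(sum(v * v for v in w))
        return [v / norm for v in w] if norm > 0 else w

    w = normalized(0.0)
    if sum(w) <= l1_bound:
        return w  # L1 constraint already satisfied, no thresholding needed

    # Binary search for the smallest threshold that meets the L1 bound.
    lo, hi = 0.0, max(scores)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(normalized(mid)) > l1_bound:
            lo = mid
        else:
            hi = mid
    return normalized(hi)
```

A smaller `l1_bound` drives more gene weights to exactly zero, which is what makes the clustering sparse. In an iterative scheme, this weight update would alternate with a weighted K-means step on the samples.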

Preprint · 2020 · arXiv

A sparse negative binomial mixture model for clustering RNA-seq count data

Clustering with variable selection is a challenging yet critical task for modern small-n-large-p data. Existing methods based on sparse Gaussian mixture models or sparse K-means provide solutions for continuous data. With the prevalence of RNA-seq technology and the lack of count data modeling for clustering, the current practice is to normalize count expression data into continuous measures and apply existing models with a Gaussian assumption. In this paper, we develop a negative binomial mixture model with lasso or fused lasso gene regularization to cluster samples (small n) with high-dimensional gene features (large p). The EM algorithm and the Bayesian information criterion are used for inference and for determining tuning parameters. The method is compared with existing methods using extensive simulations and two real transcriptomic applications in rat brain and breast cancer studies. The results show superior performance of the proposed count data model in clustering accuracy, feature selection and biological interpretation in pathways.
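To make the mixture idea concrete, the sketch below computes the E-step of a generic negative binomial mixture for one sample: the posterior probability of each cluster given the sample's gene counts. It is a simplified illustration rather than the paper's model. A single shared `size` (inverse dispersion) parameter is assumed, the lasso and fused lasso penalties of the M-step are omitted, and all names are hypothetical.

```python
import math


def nb_logpmf(y, mu, size):
    """Log pmf of a negative binomial with mean mu and size parameter
    (variance = mu + mu^2 / size)."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))


def e_step(counts, means, size, mix_props):
    """Posterior cluster probabilities for one sample.

    counts:    gene counts for the sample
    means:     means[k][g] is the mean of gene g in cluster k
    mix_props: mixing proportions, one per cluster
    """
    log_post = []
    for k, pi_k in enumerate(mix_props):
        loglik = sum(nb_logpmf(y, means[k][g], size)
                     for g, y in enumerate(counts))
        log_post.append(math.log(pi_k) + loglik)
    # Normalize in log space to avoid underflow with many genes.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]
```

A full EM fit would alternate this step with an M-step that updates the cluster means and mixing proportions. In the approach described above, that M-step would also carry the gene-level penalty that performs feature selection.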