Source author record

Sandra E. Safo

Sandra E. Safo appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology, Applications, Machine Learning, Quantitative Methods

Catalog footprint

What is connected

5 works

4 topics

4 close collaborators

Preprint, 2021, arXiv

sJIVE: Supervised Joint and Individual Variation Explained

Analyzing multi-source data, which are multiple views of data on the same subjects, has become increasingly common in molecular biomedical research. Recent methods have sought to uncover underlying structure and relationships within and/or between the data sources, and other methods have sought to build a predictive model for an outcome using all sources. However, existing methods that do both are presently limited because they either (1) consider only data structure shared by all datasets while ignoring structures unique to each source, or (2) extract underlying structures first without considering the outcome. We propose a method called supervised joint and individual variation explained (sJIVE) that can simultaneously (1) identify shared (joint) and source-specific (individual) underlying structure and (2) build a linear prediction model for an outcome using these structures. These two components are weighted to compromise between explaining variation in the multi-source data and in the outcome. Simulations show that sJIVE outperforms existing methods when large amounts of noise are present in the multi-source data. An application to data from the COPDGene study reveals gene expression and proteomic patterns that are predictive of lung function. Functions to perform sJIVE are included in the R.JIVE package, available online at http://github.com/lockEF/r.jive .

Preprint, 2020, arXiv

Bayesian Integrative Analysis and Prediction with Application to Atherosclerosis Cardiovascular Disease

Cardiovascular diseases (CVD), including atherosclerosis CVD (ASCVD), are multifactorial diseases that present a major economic and social burden worldwide. Tremendous efforts have been made to understand traditional risk factors for ASCVD, but these risk factors account for only about half of all cases of ASCVD. There remains a critical need to identify nontraditional risk factors (e.g., genetic variants, genes) contributing to ASCVD. Further, incorporating functional knowledge in prediction models has the potential to reveal pathways associated with disease risk. We propose Bayesian hierarchical factor analysis models that associate multiple omics data, predict a clinical outcome, allow for prior functional information, and can accommodate clinical covariates. The models, motivated by available data and the need for other risk factors of ASCVD, are used for the integrative analysis of clinical, demographic, and multi-omics data to identify genetic variants, genes, and gene pathways potentially contributing to 10-year ASCVD risk in healthy adults. Our findings revealed several genetic variants, genes, and gene pathways that were highly associated with ASCVD risk. Interestingly, some of these have been implicated in CVD risk. The others could be explored for their potential roles in CVD. Our findings underscore the merit of joint association and prediction models.

Preprint, 2020, arXiv

Sparse Linear Discriminant Analysis for Multi-view Structured Data

Classification methods that leverage the strengths of data from multiple sources (multi-view data) simultaneously have enormous potential to yield more powerful findings than two-step methods: association followed by classification. We propose two methods, sparse integrative discriminant analysis (SIDA) and SIDA with incorporation of network information (SIDANet), for joint association and classification studies. The methods consider the overall association between multi-view data, and the separation within each view, in choosing discriminant vectors that are associated and optimally separate subjects into different classes. SIDANet is among the first methods to incorporate prior structural information in joint association and classification studies. It uses the normalized Laplacian of a graph to smooth coefficients of predictor variables, thus encouraging selection of predictors that are connected and behave similarly. We demonstrate the effectiveness of our methods on a set of synthetic and real datasets. Our findings underscore the benefit of joint association and classification methods if the goal is to correlate multi-view data and to perform classification.
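The abstract above mentions smoothing coefficients with the normalized Laplacian of a graph. The following is a minimal sketch of that general idea only, assuming the common definition L = I - D^{-1/2} A D^{-1/2} and a quadratic penalty of the form beta' L beta. It is not the SIDANet implementation, and the function names `normalized_laplacian` and `smoothness_penalty` and the toy path graph are hypothetical.

```python
import numpy as np


def normalized_laplacian(adjacency: np.ndarray) -> np.ndarray:
    """Return I - D^{-1/2} A D^{-1/2} for a symmetric adjacency matrix.

    Isolated nodes (degree 0) get an all-zero row and column.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1)
    connected = degrees > 0
    inv_sqrt_deg = np.zeros_like(degrees)
    inv_sqrt_deg[connected] = 1.0 / np.sqrt(degrees[connected])
    scaled = inv_sqrt_deg[:, None] * adjacency * inv_sqrt_deg[None, :]
    return np.diag(connected.astype(float)) - scaled


def smoothness_penalty(coefs: np.ndarray, laplacian: np.ndarray) -> float:
    """Quadratic form coefs' L coefs; small when connected predictors agree."""
    coefs = np.asarray(coefs, dtype=float)
    return float(coefs @ laplacian @ coefs)


# Toy graph (assumption): three predictors connected in a path 0 - 1 - 2.
path_graph = np.array([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])
laplacian = normalized_laplacian(path_graph)

# Coefficients proportional to sqrt(degree) are "smooth" under this penalty.
smooth = np.sqrt(path_graph.sum(axis=1))
# Flipping the sign of the middle coefficient makes neighbours disagree.
rough = smooth * np.array([1.0, -1.0, 1.0])

print(smoothness_penalty(smooth, laplacian))  # approximately zero
print(smoothness_penalty(rough, laplacian))   # strictly positive
```

Adding such a penalty to a sparse objective favours coefficient vectors in which connected predictors take similar degree-scaled values, which is one way to read "connected and behave similarly" in the abstract.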

Preprint, 2016, arXiv

Integrative analysis of transcriptomic and metabolomic data via sparse canonical correlation analysis with incorporation of biological information

Integrative analyses of different high-dimensional data types are becoming increasingly popular. Similarly, incorporating prior functional relationships among variables in data analysis has been a topic of increasing interest, as it helps elucidate the mechanisms underlying complex diseases. In this paper, the goal is to assess association between transcriptomic and metabolomic data from a Predictive Health Institute (PHI) study including healthy adults at high risk of developing cardiovascular diseases. To this end, we develop statistical methods for identifying sparse structure in canonical correlation analysis (CCA) with incorporation of biological/structural information. Our proposed methods use prior network structural information among genes and among metabolites to guide selection of relevant genes and metabolites in sparse CCA, providing insight into the molecular underpinnings of cardiovascular disease. Our simulations demonstrate that the structured sparse CCA methods outperform several existing sparse CCA methods in selecting relevant genes and metabolites when structural information is informative, and are robust to mis-specified structural information. Our analysis of the PHI study reveals that a number of genes and metabolic pathways, including some known to be associated with cardiovascular diseases, are enriched in the subset of genes and metabolites selected by our proposed approach.

Preprint, 2016, arXiv

Sparse Generalized Eigenvalue Problem with Application to Canonical Correlation Analysis for Integrative Analysis of Methylation and Gene Expression Data

We present a method for individual and integrative analysis of high dimension, low sample size data that capitalizes on the recurring theme in multivariate analysis of projecting higher dimensional data onto a few meaningful directions that are solutions to a generalized eigenvalue problem. We propose a general framework, called SELP (Sparse Estimation with Linear Programming), with which one can obtain a sparse estimate for a solution vector of a generalized eigenvalue problem. We demonstrate the utility of SELP on canonical correlation analysis for an integrative analysis of methylation and gene expression profiles from a breast cancer study, and we identify some genes known to be associated with breast carcinogenesis, which indicates that the proposed method is capable of generating biologically meaningful insights. Simulation studies suggest that the proposed method performs competitively in comparison with some existing methods in identifying true signals in various underlying covariance structures.
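To illustrate the generalized eigenvalue view of canonical correlation analysis mentioned in the abstract above, here is a minimal sketch of ordinary (dense, non-sparse) CCA solved as Sxy Syy^{-1} Syx a = rho^2 Sxx a. It is not SELP and produces no sparsity. The function name `first_canonical_pair`, the small `ridge` term added for numerical stability, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def first_canonical_pair(x: np.ndarray, y: np.ndarray, ridge: float = 1e-8):
    """Return (a, b, rho): the first canonical directions and correlation.

    Solves the generalized eigenproblem Sxy Syy^{-1} Syx a = rho^2 Sxx a
    by whitening with the Cholesky factor of Sxx. Assumes rho > 0.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    sxx = x.T @ x / (n - 1) + ridge * np.eye(x.shape[1])
    syy = y.T @ y / (n - 1) + ridge * np.eye(y.shape[1])
    sxy = x.T @ y / (n - 1)

    # Left-hand matrix of the generalized eigenproblem.
    m = sxy @ np.linalg.solve(syy, sxy.T)

    # Whitening: with Sxx = C C', form C^{-1} M C^{-T}, an ordinary
    # symmetric eigenproblem with the same eigenvalues.
    chol = np.linalg.cholesky(sxx)
    tmp = np.linalg.solve(chol, m)
    whitened = np.linalg.solve(chol, tmp.T).T
    whitened = (whitened + whitened.T) / 2.0  # remove rounding asymmetry

    eigvals, eigvecs = np.linalg.eigh(whitened)  # ascending eigenvalues
    rho = float(np.sqrt(max(eigvals[-1], 0.0)))

    # Map back to the original coordinates; a' Sxx a = 1 by construction.
    a = np.linalg.solve(chol.T, eigvecs[:, -1])
    b = np.linalg.solve(syy, sxy.T @ a)
    b = b / np.sqrt(b @ syy @ b)  # scale so that b' Syy b = 1
    return a, b, rho


# Simulated two-view data (assumption): one latent signal shared by the
# first column of each view, plus independent noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
x = rng.normal(scale=0.5, size=(200, 4))
y = rng.normal(scale=0.5, size=(200, 3))
x[:, 0] += latent
y[:, 0] += latent

a, b, rho = first_canonical_pair(x, y)
sample_corr = np.corrcoef(x @ a, y @ b)[0, 1]
print(rho, sample_corr)  # the two values should agree closely
```

Sparse approaches such as the one described in the abstract replace this dense solution with an estimate in which most entries of the direction vectors are zero, which matters when variables far outnumber samples.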
