Source author record

Le Bao

Le Bao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Applications, Methodology, Genomics, Machine Learning

Catalog footprint

What is connected

12 works

4 topics

4 close collaborators

preprint, 2026, arXiv

A Bayesian Multi-State Data Integration Approach for Estimating County-level Prevalence of Opioid Misuse in the United States

Drug overdose deaths, including from opioids, remain a significant public health threat to the United States (US). To abate the harms of opioid misuse, understanding its prevalence at the local level is crucial for stakeholders in communities to develop response strategies that effectively use limited resources. Although there exist several state-specific studies that provide county-level prevalence estimates, such estimates are not widely available across the country, as the datasets used in these studies are not always readily available in other states, which has limited the wider application of existing models. To fill this gap, we propose a Bayesian multi-state data integration approach that fully utilizes publicly available data sources to estimate county-level opioid misuse prevalence for all counties in the US. The hierarchical structure jointly models opioid misuse prevalence and overdose death outcomes, leverages existing county-level prevalence estimates in limited states and state-level estimates from national surveys, and accounts for heterogeneity across counties and states with counties' covariates and mixed effects. Furthermore, our parsimonious and generalizable modeling framework employs a horseshoe+ prior to flexibly shrink coefficients and prevent overfitting, ensuring adaptability as new county-level prevalence data in additional states become available. Using real-world data, our model shows high estimation accuracy through cross-validation and provides nationwide county-level estimates of opioid misuse for the first time.

preprint, 2022, arXiv

Causal Structural Learning on MPHIA Individual Dataset

The Population-based HIV Impact Assessment (PHIA) is an ongoing project that conducts nationally representative HIV-focused surveys for measuring national and regional progress toward UNAIDS' 90-90-90 targets, the primary strategy to end the HIV epidemic. We believe the PHIA survey offers a unique opportunity to better understand the key factors that drive the HIV epidemics in the most affected countries in sub-Saharan Africa. In this article, we propose a novel causal structural learning algorithm to discover important covariates and potential causal pathways for 90-90-90 targets. Existing constraint-based causal structural learning algorithms are quite aggressive in edge removal. The proposed algorithm preserves more information about important features and potential causal pathways. It is applied to the Malawi PHIA (MPHIA) data set and leads to interesting results. For example, it discovers age and condom usage to be important for female HIV awareness; the number of sexual partners to be important for male HIV awareness; and knowing the travel time to HIV care facilities leads to a higher chance of being treated for both females and males. We further compare and validate the proposed algorithm using BIC and using Monte Carlo simulations, and show that the proposed algorithm achieves improvement in true positive rates in important feature discovery over existing algorithms.

preprint, 2022, arXiv

Information Borrowing in Regression Models

Model development often takes data structure, subject matter considerations, model assumptions, and goodness of fit into consideration. To diagnose issues with any of these factors, it can be helpful to understand regression model estimates at a more granular level. We propose a new method for decomposing point estimates from a regression model via weights placed on data clusters. The weights are informed only by the model specification and data availability and thus can be used to explicitly link the effects of data imbalance and model assumptions to actual model estimates. The weight matrix has been understood in linear models as the hat matrix in the existing literature. We extend it to Bayesian hierarchical regression models that incorporate prior information and complicated dependence structures through the covariance among random effects. We show that the model weights, which we call borrowing factors, generalize shrinkage and information borrowing to all regression models. In contrast, the focus of the hat matrix has been mainly on the diagonal elements indicating the amount of leverage. We also provide metrics that summarize the borrowing factors and are practically useful. We present the theoretical properties of the borrowing factors and associated metrics and demonstrate their usage in two examples. By explicitly quantifying borrowing and shrinkage, researchers can better incorporate domain knowledge and evaluate model performance and the impacts of data properties such as data imbalance or influential points.
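As a small illustration of the linear-model case mentioned in this abstract, the sketch below builds the hat matrix for a simple regression with an intercept and one covariate, using the closed form h_ij = 1/n + (x_i - x_bar)(x_j - x_bar) / S_xx. Each fitted value is then a weighted sum of the observed responses. This shows only the classical hat matrix, not the borrowing factors the paper proposes; the function names and data are hypothetical.

```python
def hat_matrix(x):
    """Hat matrix for the model y = a + b*x, via the closed-form entries."""
    n = len(x)
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must contain at least two distinct values")
    return [[1.0 / n + (xi - x_bar) * (xj - x_bar) / s_xx for xj in x]
            for xi in x]


def fitted_values(x, y):
    """Fitted values as weighted sums of y, with weights from the hat matrix rows."""
    h = hat_matrix(x)
    return [sum(h_ij * y_j for h_ij, y_j in zip(row, y)) for row in h]


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]

H = hat_matrix(x)
leverages = [H[i][i] for i in range(len(x))]  # diagonal elements (leverage)
fits = fitted_values(x, y)
```

Each row of `H` sums to one because the model includes an intercept, and the leverages sum to the number of regression coefficients (two here). The off-diagonal entries show how much each observation contributes to the fitted values of the others.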

preprint, 2020, arXiv

A Joint Spatial Conditional Auto-Regressive Model for Estimating HIV Prevalence Rates Among Key Populations

Ending the HIV/AIDS pandemic is among the Sustainable Development Goals for the next decade. In order to overcome the gap between the need for care and the available resources, better understanding of HIV epidemics is needed to guide policy decisions, especially for key populations that are at higher risk for HIV infection. Accurate HIV epidemic estimates for key populations have been difficult to obtain because their HIV surveillance data are very limited. In this paper, we propose a joint spatial conditional auto-regressive model for estimating HIV prevalence rates among key populations. Our model borrows information from both neighboring locations and dependent populations. As illustrated in the real data analysis, it provides more accurate estimates than independently fitting the sub-epidemic for each key population. In addition, we provide a study to reveal the conditions under which our proposal gives better predictions. The study combines theoretical investigation and numerical study, revealing the strengths and limitations of our proposal.

preprint, 2020, arXiv

Evaluating the relative contribution of data sources in a Bayesian analysis with the application of estimating the size of hard to reach populations

When using multiple data sources in an analysis, it is important to understand the influence of each data source on the analysis and the consistency of the data sources with each other and the model. We suggest the use of a retrospective value of information framework in order to address such concerns. Value of information methods can be computationally difficult. We illustrate the use of computational methods that allow these methods to be applied even in relatively complicated settings. In illustrating the proposed methods, we focus on an application in estimating the size of hard to reach populations. Specifically, we consider estimating the number of injection drug users in Ukraine by combining all available data sources spanning over half a decade and numerous sub-national areas in Ukraine. This application is of interest to public health researchers because this hard to reach population plays a large role in the spread of HIV. We apply a Bayesian hierarchical model and evaluate the contribution of each data source in terms of absolute influence, expected influence, and level of surprise. Finally, we apply value of information methods to inform suggestions on future data collection.

preprint, 2020, arXiv

What Can We Learn from the Travelers Data in Detecting Disease Outbreaks -- A Case Study of the COVID-19 Epidemic

Background: Travel is a potent force in the emergence of disease. We discussed how the traveler case reports could aid in a timely detection of a disease outbreak. Methods: Using the traveler data, we estimated a few indicators of the epidemic that affected decision making and policy, including the exponential growth rate, the doubling time, and the probability of severe cases exceeding the hospital capacity, in the initial phase of the COVID-19 epidemic in multiple countries. We imputed the arrival dates when they were missing. We compared the estimates from the traveler data to the ones from domestic data. We quantitatively evaluated the influence of each case report and of knowing the arrival date on the estimation. Findings: We estimated the travel origin's daily exponential growth rate and examined the date from which the growth rate was consistently above 0.1 (equivalent to doubling time < 7 days). We found those dates were very close to the dates on which critical decisions were made, such as city lock-downs and national emergency announcements. Using only the traveler data, if the assumed epidemic start date was relatively accurate and the traveler sample was representative of the general population, the growth rate estimated from the traveler data was consistent with the domestic data. We also discussed situations in which the traveler data could lead to biased estimates. From the data influence study, we found more recent travel cases had a larger influence on each day's estimate, and the influence of each case report got smaller as more cases became available. We provided the minimum number of exported cases needed to determine whether the local epidemic growth rate was above a certain level, and developed a user-friendly Shiny App to accommodate various scenarios.
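The equivalence quoted in this abstract between a daily growth rate of 0.1 and a doubling time under 7 days follows from the standard relation for exponential growth, doubling time = ln(2) / r. The short sketch below works through that arithmetic; the function name is hypothetical and the code is not taken from the paper or its Shiny App.

```python
import math


def doubling_time(growth_rate: float) -> float:
    """Doubling time in days for a daily exponential growth rate r: ln(2) / r."""
    if growth_rate <= 0:
        raise ValueError("growth_rate must be positive")
    return math.log(2) / growth_rate


# A daily growth rate of 0.1 gives a doubling time just under 7 days.
print(round(doubling_time(0.1), 2))  # 6.93
```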

preprint, 2017, arXiv

Assignment of endogenous retrovirus integration sites using a mixture model

Structural variation occurs in the genomes of individuals because of the different positions occupied by repetitive genome elements like endogenous retroviruses, or ERVs. The presence or absence of ERVs can be determined by identifying the junction with the host genome using high-throughput sequence technology and a clustering algorithm. The resulting data give the number of sequence reads assigned to each ERV-host junction sequence for each sampled individual. Variability in the number of reads from an individual integration site makes it difficult to determine whether a site is present for low read counts. We present a novel two-component mixture of negative binomial distributions to model these counts and assign a probability that a given ERV is present in a given individual. We explain how our approach is superior to existing alternatives, including another form of two-component mixture model and the much more common approach of selecting a threshold count for declaring the presence of an ERV. We apply our method to a data set of ERV integrations in mule deer [Odocoileus hemionus], a species for which no genomic resources are available, and demonstrate that the discovered patterns of shared integration sites contain information about animal relatedness.
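To make the idea of assigning a presence probability from read counts concrete, the sketch below applies Bayes' rule to a generic two-component negative binomial mixture: one component for sites where the ERV is absent (low counts) and one where it is present (higher counts). It is a minimal illustration under assumed parameters, not the specific mixture form or fitting procedure of the paper; the function names, the mean/size parameterization, and all numeric values are hypothetical.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log pmf of a negative binomial with the given mean and dispersion (size)."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mean))
            + k * math.log(mean / (size + mean)))


def prob_present(k, weight_present, absent, present):
    """Posterior probability that a site with read count k is a true integration.

    `absent` and `present` are (mean, size) pairs for the two components;
    `weight_present` is the prior mixing weight of the 'present' component.
    """
    log_p1 = math.log(weight_present) + nb_log_pmf(k, *present)
    log_p0 = math.log(1.0 - weight_present) + nb_log_pmf(k, *absent)
    m = max(log_p0, log_p1)  # subtract the max for numerical stability
    w1, w0 = math.exp(log_p1 - m), math.exp(log_p0 - m)
    return w1 / (w0 + w1)


# Hypothetical, already-estimated parameters (not fitted to any real data).
absent_params = (1.0, 1.0)    # mean 1 read, size 1
present_params = (50.0, 2.0)  # mean 50 reads, size 2

for count in (0, 5, 40):
    p = prob_present(count, 0.5, absent_params, present_params)
    print(count, round(p, 3))
```

Unlike a fixed threshold on the read count, this gives a graded probability for intermediate counts, which is the contrast the abstract draws with threshold-based calling.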

preprint, 2016, arXiv

Clustering pipeline for determining consensus sequences in targeted next-generation sequencing

Analysis of targeted genomic sequencing data from next-generation-sequencing (NGS) technologies typically involves mapping reads to a reference sequence or clustering reads. For a number of species a reference genome is not available, so the analysis of targeted sequencing data, for example of polymorphic structural variation caused by mobile elements, is difficult; clustering methods are preferred for such data analysis. Clustering of reads requires a clustering threshold parameter, which is used to compare and group reads. However, determining the optimal clustering threshold for a read dataset is challenging because of different sequence composition, the number of sequences present, and the amount of sequencing error in the dataset. High values of the clustering threshold parameter can falsely inflate the number of recovered genomic regions, while low values of the clustering threshold can merge reads from distinct regions into a single cluster. Thus, an algorithm that can empirically determine the clustering threshold is needed. We propose a pipeline for clustering genomic sequences wherein the clustering threshold is empirically determined from the NGS data. The optimal threshold is decided based on two internal clustering measures which assess clusters for small intra-cluster diameters and large inter-cluster distances. We evaluate the pipeline on two simulated datasets derived from human genome sequence, simulating different genomic regions and sequencing depths. The total number of clusters obtained from our pipeline is closer to the actual number of reference sequences when compared to a single round of clustering. Also, the number of clusters whose consensus sequence matches a corresponding reference sequence is higher in our pipeline. We observe that the presence of repeat regions affects clustering accuracy.

preprint, 2016, arXiv

Incorporating Hierarchical Structure Into Dynamic Systems: An Application Of Estimating HIV Epidemics At Sub-National And Sub-Population Level

Dynamic models have been successfully used in producing estimates of HIV epidemics at national level, due to their epidemiological nature and their ability to simultaneously estimate prevalence, incidence, and mortality rates. Recently, HIV interventions and policies have required more information at sub-national and sub-population levels to support local planning, decision making and resource allocation. Unfortunately, many areas and high-risk groups lack sufficient data for deriving stable and reliable results, and this is a critical technical barrier to more stratified estimates. One solution is to borrow information from other areas and groups within the same country. However, directly assuming hierarchical structures within the HIV dynamic models is complicated and computationally time consuming. In this paper, we propose a simple and innovative way to incorporate the hierarchical information into the dynamic systems by using auxiliary data. The proposed method efficiently uses information from multiple areas and risk groups within each country without increasing the computational burden. As a result, the new model improves predictive ability in general with especially significant improvement in areas and risk groups with sparse data.

preprint, 2015, arXiv

Estimating HIV Epidemics for Sub-National Areas

As the global HIV pandemic enters its fourth decade, increasing numbers of surveillance sites have been established, which allows countries to look into the epidemics at a finer scale, e.g. at sub-national levels. Currently, the epidemic models have been applied independently to the sub-national areas within countries. However, the availability and quality of the data vary widely, which leads to biased and unreliable estimates for areas with very little data. We propose to overcome this issue by introducing the dependence of the parameters across areas in a mixture model. The joint distribution of the parameters in multiple areas can be approximated directly from the results of independent fits without needing to refit the data or unpack the software. As a result, the mixture model has better predictive ability than the independent model as shown in examples of multiple countries in Sub-Saharan Africa.

preprint, 2014, arXiv

A Hierarchical Model for Estimating HIV Epidemics

As the global HIV pandemic enters its fourth decade, increasing numbers of surveillance sites have been established, which allows countries to look into the epidemics at a finer scale, e.g. at sub-national level. However, the epidemic models have been applied independently to the sub-national areas within countries. An important technical barrier is that the availability and quality of the data vary widely from area to area, and many areas lack data for deriving stable and reliable results. To improve the accuracy of the results in areas with little data, we propose a hierarchical model that utilizes information efficiently by assuming similar characteristics of the epidemics across areas within one country. The joint distribution of the parameters in the hierarchical model can be approximated directly from the results of independent fits without needing to refit the data. As a result, the hierarchical model has better predictive ability than the independent model as shown in examples of multiple countries in Sub-Saharan Africa.

preprint, 2012, arXiv

Estimation and Clustering with Infinite Rankings

This paper presents a natural extension of stagewise ranking to the case of infinitely many items. We introduce the infinite generalized Mallows model (IGM), describe its properties and give procedures to estimate it from data. For estimation of multimodal distributions we introduce the Exponential-Blurring-Mean-Shift nonparametric clustering algorithm. The experiments highlight the properties of the new model and demonstrate that infinite models can be simple, elegant and practical.
