Source author record

Shane T. Jensen

Shane T. Jensen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Applications; Methodology; math.ST; Molecular Networks; Statistics Theory; Computation; Genomics; Populations and Evolution; Quantitative Methods

Catalog footprint

What is connected

14 works

9 topics

4 close collaborators

Preprint, 2022, arXiv

Clustering Areal Units at Multiple Levels of Resolution to Model Crime in Philadelphia

Estimation of the spatial heterogeneity in crime incidence across an entire city is an important step towards reducing crime and increasing our understanding of the physical and social functioning of urban environments. This is a difficult modeling endeavor since crime incidence can vary smoothly across space and time but there also exist physical and social barriers that result in discontinuities in crime rates between different regions within a city. A further difficulty is that there are different levels of resolution that can be used for defining regions of a city in order to analyze crime. To address these challenges, we develop a Bayesian non-parametric approach for the clustering of urban areal units at different levels of resolution simultaneously. Our approach is evaluated with an extensive synthetic data study and then applied to the estimation of crime incidence at various levels of resolution in the city of Philadelphia.

Preprint, 2022, arXiv

Crime in Philadelphia: Bayesian Clustering with Particle Optimization

Accurate estimation of the change in crime over time is a critical first step towards better understanding of public safety in large urban environments. Bayesian hierarchical modeling is a natural way to study spatial variation in urban crime dynamics at the neighborhood level, since it facilitates principled "sharing of information" between spatially adjacent neighborhoods. Typically, however, cities contain many physical and social boundaries that may manifest as spatial discontinuities in crime patterns. In this situation, standard prior choices often yield overly-smooth parameter estimates, which can ultimately produce mis-calibrated forecasts. To prevent potential over-smoothing, we introduce a prior that partitions the set of neighborhoods into several clusters and encourages spatial smoothness within each cluster. In terms of model implementation, conventional stochastic search techniques are computationally prohibitive, as they must traverse a combinatorially vast space of partitions. We introduce an ensemble optimization procedure that simultaneously identifies several high probability partitions by solving one optimization problem using a new local search strategy. We then use the identified partitions to estimate crime trends in Philadelphia between 2006 and 2017. On simulated and real data, our proposed method demonstrates good estimation and partition selection performance.

Preprint, 2016, arXiv

Partial Information Framework: Model-Based Aggregation of Estimates from Diverse Information Sources

Prediction polling is an increasingly popular form of crowdsourcing in which multiple participants estimate the probability or magnitude of some future event. These estimates are then aggregated into a single forecast. Historically, randomness in scientific estimation has been generally assumed to arise from unmeasured factors which are viewed as measurement noise. However, when combining subjective estimates, heterogeneity stemming from differences in the participants' information is often more important than measurement noise. This paper formalizes information diversity as an alternative source of such heterogeneity and introduces a novel modeling framework that is particularly well-suited for prediction polls. A practical specification of this framework is proposed and applied to the task of aggregating probability and point estimates from two real-world prediction polls. In both cases our model outperforms standard measurement-error-based aggregators, hence providing evidence in favor of information diversity being the more important source of heterogeneity.

Preprint, 2015, arXiv

Locating recombination hot spots in genomic sequences through the singular value decomposition

Locating recombination hotspots in genomic data is an important but difficult task. Current methods frequently rely on estimating complicated models at high computational cost. In this paper we develop an extremely fast, scalable method for inferring recombination hot spots in a population of genomic sequences that is based on the singular value decomposition. Our method performs well in several synthetic data scenarios. We also apply our technique to a real data investigation of the evolution of drug therapy resistance in a population of HIV genomic sequences. Finally, we compare our method both on real and simulated data to a state of the art algorithm.

Preprint, 2015, arXiv

openWAR: An Open Source System for Evaluating Overall Player Performance in Major League Baseball

Within baseball analytics, there is substantial interest in comprehensive statistics intended to capture overall player performance. One such measure is Wins Above Replacement (WAR), which aggregates the contributions of a player in each facet of the game: hitting, pitching, baserunning, and fielding. However, current versions of WAR depend upon proprietary data, ad hoc methodology, and opaque calculations. We propose a competitive aggregate measure, openWAR, that is based upon public data and methodology with greater rigor and transparency. We discuss a principled standard for the nebulous concept of a "replacement" player. Finally, we use simulation-based techniques to provide interval estimates for our openWAR measure.

Preprint, 2015, arXiv

Power Weighted Densities for Time Series Data

While time series prediction is an important, actively studied problem, the predictive accuracy of time series models is complicated by non-stationarity. We develop a fast and effective approach to allow for non-stationarity in the parameters of a chosen time series model. In our power-weighted density (PWD) approach, observations in the distant past are down-weighted in the likelihood function relative to more recent observations, while still giving the practitioner control over the choice of data model. One of the most popular non-stationary techniques in the academic finance community, rolling window estimation, is a special case of our PWD approach. Our PWD framework is a simpler alternative compared to popular state-space methods that explicitly model the evolution of an underlying state vector. We demonstrate the benefits of our PWD approach in terms of predictive performance compared to both stationary models and alternative non-stationary methods. In a financial application to thirty industry portfolios, our PWD method has a significantly favorable predictive performance and draws a number of substantive conclusions about the evolution of the coefficients and the importance of market factors over time.
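
The down-weighting idea in this abstract can be illustrated with a minimal sketch. It assumes a Gaussian data model with fixed variance and a geometric weight schedule; both choices, and all function names, are illustrative assumptions that do not come from the abstract. Under these assumptions, maximizing the weighted log-likelihood over the mean gives a weighted average, and rolling-window estimation corresponds to weights of 1 inside the window and 0 outside it.

```python
def geometric_weights(n, alpha):
    """Weight alpha**age for each observation; the newest has age 0."""
    return [alpha ** (n - 1 - t) for t in range(n)]


def rolling_window_weights(n, window):
    """Weight 1 for the most recent `window` observations, 0 for older ones."""
    return [1.0 if t >= n - window else 0.0 for t in range(n)]


def weighted_gaussian_mean(ys, weights):
    """Maximizer over mu of sum_t w_t * log N(y_t | mu, sigma^2), sigma fixed."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * y for w, y in zip(weights, ys)) / total


# Toy series with a level shift halfway through (made-up numbers).
series = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
full_sample = weighted_gaussian_mean(series, [1.0] * len(series))
rolling = weighted_gaussian_mean(series, rolling_window_weights(len(series), 3))
discounted = weighted_gaussian_mean(series, geometric_weights(len(series), 0.5))
```

In this toy series the equally weighted estimate averages across the level shift, while the rolling-window and geometrically discounted estimates sit closer to the recent level.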

Preprint, 2015, arXiv

Probabilistic Approach for Evaluating Metabolite Sample Integrity

The success of metabolomics studies depends upon the "fitness" of each biological sample used for analysis: it is critical that metabolite levels reported for a biological sample represent an accurate snapshot of the studied organism's metabolite profile at time of sample collection. Numerous factors may compromise metabolite sample fitness, including chemical and biological factors which intervene during sample collection, handling, storage, and preparation for analysis. We propose a probabilistic model for the quantitative assessment of metabolite sample fitness. Collection and processing of nuclear magnetic resonance (NMR) and ultra-performance liquid chromatography-mass spectrometry (UPLC-MS) metabolomics data are discussed. Feature selection methods utilized for multivariate data analysis are briefly reviewed, including feature clustering and computation of latent vectors using spectral methods. We propose that the time-course of metabolite changes in samples stored at different temperatures may be utilized to identify changing-metabolite-to-stable-metabolite ratios as markers of sample fitness. Tolerance intervals may be computed to characterize these ratios among fresh samples. In order to discover additional structure in the data relevant to sample fitness, we propose using data labeled according to these ratios to train a Dirichlet process mixture model (DPMM) for assessing sample fitness. DPMMs are highly intuitive since they model the metabolite levels in a sample as arising from a combination of processes including, e.g., normal biological processes and degradation- or contamination-inducing processes. The outputs of a DPMM are probabilities that a sample is associated with a given process, and these probabilities may be incorporated into a final classifier for sample fitness.

Preprint, 2014, arXiv

Probability aggregation in time-series: Dynamic hierarchical modeling of sparse expert beliefs

Most subjective probability aggregation procedures use a single probability judgment from each expert, even though it is common for experts studying real problems to update their probability estimates over time. This paper advances into unexplored areas of probability aggregation by considering a dynamic context in which experts can update their beliefs at random intervals. The updates occur very infrequently, resulting in a sparse data set that cannot be modeled by standard time-series procedures. In response to the lack of appropriate methodology, this paper presents a hierarchical model that takes into account the expert's level of self-reported expertise and produces aggregate probabilities that are sharp and well calibrated both in- and out-of-sample. The model is demonstrated on a real-world data set that includes over 2300 experts making multiple probability forecasts over two years on different subsets of 166 international political events.

Preprint, 2014, arXiv

Variable selection for BART: An application to gene regulation

We consider the task of discovering gene regulatory networks, which are defined as sets of genes and the corresponding transcription factors which regulate their expression levels. This can be viewed as a variable selection problem, potentially with high dimensionality. Variable selection is especially challenging in high-dimensional settings, where it is difficult to detect subtle individual effects and interactions between predictors. Bayesian Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a novel nonparametric alternative to parametric regression approaches, such as the lasso or stepwise regression, especially when the number of relevant predictors is sparse relative to the total number of available predictors and the fundamental relationships are nonlinear. We develop a principled permutation-based inferential approach for determining when the effect of a selected predictor is likely to be real. Going further, we adapt the BART procedure to incorporate informed prior information about variable importance. We present simulations demonstrating that our method compares favorably to existing parametric and nonparametric procedures in a variety of data settings. To demonstrate the potential of our approach in a biological context, we apply it to the task of inferring the gene regulatory network in yeast (Saccharomyces cerevisiae). We find that our BART-based procedure is best able to recover the subset of covariates with the largest signal compared to other variable selection methods. The methods developed in this work are readily available in the R package bartMachine.

Preprint, 2013, arXiv

Estimating Player Contribution in Hockey with Regularized Logistic Regression

We present a regularized logistic regression model for evaluating player contributions in hockey. The traditional metric for this purpose is the plus-minus statistic, which allocates a single unit of credit (for or against) to each player on the ice for a goal. However, plus-minus scores measure only the marginal effect of players, do not account for sample size, and provide a very noisy estimate of performance. We investigate a related regression problem: what does each player on the ice contribute, beyond aggregate team performance and other factors, to the odds that a given goal was scored by their team? Due to the large-p (number of players) and imbalanced design setting of hockey analysis, a major part of our contribution is a careful treatment of prior shrinkage in model estimation. We showcase two recently developed techniques -- for posterior maximization or simulation -- that make such analysis feasible. Each approach is accompanied with publicly available software and we include the simple commands used in our analysis. Our results show that most players do not stand out as measurably strong (positive or negative) contributors. This allows the stars to really shine, reveals diamonds in the rough overlooked by earlier analyses, and argues that some of the highest paid players in the league are not making contributions worth their expense.
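
The traditional plus-minus statistic described in this abstract can be tallied in a few lines. The sketch below only illustrates that baseline; it does not reproduce the regularized regression model. The input layout (pairs of on-ice player lists) and the player labels are hypothetical.

```python
from collections import Counter


def plus_minus(goals):
    """Tally plus-minus from (scoring_players, conceding_players) pairs."""
    score = Counter()
    for scoring_players, conceding_players in goals:
        for player in scoring_players:
            score[player] += 1  # on the ice for a goal for
        for player in conceding_players:
            score[player] -= 1  # on the ice for a goal against
    return dict(score)


# Hypothetical goal events: labels are placeholders, not real records.
goals = [
    (["A", "B"], ["X", "Y"]),
    (["A", "C"], ["X", "Z"]),
    (["X", "Y"], ["B", "C"]),
]
totals = plus_minus(goals)
```

Every player on the ice receives the same unit of credit or blame for a goal, which is why the abstract describes plus-minus as a marginal and noisy measure.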

Preprint, 2012, arXiv

A Level-Set Hit-and-Run Sampler for Quasi-Concave Distributions

We develop a new sampling strategy that uses the hit-and-run algorithm within level sets of the target density. Our method can be applied to any quasi-concave density, which covers a broad class of models. Our sampler performs well in high-dimensional settings, which we illustrate with a comparison to Gibbs sampling on a spike-and-slab mixture model. We also extend our method to exponentially-tilted quasi-concave densities, which arise often in Bayesian models consisting of a log-concave likelihood and quasi-concave prior density. Within this class of models, our method is effective at sampling from posterior distributions with high dependence between parameters, which we illustrate with a simple multivariate normal example. We also implement our level-set sampler on a Cauchy-normal model where we demonstrate the ability of our level set sampler to handle multi-modal posterior distributions.
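
For readers unfamiliar with hit-and-run, the sketch below shows the basic move on a single convex set. A Euclidean ball is used as a stand-in for one level set, since the upper level sets of a quasi-concave density are convex. The choice of a ball, the function names, and the fixed seed are assumptions made for illustration; the paper's full sampler, which also moves between levels, is not reproduced here.

```python
import math
import random


def random_direction(dim, rng):
    """Uniform direction on the unit sphere via normalized Gaussian draws."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0.0:
            return [c / norm for c in v]


def hit_and_run_step(x, radius, rng):
    """One hit-and-run move targeting the uniform law on {z : |z| <= radius}."""
    d = random_direction(len(x), rng)
    # Solve |x + t d|^2 = radius^2 for t: t^2 + 2 b t + c = 0, using |d| = 1.
    b = sum(xi * di for xi, di in zip(x, d))
    c = sum(xi * xi for xi in x) - radius * radius
    root = math.sqrt(max(b * b - c, 0.0))
    t = rng.uniform(-b - root, -b + root)  # uniform point on the chord
    return [xi + t * di for xi, di in zip(x, d)]


rng = random.Random(0)
point = [0.0, 0.0, 0.0]
samples = []
for _ in range(1000):
    point = hit_and_run_step(point, 1.0, rng)
    samples.append(point)
```

Each move draws a random line through the current point, finds where that line exits the set, and samples uniformly along the resulting chord, so the chain never leaves the set.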

Preprint, 2012, arXiv

Refining the Protein-Protein Interactome using Gene Expression Data

Proteins interact with other proteins within biological pathways, forming connected subgraphs in the protein-protein interactome (PPI). Proteins are often involved in multiple biological pathways which complicates interpretation of interactions between proteins. Gene expression data can assist our inference since genes within a particular pathway tend to have more correlated expression patterns than genes from distinct pathways. We provide an algorithm that uses gene expression information to remove inter-pathway protein-protein interactions, thereby simplifying the structure of the protein-protein interactome. This refined topology permits easier interpretation and greater biological coherence of multiple biological pathways simultaneously.

Preprint, 2010, arXiv

An Alternative Prior Process for Nonparametric Bayesian Clustering

Prior distributions play a crucial role in Bayesian approaches to clustering. Two commonly-used prior distributions are the Dirichlet and Pitman-Yor processes. In this paper, we investigate the predictive probabilities that underlie these processes, and the implicit "rich-get-richer" characteristic of the resulting partitions. We explore an alternative prior for nonparametric Bayesian clustering -- the uniform process -- for applications where the "rich-get-richer" property is undesirable. We also explore the cost of this process: partitions are no longer exchangeable with respect to the ordering of variables. We present new asymptotic and simulation-based results for the clustering characteristics of the uniform process and compare these with known results for the Dirichlet and Pitman-Yor processes. We compare performance on a real document clustering task, demonstrating the practical advantage of the uniform process despite its lack of exchangeability over orderings.
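
The "rich-get-richer" contrast can be made concrete by comparing predictive probabilities for the next item. The sketch below uses the standard Chinese-restaurant form of the Dirichlet process and, as an assumption about the uniform process, a rule in which every existing cluster is equally likely regardless of size. The function names and the concentration parameter `theta` are illustrative; the paper itself should be consulted for the exact definitions.

```python
def dirichlet_process_predictive(sizes, theta):
    """Existing cluster k has probability n_k / (n + theta); new: theta / (n + theta)."""
    n = sum(sizes)
    existing = [n_k / (n + theta) for n_k in sizes]
    new = theta / (n + theta)
    return existing, new


def uniform_process_predictive(sizes, theta):
    """Existing clusters each have probability 1 / (K + theta); new: theta / (K + theta)."""
    k = len(sizes)
    existing = [1.0 / (k + theta)] * k
    new = theta / (k + theta)
    return existing, new


sizes = [8, 1, 1]  # one large cluster and two singletons (made-up example)
dp_existing, dp_new = dirichlet_process_predictive(sizes, 1.0)
up_existing, up_new = uniform_process_predictive(sizes, 1.0)
```

Under the Dirichlet process rule the large cluster attracts the next item far more often than the singletons, whereas the uniform rule treats all three existing clusters alike.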

Preprint, 2007, arXiv

Bayesian variable selection and data integration for biological regulatory networks

A substantial focus of research in molecular biology is gene regulatory networks: the set of transcription factors and target genes which control the involvement of different biological processes in living cells. Previous statistical approaches for identifying gene regulatory networks have used gene expression data, ChIP binding data or promoter sequence data, but each of these resources provides only partial information. We present a Bayesian hierarchical model that integrates all three data types in a principled variable selection framework. The gene expression data are modeled as a function of the unknown gene regulatory network which has an informed prior distribution based upon both ChIP binding and promoter sequence data. We also present a variable weighting methodology for the principled balancing of multiple sources of prior information. We apply our procedure to the discovery of gene regulatory relationships in Saccharomyces cerevisiae (Yeast) for which we can use several external sources of information to validate our results. Our inferred relationships show greater biological relevance on the external validation measures than previous data integration methods. Our model also estimates synergistic and antagonistic interactions between transcription factors, many of which are validated by previous studies. We also evaluate the results from our procedure for weighting multiple sources of prior information. Finally, we discuss our methodology in the context of previous approaches to data integration and Bayesian variable selection.
