Source author record

Kory D. Johnson

Kory D. Johnson appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology, math.ST, Populations and Evolution, Statistics Theory, Machine Learning, Quantitative Methods

Catalog footprint

What is connected

7 works

6 topics

4 close collaborators

Preprint, 2022, arXiv

Impartial Predictive Modeling and the Use of Proxy Variables

Fairness aware data mining (FADM) aims to prevent algorithms from discriminating against protected groups. The literature has come to an impasse as to what constitutes explainable variability as opposed to discrimination. This distinction hinges on a rigorous understanding of the role of proxy variables; i.e., those variables which are associated with both the protected feature and the outcome of interest. We demonstrate that fairness is achieved by ensuring impartiality with respect to sensitive characteristics and provide a framework for impartiality by accounting for different perspectives on the data generating process. In particular, fairness can only be precisely defined in a full-data scenario in which all covariates are observed. We then analyze how these models may be conservatively estimated via regression in partial-data settings. Decomposing the regression estimates provides insights into previously unexplored distinctions between explainable variability and discrimination that illuminate the use of proxy variables in fairness aware data mining.

Preprint, 2022, arXiv

Robust models of SARS-CoV-2 heterogeneity and control

In light of the continuing emergence of new SARS-CoV-2 variants and vaccines, we create a simulation framework for exploring possible infection trajectories under various scenarios. The situations of primary interest involve the interaction between three components: vaccination campaigns, non-pharmaceutical interventions (NPIs), and the emergence of new SARS-CoV-2 variants. Additionally, immunity waning and vaccine boosters are modeled to account for their growing importance. New infections are generated according to a hierarchical model in which people have a random, individual infectiousness. The model thus includes super-spreading observed in the COVID-19 pandemic. Our simulation functions as a dynamic compartment model in which an individual's history of infection, vaccination, and possible reinfection all play a role in their resistance to further infections. We present a risk measure for each SARS-CoV-2 variant, $ρ^\V$, that accounts for the amount of resistance within a population and show how this risk changes as the vaccination rate increases. Furthermore, by considering different population compositions in terms of previous infection and type of vaccination, we can learn about variants which pose differential risk to different countries. Different control strategies are implemented which aim both to suppress COVID-19 outbreaks when they occur and to relax restrictions when possible. We demonstrate that a controller that responds to the effective reproduction number in addition to case numbers is more efficient and effective in controlling new waves than one that monitors case numbers alone. This is of interest as the majority of the public discussion and well-known statistics deal primarily with case numbers.

Preprint, 2021, arXiv

Evidence suggests that SARS-CoV-2 rapid antigen tests provide benefits for epidemic control -- observations from Austrian schools

Rapid antigen tests detect proteins at the surface of virus particles, identifying the disease during its infectious phase. In contrast, PCR tests detect viral genomes; they can thus diagnose COVID-19 before the infectious phase but also react to remnants of the virus genome, even weeks after live virus ceases to be detectable in the respiratory tract. Furthermore, the logistics for administering the tests are different, with rapid antigen tests being much easier to administer at scale. In this article, we discuss the relative advantages of the different testing procedures and summarise evidence that shows that using antigen tests 2-3 times per week could become a powerful tool to suppress the COVID-19 pandemic. We also discuss the results of recent large-scale rapid antigen testing in Austrian schools. While our report on testing predates Delta, we have updated the review with recent data on viral loads in breakthrough infections and more information about testing efficacy, especially in children.

Preprint, 2020, arXiv

Adaptive, Distribution-Free Prediction Intervals for Deep Networks

The machine learning literature contains several constructions for prediction intervals that are intuitively reasonable but ultimately ad-hoc in that they do not come with provable performance guarantees. We present methods from the statistics literature that can be used efficiently with neural networks under minimal assumptions with guaranteed performance. We propose a neural network that outputs three values instead of a single point estimate and optimizes a loss function motivated by the standard quantile regression loss. We provide two prediction interval methods with finite sample coverage guarantees solely under the assumption that the observations are independent and identically distributed. The first method leverages the conformal inference framework and provides average coverage. The second method provides a new, stronger guarantee by conditioning on the observed data. Lastly, our loss function does not compromise the predictive accuracy of the network like other prediction interval methods. We demonstrate the ease of use of our procedures as well as their improvements over other methods on both simulated and real data. As most deep networks can easily be modified by our method to output predictions with valid prediction intervals, its use should become standard practice, much like reporting standard errors along with mean estimates.
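To make the conformal inference idea in this abstract concrete, here is a minimal sketch of generic split conformal prediction with absolute-residual scores. It is not the paper's three-output network or its loss function; the function name `split_conformal_interval`, the `predict` callable, and the calibration lists are hypothetical names chosen for illustration. Under the i.i.d. assumption mentioned in the abstract, this textbook construction covers a new observation with probability at least $1 - \alpha$ on average.

```python
import math


def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Return (lower, upper) for x_new using split conformal prediction.

    `predict` is any already-fitted point predictor (an assumption of this
    sketch); calib_x / calib_y are held-out data not used to fit it.
    """
    # Conformity scores: absolute residuals on the calibration set.
    scores = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))
    n = len(scores)
    # Finite-sample-corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested level.
        return (-math.inf, math.inf)
    q = scores[k - 1]
    center = predict(x_new)
    return (center - q, center + q)
```

The interval has the same width for every input, which is the usual limitation of this simplest variant; adaptive widths require a different score, such as one based on estimated quantiles.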

Preprint, 2020, arXiv

Fitting High-Dimensional Interaction Models with Error Control

There is a renewed interest in polynomial regression in the form of identifying influential interactions between features. In many settings, this takes place in a high-dimensional model, making the number of interactions unwieldy or computationally infeasible. Furthermore, it is difficult to analyze such spaces directly as they are often highly correlated. Standard feature selection issues remain, such as how to determine a final model that generalizes well. This paper solves these problems with a sequential algorithm called Revisiting Alpha-Investing (RAI). RAI is motivated by the principle of marginality and searches the feature-space of higher-order interactions by greedily building upon lower-order terms. RAI controls a notion of false rejections and comes with a performance guarantee relative to the best-subset model. This ensures that signal is identified while providing a valid stopping criterion to prevent over-selection. We apply RAI in a novel setting over a family of regressions in order to select gene-specific interaction models for differential expression profiling.
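The abstract does not spell out RAI itself, so the following is only a sketch of the classic alpha-investing rule that the algorithm's name refers to, as background for how sequential testing can control false rejections and supply a stopping criterion. The function name, the default wealth and payout values, and the choice to bet half of the affordable level at each step are all assumptions made for illustration.

```python
def alpha_investing(p_values, initial_wealth=0.05, payout=0.05):
    """Test hypotheses in order; return the indices of those rejected.

    Classic alpha-investing bookkeeping (not the RAI algorithm):
    a rejection earns `payout`, a failed test at level a costs a / (1 - a).
    """
    wealth = initial_wealth
    rejected = []
    for j, p in enumerate(p_values):
        if wealth <= 0:
            break  # no alpha-wealth left: stop testing
        # Any level <= wealth / (1 + wealth) keeps wealth non-negative;
        # betting half of that bound is an arbitrary illustrative choice.
        level = 0.5 * wealth / (1.0 + wealth)
        if p <= level:
            wealth += payout
            rejected.append(j)
        else:
            wealth -= level / (1.0 - level)
    return rejected
```

For example, `alpha_investing([0.001, 0.9, 0.0001])` rejects the first and third hypotheses: the first rejection replenishes the wealth that the failed second test then partly spends.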

Preprint, 2016, arXiv

Submodularity in Statistics: Comparing the Success of Model Selection Methods

We demonstrate the usefulness of submodularity in statistics as a characterization of the difficulty of the search problem of feature selection. The search problem is the ability of a procedure to identify an informative set of features as opposed to the performance of the optimal set of features. Submodularity arises naturally in this setting due to its connection to combinatorial optimization. In statistics, submodularity isolates cases in which collinearity makes the choice of model features difficult from those in which this task is routine. Researchers often report the signal-to-noise ratio to measure the difficulty of simulated data examples. A measure of submodularity should also be provided as it characterizes an independent component of difficulty. Furthermore, it is closely related to other statistical assumptions used in the development of the Lasso, Dantzig selector, and sure independence screening.

Preprint, 2015, arXiv

A Risk Ratio Comparison of $l_0$ and $l_1$ Penalized Regression

There has been an explosion of interest in using $l_1$-regularization in place of $l_0$-regularization for feature selection. We present theoretical results showing that while $l_1$-penalized linear regression never outperforms $l_0$-regularization by more than a constant factor, in some cases using an $l_1$ penalty is infinitely worse than using an $l_0$ penalty. We also compare algorithms for solving these two problems and show that although solutions can be found efficiently for the $l_1$ problem, the "optimal" $l_1$ solutions are often inferior to $l_0$ solutions found using greedy classic stepwise regression. Furthermore, we show that solutions obtained by solving the convex $l_1$ problem can be improved by selecting the best of the $l_1$ models (for different regularization penalties) by using an $l_0$ criterion. In other words, an approximate solution to the right problem can be better than the exact solution to the wrong problem.
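One textbook special case helps show how the two penalties differ, though it is not taken from the paper and says nothing about its risk ratio results. Assume an orthonormal design, so each least-squares coefficient $z$ can be treated separately. Minimizing $\frac{1}{2}(z - b)^2 + \lambda |b|$ gives soft thresholding, while minimizing $\frac{1}{2}(z - b)^2 + \frac{\lambda^2}{2} 1[b \neq 0]$ gives hard thresholding. The function names below are hypothetical.

```python
def soft_threshold(z, lam):
    """l_1 solution per coordinate: shrink toward zero by lam, then clip."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def hard_threshold(z, lam):
    """l_0 solution per coordinate: keep z unchanged if |z| > lam, else drop."""
    return z if abs(z) > lam else 0.0


if __name__ == "__main__":
    # Compare the two rules on a few illustrative coefficients.
    for z in (5.0, 1.5, 0.5, -3.0):
        print(z, soft_threshold(z, 1.0), hard_threshold(z, 1.0))
```

Both rules zero out the same small coefficients here, but the $l_1$ rule also shrinks every retained coefficient by $\lambda$, whereas the $l_0$ rule leaves retained coefficients untouched.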