Source author record

Peter Kraft

Peter Kraft appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Databases; Methodology; Distributed, Parallel, and Cluster Computing; Machine Learning; Cryptography and Security; Human-Computer Interaction; Populations and Evolution; Software Engineering

Catalog footprint

What is connected

7 works

8 topics

4 close collaborators

preprint, 2022, arXiv

Machine Learning with DBOS

We recently proposed a new cluster operating system stack, DBOS, centered on a DBMS. DBOS enables unique support for ML applications by encapsulating ML code within stored procedures, centralizing ancillary ML data, providing security built into the underlying DBMS, co-locating ML code and data, and tracking data and workflow provenance. Here we demonstrate a subset of these benefits around two ML applications. We first show that image classification and object detection models using GPUs can be served as DBOS stored procedures with performance competitive with existing systems. We then present a 1D CNN trained to detect anomalies in HTTP requests on DBOS-backed web services, achieving state-of-the-art results. We use this model to develop an interactive anomaly detection system and evaluate it through qualitative user feedback, demonstrating its usefulness as a proof of concept for future work to develop learned real-time security services on top of DBOS.

preprint, 2022, arXiv

Transactions Make Debugging Easy

We propose TROD, a novel transaction-oriented framework for debugging modern distributed web applications and online services. Our critical insight is that if applications store all state in databases and only access state transactionally, TROD can use lightweight always-on tracing to track the history of application state changes and data provenance, and then leverage the captured traces and transaction logs to faithfully replay or even test modified code retroactively on any past event. We demonstrate how TROD can simplify programming and debugging in production applications, list several research challenges and directions, and encourage the database and systems communities to drastically rethink the synergy between the way people develop and debug applications.

preprint, 2021, arXiv

Getting Genetic Ancestry Right for Science and Society

There is a scientific and ethical imperative to embrace a multidimensional, continuous view of ancestry and move away from continental ancestry categories.

preprint, 2020, arXiv

Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference

Systems for ML inference are widely deployed today, but they typically optimize ML inference workloads using techniques designed for conventional data serving workloads and miss critical opportunities to leverage the statistical nature of ML. In this paper, we present Willump, an optimizer for ML inference that introduces two statistically-motivated optimizations targeting ML applications whose performance bottleneck is feature computation. First, Willump automatically cascades feature computation for classification queries: Willump classifies most data inputs using only high-value, low-cost features selected through empirical observations of ML model performance, improving query performance by up to 5x without statistically significant accuracy loss. Second, Willump accurately approximates ML top-K queries, discarding low-scoring inputs with an automatically constructed approximate model and then ranking the remainder with a more powerful model, improving query performance by up to 10x with minimal accuracy loss. Willump automatically tunes these optimizations' parameters to maximize query performance while meeting an accuracy target. Moreover, Willump complements these statistical optimizations with compiler optimizations to automatically generate fast inference code for ML applications. We show that Willump improves the end-to-end performance of real-world ML inference pipelines curated from major data science competitions by up to 16x without statistically significant loss of accuracy.
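The cascade idea in the abstract above can be illustrated with a minimal sketch. This is not Willump's implementation: the function `cascade_predict`, the two toy models, the field names `cheap_score` and `expensive_score`, and the fixed confidence threshold of 0.9 are all hypothetical choices made for illustration. Willump itself selects features and tunes such parameters automatically.

```python
from typing import Callable, List, Sequence, Tuple


def cascade_predict(
    inputs: Sequence[dict],
    cheap_model: Callable[[dict], float],
    full_model: Callable[[dict], float],
    confidence_threshold: float = 0.9,  # hypothetical value, not from the paper
) -> Tuple[List[int], int]:
    """Classify inputs, calling the full model only when the cheap model is unsure.

    Both models return the probability of the positive class.
    Returns (labels, number of inputs escalated to the full model).
    """
    labels: List[int] = []
    escalated = 0
    for x in inputs:
        p = cheap_model(x)
        confidence = max(p, 1.0 - p)
        if confidence < confidence_threshold:
            # The cheap features were not decisive, so pay for the full model.
            p = full_model(x)
            escalated += 1
        labels.append(1 if p >= 0.5 else 0)
    return labels, escalated


# Toy stand-ins: the cheap model reads one inexpensive feature,
# the full model also uses a feature that is assumed costly to compute.
def cheap_model(x: dict) -> float:
    return x["cheap_score"]


def full_model(x: dict) -> float:
    return 0.5 * x["cheap_score"] + 0.5 * x["expensive_score"]


inputs = [
    {"cheap_score": 0.98, "expensive_score": 0.9},
    {"cheap_score": 0.03, "expensive_score": 0.1},
    {"cheap_score": 0.55, "expensive_score": 0.1},
]
labels, escalated = cascade_predict(inputs, cheap_model, full_model)
print(labels, escalated)
```

In this toy run only the third input falls below the threshold, so the expensive feature would be needed for one input out of three.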

preprint, 2014, arXiv

Control Function Assisted IPW Estimation with a Secondary Outcome in Case-Control Studies

Case-control studies are designed to study associations between risk factors and a single, primary outcome. Information about additional, secondary outcomes is also collected, but association studies targeting such secondary outcomes should account for the case-control sampling scheme; otherwise results may be biased. Often, one uses inverse probability weighted (IPW) estimators to estimate population effects in such studies. However, these estimators are inefficient relative to estimators that make additional assumptions about the data generating mechanism. We propose a class of estimators for the effect of risk factors on a secondary outcome in case-control studies, when the mean is modeled using either the identity or the log link. The proposed estimator combines IPW with a mean zero control function that depends explicitly on a model for the primary disease outcome. The efficient estimator in our class of estimators reduces to standard IPW when the model for the primary disease outcome is unrestricted, and is more efficient than standard IPW when the model is either parametric or semiparametric.
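As a rough illustration of the standard IPW baseline that the abstract above builds on, the sketch below reweights a case-control sample to estimate the population mean of a secondary outcome. It does not implement the proposed control-function estimator. The function `ipw_mean`, the sampling probabilities (0.5 for cases, 0.05 for controls), and the outcome values are invented for the example.

```python
from typing import Sequence


def ipw_mean(outcomes: Sequence[float], sampling_probs: Sequence[float]) -> float:
    """Normalized inverse-probability-weighted mean.

    Each subject is weighted by 1 / P(subject was sampled), so subjects from
    under-sampled groups (typically controls) count for more.
    """
    weights = [1.0 / p for p in sampling_probs]
    weighted_sum = sum(w * y for w, y in zip(weights, outcomes))
    return weighted_sum / sum(weights)


# Hypothetical sample: 4 cases followed by 4 controls, binary secondary outcome.
secondary_outcome = [1, 1, 1, 0, 0, 0, 0, 1]
# Assumed sampling probabilities: cases are heavily over-sampled.
sampling_probs = [0.5] * 4 + [0.05] * 4

naive = sum(secondary_outcome) / len(secondary_outcome)
weighted = ipw_mean(secondary_outcome, sampling_probs)
print(f"naive mean = {naive:.3f}, IPW mean = {weighted:.3f}")
```

Because the secondary outcome is more common among the over-sampled cases in this toy data, the unweighted mean overstates its population frequency, and the weighted estimate is pulled toward the value seen in controls.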

preprint, 2013, arXiv

Maximizing the Power of Principal Components Analysis of Correlated Phenotypes in Genome-wide Association Studies

Principal component analysis (PCA) is a useful statistical technique that is commonly used for multivariate analysis of correlated variables. It is usually applied as a dimension reduction method: the top principal components (PCs) explaining most of the total variance are tested for association with a predictor of interest, and the remaining PCs are ignored. This strategy has been widely applied in genetic epidemiology; however, some of its aspects are not well appreciated in the context of single nucleotide polymorphism (SNP) association testing. In this study, we review the theoretical basis of PCA and its behavior when testing for association between a SNP and two correlated traits under various scenarios. We then evaluate with simulations the power of several different PCA-based strategies when analyzing up to 100 correlated traits. We show that, contrary to widespread practice, testing only the top PCs can be dramatically underpowered, since PCs explaining a low amount of the total phenotypic variance can harbor substantial genetic associations. Furthermore, we demonstrate that PC-based strategies that use all PCs have great potential to detect negatively pleiotropic genetic variants (e.g. variants with opposite effects on positively correlated traits) and genetic variants that are exclusively associated with a single trait, but only achieve a moderate gain in power to detect positively pleiotropic genetic loci. Finally, the genome-wide association study of five correlated coagulation traits in 685 subjects from the MARTHA study confirms these results. The joint analysis of the five PCs from the coagulation traits identified two new candidate SNPs, which were most strongly associated with the 5th PC that explained the smallest amount of phenotypic variance.
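The two-trait case mentioned in the abstract above can be worked out by hand, which shows why a low-variance PC can carry the signal. The notation here ($Y_1, Y_2, \rho, \beta_1, \beta_2$) is introduced for illustration and is not taken from the paper. The sketch assumes two standardized traits with correlation $\rho > 0$ and SNP effects small enough that they barely change the trait variances.

```latex
% PCs of the 2x2 correlation matrix of standardized traits Y_1, Y_2:
\[
\mathrm{PC}_1 = \frac{Y_1 + Y_2}{\sqrt{2}}, \quad \operatorname{Var}(\mathrm{PC}_1) = 1 + \rho,
\qquad
\mathrm{PC}_2 = \frac{Y_1 - Y_2}{\sqrt{2}}, \quad \operatorname{Var}(\mathrm{PC}_2) = 1 - \rho .
\]
% If a SNP G has per-allele effects beta_1 on Y_1 and beta_2 on Y_2,
% its effects on the PCs follow by linearity:
\[
\beta_{\mathrm{PC}_1} = \frac{\beta_1 + \beta_2}{\sqrt{2}},
\qquad
\beta_{\mathrm{PC}_2} = \frac{\beta_1 - \beta_2}{\sqrt{2}} .
\]
% Opposite effects on positively correlated traits (beta_1 = beta, beta_2 = -beta):
\[
\beta_{\mathrm{PC}_1} = 0,
\qquad
\beta_{\mathrm{PC}_2} = \sqrt{2}\,\beta,
\qquad
\frac{\beta_{\mathrm{PC}_2}^{2}}{\operatorname{Var}(\mathrm{PC}_2)} = \frac{2\beta^{2}}{1 - \rho} .
\]
```

Under these assumptions the top PC carries no signal at all for such a variant, while the signal-to-variance ratio on the second PC grows as $\rho$ approaches 1, which is consistent with the abstract's point that testing only the top PCs can miss associations.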

preprint, 2010, arXiv

Replication in Genome-Wide Association Studies

Replication helps ensure that a genotype-phenotype association observed in a genome-wide association (GWA) study represents a credible association and is not a chance finding or an artifact due to uncontrolled biases. We discuss prerequisites for exact replication, issues of heterogeneity, advantages and disadvantages of different methods of data synthesis across multiple studies, frequentist vs. Bayesian inferences for replication, and challenges that arise from multi-team collaborations. While consistent replication can greatly improve the credibility of a genotype-phenotype association, it may not eliminate spurious associations due to biases shared by many studies. Conversely, lack of replication in well-powered follow-up studies usually invalidates the initially proposed association, although occasionally it may point to differences in linkage disequilibrium or effect modifiers across studies.