Researcher profile

Tim Kraska

Tim Kraska contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
20works
0followers
9topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

20 published item(s)

preprint2022arXiv

ExSample: Efficient Searches on Video Repositories through Adaptive Sampling

Capturing and processing video is increasingly common as cameras become cheaper to deploy. At the same time, rich video understanding methods have progressed greatly in the last decade. As a result, many organizations now have massive repositories of video data, with applications in mapping, navigation, autonomous driving, and other areas. Because state-of-the-art object detection methods are slow and expensive, our ability to process even simple ad-hoc object search queries ('find 100 traffic lights in dashcam video') over this accumulated data lags far behind our ability to collect it. Processing video at reduced sampling rates is a reasonable default strategy for these types of queries, however, the ideal sampling rate is both data and query dependent. We introduce ExSample, a low cost framework for object search over unindexed video that quickly processes search queries by adapting the amount and location of sampled frames to the particular data and query being processed. ExSample prioritizes the processing of frames in a video repository so that processing is focused in portions of video that most likely contain objects of interest. It continually re-prioritizes processing based on feedback from previously processed frames. On large, real-world datasets, ExSample reduces processing time by up to 6x over an efficient random sampling baseline and by several orders of magnitude over state-of-the-art methods that train specialized per-query surrogate models. ExSample is thus a key component in building cost-efficient video data management systems.

preprint2022arXiv

LSI: A Learned Secondary Index Structure

Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data. LSI works by building a learned index over a permutation vector, which allows binary search to performed on the unsorted base data using random access. We additionally augment LSI with a fingerprint vector to accelerate equality lookups. We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6x more space efficient.

preprint2022arXiv

The Case for Learned In-Memory Joins

In-memory join is an essential operator in any database engine. It has been extensively investigated in the database literature. In this paper, we study whether exploiting the CDF-based learned models to boost the join performance is practical or not. To the best of our knowledge, we are the first to fill this gap. We investigate the usage of CDF-based partitioning and learned indexes (e.g., Recursive Model Indexes (RMI) and RadixSpline) in the three join categories; indexed nested loop join (INLJ), sort-based joins (SJ) and hash-based joins (HJ). Our study shows that there is a room to improve the performance of INLJ and SJ categories through our proposed optimized learned variants. Our experimental analysis showed that these proposed learned variants of INLJ and SJ consistently outperform the state-of-the-art techniques.

preprint2021arXiv

Flow-Loss: Learning Cardinality Estimates That Matter

Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's cost model and dynamic programming search algorithm with analytical functions. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain plan graph in which paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark, which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost model) and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models improve performance significantly over PostgreSQL and are close to the optimal performance (using true cardinalities). However, the Q-Error trained model degrades significantly when evaluated on queries that are slightly different (e.g., similar but not identical query templates), while the Flow-Loss trained model generalizes better to such situations. For example, the Flow-Loss model achieves up to 1.5x better runtimes on unseen templates compared to the Q-Error model, despite leveraging the same model architecture and training data.

preprint2020arXiv

ALEX: An Updatable Adaptive Learned Index

Recent work on "learned indexes" has changed the way we look at the decades-old field of DBMS indexing. The key idea is that indexes can be thought of as "models" that predict the position of a key in a dataset. Indexes can, thus, be learned. The original work by Kraska et al. shows that a learned index beats a B+Tree by a factor of up to three in search time and by an order of magnitude in memory footprint. However, it is limited to static, read-only workloads. In this paper, we present a new learned index called ALEX which addresses practical issues that arise when implementing learned indexes for workloads that contain a mix of point lookups, short range queries, inserts, updates, and deletes. ALEX effectively combines the core insights from learned indexes with proven storage and indexing techniques to achieve high performance and low memory footprint. On read-only workloads, ALEX beats the learned index from Kraska et al. by up to 2.2X on performance with up to 15X smaller index size. Across the spectrum of read-write workloads, ALEX beats B+Trees by up to 4.1X while never performing worse, with up to 2000X smaller index size. We believe ALEX presents a key step towards making learned indexes practical for a broader class of database workloads with dynamic updates.

preprint2020arXiv

ARDA: Automatic Relational Data Augmentation for Machine Learning

Automatic machine learning (\AML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal ``human-in-the-loop'' involvement. We present \system, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented data set such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.

preprint2020arXiv

Chiller: Contention-centric Transaction Execution and Data Partitioning for Modern Networks

Distributed transactions on high-overhead TCP/IP-based networks were conventionally considered to be prohibitively expensive and thus were avoided at all costs. To that end, the primary goal of almost any existing partitioning scheme is to minimize the number of cross-partition transactions. However, with the new generation of fast RDMA-enabled networks, this assumption is no longer valid. In fact, recent work has shown that distributed databases can scale even when the majority of transactions are cross-partition. In this paper, we first make the case that the new bottleneck which hinders truly scalable transaction processing in modern RDMA-enabled databases is data contention, and that optimizing for data contention leads to different partitioning layouts than optimizing for the number of distributed transactions. We then present Chiller, a new approach to data partitioning and transaction execution, which aims to minimize data contention for both local and distributed transactions. Finally, we evaluate Chiller using various workloads, and show that our partitioning and execution strategy outperforms traditional partitioning techniques which try to avoid distributed transactions, by up to a factor of 2.

preprint2020arXiv

Context-Aware Parse Trees

The simplified parse tree (SPT) presented in Aroma, a state-of-the-art code recommendation system, is a tree-structured representation used to infer code semantics by capturing program \emph{structure} rather than program \emph{syntax}. This is a departure from the classical abstract syntax tree, which is principally driven by programming language syntax. While we believe a semantics-driven representation is desirable, the specifics of an SPT's construction can impact its performance. We analyze these nuances and present a new tree structure, heavily influenced by Aroma's SPT, called a \emph{context-aware parse tree} (CAPT). CAPT enhances SPT by providing a richer level of semantic representation. Specifically, CAPT provides additional binding support for language-specific techniques for adding semantically-salient features, and language-agnostic techniques for removing syntactically-present but semantically-irrelevant features. Our research quantitatively demonstrates the value of our proposed semantically-salient features, enabling a specific CAPT configuration to be 39\% more accurate than SPT across the 48,610 programs we analyzed.

preprint2020arXiv

DBOS: A Proposal for a Data-Centric Operating System

Current operating systems are complex systems that were designed before today's computing environments. This makes it difficult for them to meet the scalability, heterogeneity, availability, and security challenges in current cloud and parallel computing environments. To address these problems, we propose a radically new OS design based on data-centric architecture: all operating system state should be represented uniformly as database tables, and operations on this state should be made via queries from otherwise stateless tasks. This design makes it easy to scale and evolve the OS without whole-system refactoring, inspect and debug system state, upgrade components without downtime, manage decisions using machine learning, and implement sophisticated security features. We discuss how a database OS (DBOS) can improve the programmability and performance of many of today's most important applications and propose a plan for the development of a DBOS proof of concept.

preprint2020arXiv

FITing-Tree: A Data-aware Index Structure

Index structures are one of the most important tools that DBAs leverage to improve the performance of analytics and transactional workloads. However, building several indexes over large datasets can often become prohibitive and consume valuable system resources. In fact, a recent study showed that indexes created as part of the TPC-C benchmark can account for 55% of the total memory available in a modern DBMS. This overhead consumes valuable and expensive main memory, and limits the amount of space available to store new data or process existing data. In this paper, we present FITing-Tree, a novel form of a learned index which uses piece-wise linear functions with a bounded error specified at construction time. This error knob provides a tunable parameter that allows a DBA to FIT an index to a dataset and workload by being able to balance lookup performance and space consumption. To navigate this tradeoff, we provide a cost model that helps determine an appropriate error parameter given either (1) a lookup latency requirement (e.g., 500ns) or (2) a storage budget (e.g., 100MB). Using a variety of real-world datasets, we show that our index is able to provide performance that is comparable to full index structures while reducing the storage footprint by orders of magnitude.

preprint2020arXiv

RadixSpline: A Single-Pass Learned Index

Recent research has shown that learned models can outperform state-of-the-art index structures in size and lookup performance. While this is a very promising result, existing learned structures are often cumbersome to implement and are slow to build. In fact, most approaches that we are aware of require multiple training passes over the data. We introduce RadixSpline (RS), a learned index that can be built in a single pass over the data and is competitive with state-of-the-art learned index models, like RMI, in size and lookup performance. We evaluate RS using the SOSD benchmark and show that it achieves competitive results on all datasets, despite the fact that it only has two parameters.

preprint2020arXiv

Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes.

preprint2019arXiv

Learning Multi-dimensional Indexes

Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this paper, we introduce Flood, a multi-dimensional in-memory index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system.

preprint2019arXiv

Neo: A Learned Query Optimizer

Query optimization is one of the most challenging problems in database systems. Despite the progress made over the past decades, query optimizers remain extremely complex components that require a great deal of hand-tuning for specific workloads and datasets. Motivated by this shortcoming and inspired by recent advances in applying machine learning to data management challenges, we introduce Neo (Neural Optimizer), a novel learning-based query optimizer that relies on deep neural networks to generate query executions plans. Neo bootstraps its query optimization model from existing optimizers and continues to learn from incoming queries, building upon its successes and learning from its failures. Furthermore, Neo naturally adapts to underlying data patterns and is robust to estimation errors. Experimental results demonstrate that Neo, even when bootstrapped from a simple optimizer like PostgreSQL, can learn a model that offers similar performance to state-of-the-art commercial optimizers, and in some cases even surpass them.

preprint2014arXiv

Tupleware: Redefining Modern Analytics

There is a fundamental discrepancy between the targeted and actual users of current analytics frameworks. Most systems are designed for the data and infrastructure of the Googles and Facebooks of the world---petabytes of data distributed across large cloud deployments consisting of thousands of cheap commodity machines. Yet, the vast majority of users operate clusters ranging from a few to a few dozen nodes, analyze relatively small datasets of up to a few terabytes, and perform primarily compute-intensive operations. Targeting these users fundamentally changes the way we should build analytics systems. This paper describes the design of Tupleware, a new system specifically aimed at the challenges faced by the typical user. Tupleware's architecture brings together ideas from the database, compiler, and programming languages communities to create a powerful end-to-end solution for data analysis. We propose novel techniques that consider the data, computations, and hardware together to achieve maximum performance on a case-by-case basis. Our experimental evaluation quantifies the impact of our novel techniques and shows orders of magnitude performance improvement over alternative systems.

preprint2013arXiv

MLI: An API for Distributed Machine Learning

MLI is an Application Programming Interface designed to address the challenges of building Machine Learn- ing algorithms in a distributed setting based on data-centric computing. Its primary goal is to simplify the development of high-performance, scalable, distributed algorithms. Our initial results show that, relative to existing systems, this interface can be used to build distributed implementations of a wide variety of common Machine Learning algorithms with minimal complexity and highly competitive performance and scalability.

preprint2012arXiv

CrowdER: Crowdsourcing Entity Resolution

Entity resolution is central to data integration and data cleaning. Algorithmic approaches have been improving in quality, but remain far from perfect. Crowdsourcing platforms offer a more accurate but expensive (and slow) way to bring human insight into the process. Previous work has proposed batching verification tasks for presentation to human workers but even with batching, a human-only approach is infeasible for data sets of even moderate size, due to the large numbers of matches to be tested. Instead, we propose a hybrid human-machine approach in which machines are used to do an initial, coarse pass over all the data, and people are used to verify only the most likely matching pairs. We show that for such a hybrid system, generating the minimum number of verification tasks of a given size is NP-Hard, but we develop a novel two-tiered heuristic approach for creating batched tasks. We describe this method, and present the results of extensive experiments on real data sets using a popular crowdsourcing platform. The experiments show that our hybrid approach achieves both good efficiency and high accuracy compared to machine-only or human-only alternatives.

preprint2012arXiv

Getting It All from the Crowd

Hybrid human/computer systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many database system implementation questions. Perhaps most fundamental is that the closed world assumption underlying relational query semantics does not hold in such systems. As a consequence the meaning of even simple queries can be called into question. Furthermore query progress monitoring becomes difficult due to non-uniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about tradeoffs between time/cost and completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.

preprint2012arXiv

MDCC: Multi-Data Center Consistency

Replicating data across multiple data centers not only allows moving the data closer to the user and, thus, reduces latency for applications, but also increases the availability in the event of a data center failure. Therefore, it is not surprising that companies like Google, Yahoo, and Netflix already replicate user data across geographically different regions. However, replication across data centers is expensive. Inter-data center network delays are in the hundreds of milliseconds and vary significantly. Synchronous wide-area replication is therefore considered to be unfeasible with strong consistency and current solutions either settle for asynchronous replication which implies the risk of losing data in the event of failures, restrict consistency to small partitions, or give up consistency entirely. With MDCC (Multi-Data Center Consistency), we describe the first optimistic commit protocol, that does not require a master or partitioning, and is strongly consistent at a cost similar to eventually consistent protocols. MDCC can commit transactions in a single round-trip across data centers in the normal operational case. We further propose a new programming model which empowers the application developer to handle longer and unpredictable latencies caused by inter-data center communication. Our evaluation using the TPC-W benchmark with MDCC deployed across 5 geographically diverse data centers shows that MDCC is able to achieve throughput and latency similar to eventually consistent quorum protocols and that MDCC is able to sustain a data center outage without a significant impact on response times while guaranteeing strong consistency.

preprint2011arXiv

PIQL: Success-Tolerant Query Processing in the Cloud

Newly-released web applications often succumb to a "Success Disaster," where overloaded database machines and resulting high response times destroy a previously good user experience. Unfortunately, the data independence provided by a traditional relational database system, while useful for agile development, only exacerbates the problem by hiding potentially expensive queries under simple declarative expressions. As a result, developers of these applications are increasingly abandoning relational databases in favor of imperative code written against distributed key/value stores, losing the many benefits of data independence in the process. Instead, we propose PIQL, a declarative language that also provides scale independence by calculating an upper bound on the number of key/value store operations that will be performed for any query. Coupled with a service level objective (SLO) compliance prediction model and PIQL's scalable database architecture, these bounds make it easy for developers to write success-tolerant applications that support an arbitrarily large number of users while still providing acceptable performance. In this paper, we present the PIQL query processing system and evaluate its scale independence on hundreds of machines using two benchmarks, TPC-W and SCADr.