Source author record

Vikram Nathan

Vikram Nathan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Databases Computational Complexity Data Structures and Algorithms

Catalog footprint

What is connected

4works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

Tsunami: A Learned Multi-dimensional Index for Correlated Data and Skewed Workloads

Filtering data based on predicates is one of the most fundamental operations for any modern data warehouse. Techniques to accelerate the execution of filter expressions include clustered indexes, specialized sort orders (e.g., Z-order), multi-dimensional indexes, and, for high selectivity queries, secondary indexes. However, these schemes are hard to tune and their performance is inconsistent. Recent work on learned multi-dimensional indexes has introduced the idea of automatically optimizing an index for a particular dataset and workload. However, the performance of that work suffers in the presence of correlated data and skewed query workloads, both of which are common in real applications. In this paper, we introduce Tsunami, which addresses these limitations to achieve up to 6X faster query performance and up to 8X smaller index size than existing learned multi-dimensional indexes, in addition to up to 11X faster query performance and 170X smaller index size than optimally-tuned traditional indexes.

preprint2019arXiv

Learning Multi-dimensional Indexes

Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries. In this paper, we introduce Flood, a multi-dimensional in-memory index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system.

preprint2016arXiv

Hardness of Peeling with Stashes

The analysis of several algorithms and data structures can be framed as a peeling process on a random hypergraph: vertices with degree less than k and their adjacent edges are removed until no vertices of degree less than k are left. Often the question is whether the remaining hypergraph, the k-core, is empty or not. In some settings, it may be possible to remove either vertices or edges from the hypergraph before peeling, at some cost. For example, in hashing applications where keys correspond to edges and buckets to vertices, one might use an additional side data structure, commonly referred to as a stash, to separately handle some keys in order to avoid collisions. The natural question in such cases is to find the minimum number of edges (or vertices) that need to be stashed in order to realize an empty k-core. We show that both these problems are NP-complete for all $k \geq 2$ on graphs and regular hypergraphs, with the sole exception being that the edge variant of stashing is solvable in polynomial time for $k = 2$ on standard (2-uniform) graphs.

preprint2014arXiv

Accurate Streaming Support Vector Machines

A widely-used tool for binary classification is the Support Vector Machine (SVM), a supervised learning technique that finds the "maximum margin" linear separator between the two classes. While SVMs have been well studied in the batch (offline) setting, there is considerably less work on the streaming (online) setting, which requires only a single pass over the data using sub-linear space. Existing streaming algorithms are not yet competitive with the batch implementation. In this paper, we use the formulation of the SVM as a minimum enclosing ball (MEB) problem to provide a streaming SVM algorithm based off of the blurred ball cover originally proposed by Agarwal and Sharathkumar. Our implementation consistently outperforms existing streaming SVM approaches and provides higher accuracies than libSVM on several datasets, thus making it competitive with the standard SVM batch implementation.