Source author record

Harsha Vardhan Simhadri

Harsha Vardhan Simhadri appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Distributed, Parallel, and Cluster Computing; Machine Learning; Data Structures and Algorithms; Databases; Performance

Catalog footprint

What is connected

4 works

5 topics

4 close collaborators

preprint, 2022, arXiv

Results of the NeurIPS'21 Challenge on Billion-Scale Approximate Nearest Neighbor Search

Despite the broad range of algorithms for Approximate Nearest Neighbor Search (ANNS), most empirical evaluations of algorithms have focused on smaller datasets, typically of 1 million points~\citep{Benchmark}. However, deploying recent advances in embedding-based techniques for search, recommendation and ranking at scale requires ANNS indices at billion, trillion or larger scale. Barring a few recent papers, there is limited consensus on which algorithms are effective at this scale vis-à-vis their hardware cost. This competition compares ANNS algorithms at billion-scale by hardware cost, accuracy and performance. We set up an open source evaluation framework and leaderboards for both standardized and specialized hardware. The competition involves three tracks. The standard hardware track T1 evaluates algorithms on an Azure VM with limited DRAM, often the bottleneck in serving billion-scale indices, where the embedding data can be hundreds of gigabytes in size. It uses FAISS~\citep{Faiss17} as the baseline. The standard hardware track T2 additionally allows inexpensive SSDs alongside the limited DRAM and uses DiskANN~\citep{DiskANN19} as the baseline. The specialized hardware track T3 allows any hardware configuration, and again uses FAISS as the baseline. We compiled six diverse billion-scale datasets, four newly released for this competition, that span a variety of modalities, data types, dimensions, deep learning models, distance functions and sources. The outcome of the competition was ranked leaderboards of algorithms in each track based on recall at a query throughput threshold. Additionally, for track T3, separate leaderboards were created based on recall as well as cost-normalized and power-normalized query throughput.
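
The ranking criterion in this abstract, recall at a query throughput threshold, can be illustrated with a small sketch. It assumes the common recall@k definition (the fraction of true k nearest neighbors that the index returns) and a hypothetical threshold of 10,000 queries per second; the abstract states neither k nor the threshold, and the function names and toy data are illustrative only, not taken from the competition framework.

```python
from typing import Iterable, Optional, Sequence, Tuple


def recall_at_k(truth: Sequence[Sequence[int]],
                found: Sequence[Sequence[int]],
                k: int) -> float:
    """Average fraction of the true k nearest neighbors that were returned."""
    hits = 0
    for true_ids, found_ids in zip(truth, found):
        hits += len(set(true_ids[:k]) & set(found_ids[:k]))
    return hits / (k * len(truth))


def best_recall_at_threshold(points: Iterable[Tuple[float, float]],
                             min_qps: float) -> Optional[float]:
    """Highest recall among (recall, qps) operating points that meet min_qps."""
    eligible = [recall for recall, qps in points if qps >= min_qps]
    return max(eligible) if eligible else None


# Hypothetical toy data: 2 queries, k = 3.
truth = [[1, 2, 3], [4, 5, 6]]
found = [[1, 3, 9], [4, 5, 6]]
toy_recall = recall_at_k(truth, found, k=3)  # 5 of the 6 true neighbors found

# Hypothetical (recall, queries-per-second) sweep for one algorithm.
sweep = [(0.99, 4_000.0), (0.95, 11_000.0), (0.90, 25_000.0)]
score = best_recall_at_threshold(sweep, min_qps=10_000.0)
```

Under this reading, an algorithm is scored by the most accurate configuration that is still fast enough, so configurations that are more accurate but fall below the throughput threshold do not count.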

preprint, 2020, arXiv

DROCC: Deep Robust One-Class Classification

Classical approaches for one-class problems such as one-class SVM and isolation forest require careful feature engineering when applied to structured domains like images. State-of-the-art methods aim to leverage deep learning to learn appropriate features via two main approaches. The first approach, based on predicting transformations (Golan & El-Yaniv, 2018; Hendrycks et al., 2019a), while successful in some domains, crucially depends on an appropriate domain-specific set of transformations that are hard to obtain in general. The second approach, minimizing a classical one-class loss on the learned final layer representations, e.g., DeepSVDD (Ruff et al., 2018), suffers from the fundamental drawback of representation collapse. In this work, we propose Deep Robust One-Class Classification (DROCC) that is both applicable to most standard domains without requiring any side-information and robust to representation collapse. DROCC is based on the assumption that the points from the class of interest lie on a well-sampled, locally linear low dimensional manifold. Empirical evaluation demonstrates that DROCC is highly effective in two different one-class problem settings and on a range of real-world datasets across different domains: tabular data, images (CIFAR and ImageNet), audio, and time-series, offering up to 20% increase in accuracy over the state-of-the-art in anomaly detection. Code is available at https://github.com/microsoft/EdgeML.

preprint, 2016, arXiv

Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "$\parallel$" (parallel) and "$;$" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct "$\leadsto$" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic-programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e., Parallel Memory Hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is $O\left(\frac{\sum_{i=0}^{h-1} Q^{*}({\mathsf t};\sigma\cdot M_i)\cdot C_i}{p}\right)$, where $Q^{*}$ is the cache complexity of task ${\mathsf t}$, $C_i$ is the cost of a cache miss at the level-$i$ cache, which is of size $M_i$, $\sigma\in(0,1)$ is a constant, and $p$ is the number of processors in an $h$-level cache hierarchy.
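
To make the running-time expression in this abstract concrete, the sketch below evaluates the sum inside the $O(\cdot)$ for a given hierarchy, ignoring the hidden constant. The helper names, the 2-level hierarchy and its numbers are hypothetical, and the cache complexity $n^3/(B\sqrt{M})$ used for matrix multiplication is an assumed textbook form (with $B$ a cache-line size that the abstract does not mention), not a figure quoted from the paper.

```python
import math
from typing import Callable, Sequence


def nd_running_time_bound(q_star: Callable[[float], float],
                          cache_sizes: Sequence[float],
                          miss_costs: Sequence[float],
                          sigma: float,
                          p: int) -> float:
    """Evaluate sum_i Q*(t; sigma * M_i) * C_i / p, ignoring the O() constant."""
    if not 0.0 < sigma < 1.0:
        raise ValueError("sigma must lie in (0, 1)")
    if len(cache_sizes) != len(miss_costs):
        raise ValueError("need one miss cost per cache level")
    total = sum(q_star(sigma * m) * c for m, c in zip(cache_sizes, miss_costs))
    return total / p


def matmul_misses(n: int, block: int) -> Callable[[float], float]:
    """Assumed cache complexity n^3 / (B * sqrt(M)) for n x n matrix multiply."""
    return lambda m: n ** 3 / (block * math.sqrt(m))


# Hypothetical 2-level hierarchy: cache sizes in words, miss costs in cycles.
sizes, costs = [2 ** 15, 2 ** 21], [10, 100]
bound_8 = nd_running_time_bound(matmul_misses(1024, 8), sizes, costs, 0.5, 8)
bound_16 = nd_running_time_bound(matmul_misses(1024, 8), sizes, costs, 0.5, 16)
```

Because $p$ appears only in the denominator, doubling the processor count halves the evaluated expression; the abstract's claim is that ND programs keep this bound valid on a greater number of processors than NP programs do.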

preprint, 2015, arXiv

Using Symmetry to Schedule Classical Matrix Multiplication

Presented with a new machine with a specific interconnect topology, algorithm designers use intuition about the symmetry of the algorithm to design time- and communication-efficient schedules that map the algorithm to the machine. Is there a systematic procedure for designing schedules? We present a new technique to design schedules for algorithms with no non-trivial dependencies, focusing on the classical matrix multiplication algorithm. We model the symmetry of an algorithm with the set of instructions $X$ as the action of the group formed by the compositions of bijections from the set $X$ to itself. We model the machine as the action of the group $N\times \Delta$, where $N$ and $\Delta$ represent the interconnect topology and time increments respectively, on the set $P\times T$ of processors iterated over time steps. We model schedules as symmetry-preserving equivariant maps between the set $X$ and a subgroup of its symmetry and the set $P\times T$ with the symmetry $N\times \Delta$. Such equivariant maps are the solutions of a set of algebraic equations involving group homomorphisms. We associate time and communication costs with the solutions to these equations. We solve these equations for the classical matrix multiplication algorithm and show that equivariant maps correspond to time- and communication-efficient schedules for many topologies. We recover well-known variants including Cannon's algorithm and the communication-avoiding "2.5D" algorithm for toroidal interconnects, systolic computation for planar hexagonal VLSI arrays, recursive algorithms for fat-trees, the cache-oblivious algorithm for the ideal cache model, and the space-bounded schedule for the parallel memory hierarchy model. This suggests that the design of a schedule for a new class of machines can be motivated by solutions to algebraic equations.
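
Cannon's algorithm, one of the schedules this abstract says it recovers, can be sketched as a sequential simulation of a $q\times q$ torus. This is the textbook algorithm with one matrix entry per simulated processor, not the paper's group-theoretic derivation; the function name and the list-of-lists layout are choices made here for illustration.

```python
from typing import List

Matrix = List[List[float]]


def cannon_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Simulate Cannon's algorithm on a q x q torus, one entry per processor."""
    q = len(a)
    # Initial skew: row i of A rotates left by i, column j of B rotates up by j.
    a_loc = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b_loc = [[b[(i + j) % q][j] for j in range(q)] for i in range(q)]
    c = [[0.0] * q for _ in range(q)]
    for _ in range(q):
        # Every processor (i, j) multiplies the pair of entries it holds now.
        for i in range(q):
            for j in range(q):
                c[i][j] += a_loc[i][j] * b_loc[i][j]
        # Nearest-neighbor shifts: A moves one step left, B one step up.
        a_loc = [[a_loc[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b_loc = [[b_loc[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return c


product = cannon_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
```

Each simulated processor only ever exchanges data with its left and upper neighbors on the torus, which is the kind of topology-respecting schedule the abstract associates with equivariant maps.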