Source author record

Souvik Bhattacherjee

Souvik Bhattacherjee appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Databases Information Retrieval

Catalog footprint

What is connected

6works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2015arXiv

Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of $versions$ of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the $storage-recreation\;trade-off$: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been a surprisingly little amount of work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics drawing from techniques in delay-constrained scheduling, and spanning tree literature, to solve these problems. We have built a prototype version management system, that aims to serve as a foundation to our DATAHUB system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.

preprint2014arXiv

DataHub: Collaborative Data Science & Dataset Version Management at Scale

Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.

preprint2012arXiv

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams

Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent. . In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.

preprint2011arXiv

Multidimensional Balanced Allocation for Multiple Choice & (1 + Beta) Processes

Allocation of balls into bins is a well studied abstraction for load balancing problems.The literature hosts numerous results for sequential(single dimensional) allocation case when m balls are thrown into n bins. In this paper we study the symmetric multiple choice process for both unweighted and weighted balls as well as for both multidimensional and scalar models.Additionally,we present the results on bounds on gap for (1+beta) choice process with multidimensional balls and bins. We show that for the symmetric d choice process and with m=O(n), the upper bound on the gap is O(lnln(n)) w.h.p.This upper bound on the gap is within D=f factor of the lower bound. This is the first such tight result.For the general case of m>>n the expected gap is bounded by O(lnln(n)).For variable f and non-uniform distribution of the populated dimensions,we obtain the upper bound on the expected gap as O(log(n)). Further,for the multiple round parallel balls and bins,we show that the gap is also bounded by O(loglog(n)) for m=O(n).The same bound holds for the expected gap when m>>n. Our analysis also has strong implications in the sequential scalar case.For the weighted balls and bins and general case m>>n,we show that the upper bound on the expected gap is O(log(n)) which improves upon the best prior bound of n^c.Moreover,we show that for the (1 + beta) choice process and m=O(n) the upper bound(assuming uniform distribution of f populated dimensions over D total dimensions) on the gap is O(log(n)/beta),which is within D=f factor of the lower bound.For fixed f with non-uniform distribution and for random f with Binomial distribution the expected gap remains O(log(n)/beta) independent of the total number of balls thrown. This is the first such tight result for (1 +beta) paradigm with multidimensional balls and bins.

preprint2011arXiv

Perfectly Balanced Allocation With Estimated Average Using Expected Constant Retries

Balanced allocation of online balls-into-bins has long been an active area of research for efficient load balancing and hashing applications.There exists a large number of results in this domain for different settings, such as parallel allocations~\cite{parallel}, multi-dimensional allocations~\cite{multi}, weighted balls~\cite{weight} etc. For sequential multi-choice allocation, where $m$ balls are thrown into $n$ bins with each ball choosing $d$ (constant) bins independently uniformly at random, the maximum load of a bin is $O(\log \log n) + m/n$ with high probability~\cite{heavily_load}. This offers the current best known allocation scheme. However, for $d = Θ(\log n)$, the gap reduces to $O(1)$~\cite{soda08}.A similar constant gap bound has been established for parallel allocations with $O(\log ^*n)$ communication rounds~\cite{lenzen}. In this paper we propose a novel multi-choice allocation algorithm, \emph{Improved D-choice with Estimated Average} ($IDEA$) achieving a constant gap with a high probability for the sequential single-dimensional online allocation problem with constant $d$. We achieve a maximum load of $\lceil m/n \rceil$ with high probability for constant $d$ choice scheme with \emph{expected} constant number of retries or rounds per ball. We also show that the bound holds even for an arbitrary large number of balls, $m>>n$. Further, we generalize this result to (i)~the weighted case, where balls have weights drawn from an arbitrary weight distribution with finite variance, (ii)~multi-dimensional setting, where balls have $D$ dimensions with $f$ randomly and uniformly chosen filled dimension for $m=n$, and (iii)~the parallel case, where $n$ balls arrive and are placed parallely in the bins. We show that the gap in these case is also a constant w.h.p. (independent of $m$) for constant value of $d$ with expected constant number of retries per ball.

preprint2011arXiv

Towards "Intelligent Compression" in Streams: A Biased Reservoir Sampling based Bloom Filter Approach

With the explosion of information stored world-wide,data intensive computing has become a central area of research.Efficient management and processing of this massively exponential amount of data from diverse sources,such as telecommunication call data records,online transaction records,etc.,has become a necessity.Removing redundancy from such huge(multi-billion records) datasets resulting in resource and compute efficiency for downstream processing constitutes an important area of study. "Intelligent compression" or deduplication in streaming scenarios,for precise identification and elimination of duplicates from the unbounded datastream is a greater challenge given the realtime nature of data arrival.Stable Bloom Filters(SBF) address this problem to a certain extent.However,SBF suffers from a high false negative rate(FNR) and slow convergence rate,thereby rendering it inefficient for applications with low FNR tolerance.In this paper, we present a novel Reservoir Sampling based Bloom Filter,(RSBF) data structure,based on the combined concepts of reservoir sampling and Bloom filters for approximate detection of duplicates in data streams.Using detailed theoretical analysis we prove analytical bounds on its false positive rate(FPR),false negative rate(FNR) and convergence rates with low memory requirements.We show that RSBF offers the currently lowest FN and convergence rates,and are better than those of SBF while using the same memory.Using empirical analysis on real-world datasets(3 million records) and synthetic datasets with around 1 billion records,we demonstrate upto 2x improvement in FNR with better convergence rates as compared to SBF,while exhibiting comparable FPR.To the best of our knowledge,this is the first attempt to integrate reservoir sampling method with Bloom filters for deduplication in streaming scenarios.

Souvik Bhattacherjee

What is connected

Connect this record

See the researcher in context

Building this map preview

6 published item(s)

Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff

DataHub: Collaborative Data Science & Dataset Version Management at Scale

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams

Multidimensional Balanced Allocation for Multiple Choice & (1 + Beta) Processes

Perfectly Balanced Allocation With Estimated Average Using Expected Constant Retries

Towards "Intelligent Compression" in Streams: A Biased Reservoir Sampling based Bloom Filter Approach