Source author record

Suman K. Bera

Suman K. Bera appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Information Retrieval Machine Learning

Catalog footprint

What is connected

5works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A New Dynamic Algorithm for Densest Subhypergraphs

Computing a dense subgraph is a fundamental problem in graph mining, with a diverse set of applications ranging from electronic commerce to community detection in social networks. In many of these applications, the underlying context is better modelled as a weighted hypergraph that keeps evolving with time. This motivates the problem of maintaining the densest subhypergraph of a weighted hypergraph in a {\em dynamic setting}, where the input keeps changing via a sequence of updates (hyperedge insertions/deletions). Previously, the only known algorithm for this problem was due to Hu et al. [HWC17]. This algorithm worked only on unweighted hypergraphs, and had an approximation ratio of $(1+ε)r^2$ and an update time of $O(\text{poly} (r, \log n))$, where $r$ denotes the maximum rank of the input across all the updates. We obtain a new algorithm for this problem, which works even when the input hypergraph is weighted. Our algorithm has a significantly improved (near-optimal) approximation ratio of $(1+ε)$ that is independent of $r$, and a similar update time of $O(\text{poly} (r, \log n))$. It is the first $(1+ε)$-approximation algorithm even for the special case of weighted simple graphs. To complement our theoretical analysis, we perform experiments with our dynamic algorithm on large-scale, real-world data-sets. Our algorithm significantly outperforms the state of the art [HWC17] both in terms of accuracy and efficiency.

preprint2020arXiv

Distributional Individual Fairness in Clustering

In this paper, we initiate the study of fair clustering that ensures distributional similarity among similar individuals. In response to improving fairness in machine learning, recent papers have investigated fairness in clustering algorithms and have focused on the paradigm of statistical parity/group fairness. These efforts attempt to minimize bias against some protected groups in the population. However, to the best of our knowledge, the alternative viewpoint of individual fairness, introduced by Dwork et al. (ITCS 2012) in the context of classification, has not been considered for clustering so far. Similar to Dwork et al., we adopt the individual fairness notion which mandates that similar individuals should be treated similarly for clustering problems. We use the notion of $f$-divergence as a measure of statistical similarity that significantly generalizes the ones used by Dwork et al. We introduce a framework for assigning individuals, embedded in a metric space, to probability distributions over a bounded number of cluster centers. The objective is to ensure (a) low cost of clustering in expectation and (b) individuals that are close to each other in a given fairness space are mapped to statistically similar distributions. We provide an algorithm for clustering with $p$-norm objective ($k$-center, $k$-means are special cases) and individual fairness constraints with provable approximation guarantee. We extend this framework to include both group fairness and individual fairness inside the protected groups. Finally, we observe conditions under which individual fairness implies group fairness. We present extensive experimental evidence that justifies the effectiveness of our approach.

preprint2020arXiv

How the Degeneracy Helps for Triangle Counting in Graph Streams

We revisit the well-studied problem of triangle count estimation in graph streams. Given a graph represented as a stream of $m$ edges, our aim is to compute a $(1\pm\varepsilon)$-approximation to the triangle count $T$, using a small space algorithm. For arbitrary order and a constant number of passes, the space complexity is known to be essentially $Θ(\min(m^{3/2}/T, m/\sqrt{T}))$ (McGregor et al., PODS 2016, Bera et al., STACS 2017). We give a (constant pass, arbitrary order) streaming algorithm that can circumvent this lower bound for \emph{low degeneracy graphs}. The degeneracy, $κ$, is a nuanced measure of density, and the class of constant degeneracy graphs is immensely rich (containing planar graphs, minor-closed families, and preferential attachment graphs). We design a streaming algorithm with space complexity $\widetilde{O}(mκ/T)$. For constant degeneracy graphs, this bound is $\widetilde{O}(m/T)$, which is significantly smaller than both $m^{3/2}/T$ and $m/\sqrt{T}$. We complement our algorithmic result with a nearly matching lower bound of $Ω(mκ/T)$.

preprint2020arXiv

How to Count Triangles, without Seeing the Whole Graph

Triangle counting is a fundamental problem in the analysis of large graphs. There is a rich body of work on this problem, in varying streaming and distributed models, yet all these algorithms require reading the whole input graph. In many scenarios, we do not have access to the whole graph, and can only sample a small portion of the graph (typically through crawling). In such a setting, how can we accurately estimate the triangle count of the graph? We formally study triangle counting in the {\em random walk} access model introduced by Dasgupta et al (WWW '14) and Chierichetti et al (WWW '16). We have access to an arbitrary seed vertex of the graph, and can only perform random walks. This model is restrictive in access and captures the challenges of collecting real-world graphs. Even sampling a uniform random vertex is a hard task in this model. Despite these challenges, we design a provable and practical algorithm, TETRIS, for triangle counting in this model. TETRIS is the first provably sublinear algorithm (for most natural parameter settings) that approximates the triangle count in the random walk model, for graphs with low mixing time. Our result builds on recent advances in the theory of sublinear algorithms. The final sample built by TETRIS is a careful mix of random walks and degree-biased sampling of neighborhoods. Empirically, TETRIS accurately counts triangles on a variety of large graphs, getting estimates within 5\% relative error by looking at 3\% of the number of edges.

preprint2012arXiv

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams

Applications involving telecommunication call data records, web pages, online transactions, medical records, stock markets, climate warning systems, etc., necessitate efficient management and processing of such massively exponential amount of data from diverse sources. De-duplication or Intelligent Compression in streaming scenarios for approximate identification and elimination of duplicates from such unbounded data stream is a greater challenge given the real-time nature of data arrival. Stable Bloom Filters (SBF) addresses this problem to a certain extent. . In this work, we present several novel algorithms for the problem of approximate detection of duplicates in data streams. We propose the Reservoir Sampling based Bloom Filter (RSBF) combining the working principle of reservoir sampling and Bloom Filters. We also present variants of the novel Biased Sampling based Bloom Filter (BSBF) based on biased sampling concepts. We also propose a randomized load balanced variant of the sampling Bloom Filter approach to efficiently tackle the duplicate detection. In this work, we thus provide a generic framework for de-duplication using Bloom Filters. Using detailed theoretical analysis we prove analytical bounds on the false positive rate, false negative rate and convergence rate of the proposed structures. We exhibit that our models clearly outperform the existing methods. We also demonstrate empirical analysis of the structures using real-world datasets (3 million records) and also with synthetic datasets (1 billion records) capturing various input distributions.

Suman K. Bera

What is connected

Connect this record

See the researcher in context

Building this map preview

5 published item(s)

A New Dynamic Algorithm for Densest Subhypergraphs

Distributional Individual Fairness in Clustering

How the Degeneracy Helps for Triangle Counting in Graph Streams

How to Count Triangles, without Seeing the Whole Graph

Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams