Source author record

Marc Bury

Marc Bury appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Data Structures and Algorithms

Catalog footprint

What is connected

4 works

1 topic

2 close collaborators

Preprint, 2021, arXiv

Efficient Similarity Search in Dynamic Data Streams

The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs due to both time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $\frac{\vert A \cap B \vert}{\vert A\cup B\vert}$ of two item sets $A$ and $B$ is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015], who showed that any rational set similarity $S$ admits a locality sensitive hashing (LSH) scheme if and only if the corresponding distance $1-S$ is a metric, we can show that there exists a space efficient summary maintaining a $(1\pm \varepsilon)$ multiplicative approximation to $1-S$ in dynamic data streams. This in turn also yields an $\varepsilon$ additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply, an LSH scheme in dynamic data streams. Our second and main contribution now lies in the design of such an LSH scheme maintainable in dynamic data streams. The scheme is space efficient, easy to implement and to the best of our knowledge the first of its kind able to process deletions.
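For readers unfamiliar with min-hashing, the following is a minimal sketch of the classic insertion-only technique that the abstract uses as its starting point. It is not the deletion-tolerant scheme proposed in the paper. The function names, the use of salted BLAKE2b as a stand-in for random hash functions, and the default of 256 hash functions are all assumptions made for illustration.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two item sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def _hash(item, seed):
    # Salted 64-bit hash; an illustrative stand-in for a random hash function.
    digest = hashlib.blake2b(
        repr(item).encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """Keep the minimum hash value per seeded hash function (items must be non-empty)."""
    items = list(items)
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Two sets agree on a signature position with probability approximately equal to their Jaccard index, so the fraction of matching positions is an estimate of it. Because each position stores only a minimum, removing the item that attains that minimum cannot be undone from the signature alone, which is the limitation under deletions that the abstract points to.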

Preprint, 2020, arXiv

Random Projections for k-Means: Maintaining Coresets Beyond Merge & Reduce

We give a new construction for a small space summary satisfying the coreset guarantee of a data set with respect to the $k$-means objective function. The number of points required in an offline construction is in $\tilde{O}(k \varepsilon^{-2}\min(d,k\varepsilon^{-2}))$, which is minimal among all available constructions. Aside from two constructions with exponential dependence on the dimension, all known coresets are maintained in data streams via the merge and reduce framework, which incurs a large space dependency on $\log n$. Instead, our construction crucially relies on Johnson-Lindenstrauss type embeddings, which, combined with results from online algorithms, give us a new technique for efficiently maintaining coresets in data streams without relying on merge and reduce. The final number of points stored by our algorithm in a data stream is in $\tilde{O}(k^2 \varepsilon^{-2} \log^2 n \min(d,k\varepsilon^{-2}))$.
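As a rough illustration of the Johnson-Lindenstrauss type embeddings mentioned above, the sketch below projects points through a random Gaussian matrix and evaluates the $k$-means cost for a given set of centers. It shows only the embedding step under assumed names and parameters; it is not the coreset construction or the streaming algorithm from the paper.

```python
import math
import random


def random_projection_matrix(d, m, rng):
    """m x d matrix of N(0, 1) entries scaled by 1/sqrt(m)."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(m)]


def project(point, matrix):
    """Map a d-dimensional point to m dimensions."""
    return [sum(r * x for r, x in zip(row, point)) for row in matrix]


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)
```

With this scaling, the squared distance between two projected points equals the original squared distance in expectation, and it concentrates around that value as the target dimension `m` grows. Approximately preserved pairwise distances are what make it plausible to evaluate a cost such as `kmeans_cost` in the lower-dimensional space.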

Preprint, 2015, arXiv

OBDDs and (Almost) $k$-wise Independent Random Variables

OBDD-based graph algorithms deal with the characteristic function of the edge set $E$ of a graph $G = (V,E)$, which is represented by an OBDD, and solve optimization problems by mainly using functional operations. We present an OBDD-based algorithm which uses randomization for the first time. In particular, we give a maximal matching algorithm with $O(\log^3 \vert V \vert)$ functional operations in expectation. This algorithm may be of independent interest. The experimental evaluation shows that this algorithm outperforms known OBDD-based algorithms for the maximal matching problem. In order to use randomization, we investigate the OBDD complexity of $2^n$ (almost) $k$-wise independent binary random variables. We give an OBDD construction of size $O(n)$ for $3$-wise independent random variables and show a lower bound of $2^{\Omega(n)}$ on the OBDD size for $k \geq 4$. The best known lower bound was $\Omega(2^n/n)$ for $k \approx \log n$ due to Kabanets. We also give a very simple construction of $2^n$ $(\varepsilon, k)$-wise independent binary random variables by constructing a random OBDD of width $O(n k^2/\varepsilon)$.

Preprint, 2015, arXiv

Sublinear Estimation of Weighted Matchings in Dynamic Data Streams

This paper presents an algorithm for estimating the weight of a maximum weighted matching by augmenting any estimation routine for the size of an unweighted matching. The algorithm is implementable in any streaming model including dynamic graph streams. We also give the first constant-factor estimation for the maximum matching size in a dynamic graph stream for planar graphs (or any graph with bounded arboricity) using $\tilde{O}(n^{4/5})$ space, which also extends to weighted matching. Using previous results by Kapralov, Khanna, and Sudan (2014), we obtain a $\mathrm{polylog}(n)$ approximation for general graphs using $\mathrm{polylog}(n)$ space in random order streams. In addition, we give a space lower bound of $\Omega(n^{1-\varepsilon})$ for any randomized algorithm estimating the size of a maximum matching up to a $1+O(\varepsilon)$ factor for adversarial streams.
