Source author record

David García-Soriano

David García-Soriano appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Machine Learning

Catalog footprint

What is connected

4 works

3 topics

4 close collaborators


preprint, 2020, arXiv

Fair-by-design matching

Matching algorithms are used routinely to match donors to recipients for solid organ transplantation, for the assignment of medical residents to hospitals, record linkage in databases, scheduling jobs on machines, network switching, online advertising, and image recognition, among others. Although many optimal solutions may exist to a given matching problem, when the elements that may or may not be included in a solution correspond to individuals, it becomes of paramount importance that the solution be selected fairly. In this paper we study individual fairness in matching problems. Given that many maximum matchings may exist, each one satisfying a different set of individuals, the only way to guarantee fairness is through randomization. Hence we introduce the distributional maxmin fairness framework which provides, for any given input instance, the strongest guarantee possible simultaneously for all individuals in terms of satisfaction probability (the probability of being matched in the solution). Specifically, a probability distribution over feasible solutions is maxmin-fair if it is not possible to improve the satisfaction probability of any individual without decreasing it for some other individual who is no better off. In the special case of matchings in bipartite graphs, our framework is equivalent to the egalitarian mechanism of Bogomolnaia and Moulin. Our main contribution is a polynomial-time algorithm for fair matching building on techniques from minimum cuts, edge-coloring algorithms for regular bipartite graphs, and transversal theory. For bipartite graphs, our algorithm runs in $O((|V|^2 + |E||V|^{2/3}) \cdot (\log |V|)^2)$ expected time and scales to graphs with tens of millions of vertices and hundreds of millions of edges. To the best of our knowledge, this provides the first large-scale implementation of the egalitarian mechanism.
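The notion of satisfaction probability can be made concrete on a toy instance. The sketch below is only an illustration and is not the paper's algorithm: the graph (individuals `a`, `b`, `c`; slots `x`, `y`; edges a-x, b-x, c-y), the helper names `satisfaction_probabilities` and `leximin_key`, and the use of a sorted-vector comparison as a proxy for the maxmin-fair definition are all assumptions made for this example.

```python
from fractions import Fraction
from typing import Dict, FrozenSet, Iterable, List, Tuple

# A matching is a set of (individual, slot) pairs.
Matching = FrozenSet[Tuple[str, str]]


def satisfaction_probabilities(
    distribution: Dict[Matching, Fraction], individuals: Iterable[str]
) -> Dict[str, Fraction]:
    """Probability that each individual is matched under the distribution."""
    probs = {v: Fraction(0) for v in individuals}
    for matching, weight in distribution.items():
        for individual, _slot in matching:
            probs[individual] += weight
    return probs


def leximin_key(probs: Dict[str, Fraction]) -> List[Fraction]:
    """Sorted probability vector; a larger key (lexicographically) favors the worst-off."""
    return sorted(probs.values())


# Hypothetical instance: a and b compete for slot x, c alone wants slot y.
individuals = ["a", "b", "c"]
match_a = frozenset({("a", "x"), ("c", "y")})
match_b = frozenset({("b", "x"), ("c", "y")})

# Randomizing evenly between the two maximum matchings...
randomized = {match_a: Fraction(1, 2), match_b: Fraction(1, 2)}
# ...versus always returning the same maximum matching.
deterministic = {match_a: Fraction(1)}

fair = satisfaction_probabilities(randomized, individuals)
unfair = satisfaction_probabilities(deterministic, individuals)
print(fair, unfair, leximin_key(fair) > leximin_key(unfair))
```

In this instance the randomized distribution gives `a` and `b` probability 1/2 each and `c` probability 1. Raising the probability of `a` would lower that of `b`, who is no better off, so the distribution satisfies the definition quoted above. The deterministic choice leaves `b` at 0 even though `b` could be improved at the expense of `a`, who is strictly better off.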

preprint, 2020, arXiv

Query-Efficient Correlation Clustering

Correlation clustering is arguably the most natural formulation of clustering. Given $n$ objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities. This is often infeasible to compute or even just to store. In this paper we study query-efficient algorithms for correlation clustering. Specifically, we devise a correlation clustering algorithm that, given a budget of $Q$ queries, attains a solution whose expected number of disagreements is at most $3\cdot OPT + O(\frac{n^3}{Q})$, where $OPT$ is the optimal cost for the instance. It runs in $O(Q)$ time and can easily be made non-adaptive (meaning it can specify all its queries at the outset and make them in parallel) with the same guarantees. Up to constant factors, our algorithm yields a provably optimal trade-off between the number of queries $Q$ and the worst-case error attained, even for adaptive algorithms. Finally, we perform an experimental study of our proposed method on both synthetic and real data, showing the scalability and the accuracy of our algorithm.
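One way to picture clustering under a query budget is a randomized pivot procedure that stops querying once the budget would be exceeded and leaves the remaining objects as singletons. The sketch below follows that idea; it is an assumption-laden illustration, not necessarily the exact procedure analyzed in the paper. The function name `budgeted_pivot_clustering` and the `similar` oracle are hypothetical.

```python
import random
from typing import Callable, Hashable, List, Optional, Sequence, Tuple


def budgeted_pivot_clustering(
    items: Sequence[Hashable],
    similar: Callable[[Hashable, Hashable], bool],
    budget: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[Hashable]], int]:
    """Pivot-style clustering that issues at most `budget` similarity queries.

    Returns the clusters and the number of queries actually made.
    """
    rng = rng or random.Random()
    remaining = list(items)
    rng.shuffle(remaining)  # taking the first element is now a uniform random pivot
    clusters: List[List[Hashable]] = []
    queries = 0

    while remaining:
        pivot, others = remaining[0], remaining[1:]
        if queries + len(others) > budget:
            # Not enough budget for a full pivot round:
            # every unclustered object becomes its own cluster.
            clusters.extend([x] for x in remaining)
            break
        cluster, rest = [pivot], []
        for x in others:
            queries += 1  # one oracle call per (pivot, x) pair
            (cluster if similar(pivot, x) else rest).append(x)
        clusters.append(cluster)
        remaining = rest  # relative order of a shuffled list stays uniformly random

    return clusters, queries


# Usage with a hypothetical oracle: objects are similar iff they share a group id.
group = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
clusters, used = budgeted_pivot_clustering(
    list(group), lambda u, v: group[u] == group[v], budget=10, rng=random.Random(0)
)
print(clusters, used)
```

With a generous budget the procedure behaves like the plain pivot algorithm. As the budget shrinks, more objects fall into the singleton fallback, which is one intuitive source of an additive error term that grows as $Q$ decreases.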

preprint, 2015, arXiv

The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines

We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.
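The two ideas named in the abstract, key splitting and local load estimation, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' Storm implementation: the class `PartialKeyGrouper`, the helper `seeded_hash`, and the choice of SHA-256 as the hash are all hypothetical. Each key has two candidate workers (key splitting), and each source sends a message to whichever candidate it has itself sent fewer messages to so far (local load estimation).

```python
import hashlib
from typing import Callable, List, Optional


def seeded_hash(seed: int) -> Callable[[str], int]:
    """Return a deterministic hash function; different seeds give different functions."""
    def h(key: str) -> int:
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return h


class PartialKeyGrouper:
    """Routes each message to the less loaded of two hash-chosen workers.

    `local_load` counts only messages sent by this source, so no
    coordination with other sources or with the workers is needed.
    """

    def __init__(
        self,
        num_workers: int,
        h1: Optional[Callable[[str], int]] = None,
        h2: Optional[Callable[[str], int]] = None,
    ) -> None:
        self.num_workers = num_workers
        self.h1 = h1 or seeded_hash(1)
        self.h2 = h2 or seeded_hash(2)
        self.local_load: List[int] = [0] * num_workers

    def route(self, key: str) -> int:
        first = self.h1(key) % self.num_workers
        second = self.h2(key) % self.num_workers
        # Power of two choices: pick the candidate with the smaller local estimate.
        target = first if self.local_load[first] <= self.local_load[second] else second
        self.local_load[target] += 1
        return target


# Usage: a heavily skewed stream where one key dominates.
grouper = PartialKeyGrouper(num_workers=4)
for key in ["hot"] * 90 + [f"cold-{i}" for i in range(10)]:
    grouper.route(key)
print(grouper.local_load)
```

Because a key can land on two workers, any per-key state (for example a running count) is split across them and has to be merged downstream. That is the trade-off against plain key grouping, where each key maps to exactly one worker.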

preprint, 2013, arXiv

Local correlation clustering

Correlation clustering is perhaps the most natural formulation of clustering. Given $n$ objects and a pairwise similarity measure, the goal is to cluster the objects so that, to the best possible extent, similar objects are put in the same cluster and dissimilar objects are put in different clusters. Despite its theoretical appeal, the practical relevance of correlation clustering remains largely unexplored, mainly because correlation clustering requires the $\Theta(n^2)$ pairwise similarities as input. In this paper we initiate the investigation into local algorithms for correlation clustering. In local correlation clustering we are given the identifier of a single object and we want to return the cluster to which it belongs in some globally consistent near-optimal clustering, using a small number of similarity queries. Local algorithms for correlation clustering open the door to sublinear-time algorithms, which are particularly useful when the similarity between items is costly to compute, as is often the case in many practical application domains. They also imply (i) distributed and streaming clustering algorithms, (ii) constant-time estimators and testers for cluster edit distance, and (iii) property-preserving parallel reconstruction algorithms for clusterability. Specifically, we devise a local clustering algorithm attaining a $(3, \varepsilon)$-approximation in time $O(1/\varepsilon^2)$, independent of the dataset size. An explicit approximate clustering for all objects can be produced in time $O(n/\varepsilon)$ (which is provably optimal). We also provide a fully additive $(1,\varepsilon)$-approximation with local query complexity $poly(1/\varepsilon)$ and time complexity $2^{poly(1/\varepsilon)}$. The latter yields the fastest polynomial-time approximation scheme for correlation clustering known to date.