Source author record

Chris Schwiegelshohn

Chris Schwiegelshohn appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms; Machine Learning; Computational Geometry; Computer Science and Game Theory; Cryptography and Security; cs.CY; Social and Information Networks

Catalog footprint

What is connected

13 works

7 topics

4 close collaborators


preprint · 2026 · arXiv

Profit Maximization in Bilateral Trade against a Smooth Adversary

Bilateral trade models the task of intermediating between two strategic agents, a seller and a buyer, who wish to trade a good. We study this problem from the perspective of a profit-maximizing broker within an online learning framework, where the agents' valuations are generated by a smooth adversary. We devise a learning algorithm that guarantees a $\tilde{O}(\sqrt{T})$ regret bound, which is tight in the time horizon $T$ up to poly-logarithmic factors. This matches the minimax rate for the stochastic i.i.d. case, and is also well separated from the adversarial setting, where sublinear regret is unattainable. By extending the strong regret guarantees from the i.i.d. case to the smooth adversary, we significantly broaden the scope of settings where such a fast rate is achievable, while closing an important gap in the regret landscape of this fundamental economic problem. To overcome the challenges posed by this adversary, we leverage a continuity property of smooth instances and combine this with a hierarchical net-construction of the broker's action space, which is analyzed via algorithmic chaining. We showcase the applicability of these techniques by deriving a similarly tight $\tilde{O}(\sqrt{T})$ regret bound for a related mechanism design model: the joint ads problem.

preprint · 2022 · arXiv

A New Coreset Framework for Clustering

Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the distances, raised to the power $z$, from every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known as coresets, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, including Euclidean spaces, doubling metrics, minor-free metrics, and general metrics.
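
As a rough illustration of the objective described in this abstract, the following Python sketch evaluates the $(k,z)$-clustering cost of a candidate set of centers. It assumes points in Euclidean space and optional per-point weights (as a weighted coreset would carry); the function name `kz_cost` and the toy data are hypothetical and do not come from the paper.

```python
import math

def kz_cost(points, centers, z, weights=None):
    """Sum over points of weight * (distance to the closest center) ** z.

    Assumption: Euclidean distance; the abstract allows any metric.
    """
    if weights is None:
        weights = [1.0] * len(points)
    total = 0.0
    for p, w in zip(points, weights):
        nearest = min(math.dist(p, c) for c in centers)
        total += w * nearest ** z
    return total

# Toy data (hypothetical): three points on a line, two candidate centers.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

k_median_cost = kz_cost(points, centers, z=1)  # z = 1 gives the k-median objective
k_means_cost = kz_cost(points, centers, z=2)   # z = 2 gives the k-means objective
```

The sketch only evaluates the cost of given centers; finding good centers, or building a coreset, is a separate and much harder task.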

preprint · 2022 · arXiv

An Empirical Evaluation of $k$-Means Coresets

Coresets are among the most popular paradigms for summarizing data. In particular, there exist many high-performance coresets for clustering problems such as $k$-means in both theory and practice. Curiously, there exists no work on comparing the quality of available $k$-means coresets. In this paper we perform such an evaluation. There is currently no known algorithm to measure the distortion of a candidate coreset. We provide some evidence as to why this might be computationally difficult. To complement this, we propose a benchmark for which we argue that computing coresets is challenging and which also allows for an easy (heuristic) evaluation of coresets. Using this benchmark and real-world data sets, we conduct an exhaustive evaluation of the most commonly used coreset algorithms from theory and practice.

preprint · 2022 · arXiv

Scalable Differentially Private Clustering via Hierarchically Separated Trees

We study the private $k$-median and $k$-means clustering problem in $d$-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \varepsilon^2)$, where $\varepsilon$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular, we show that our private algorithms can be implemented in a logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other private clustering baselines.

preprint · 2022 · arXiv

Towards Optimal Lower Bounds for k-median and k-means Coresets

Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous $k$-median problem ($z = 1$) and $k$-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis, and massive data applications have given rise to the notion of a coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing the input to the problem from large to small scale. In this paper, we present improved lower bounds for coresets in various metric spaces. In finite metrics consisting of $n$ points and doubling metrics with doubling constant $D$, we show that any coreset for $(k,z)$-clustering must consist of at least $\Omega(k \varepsilon^{-2} \log n)$ and $\Omega(k \varepsilon^{-2} D)$ points, respectively. Both bounds match previous upper bounds up to polylog factors. In Euclidean spaces, we show that any coreset for $(k,z)$-clustering must consist of at least $\Omega(k\varepsilon^{-2})$ points. We complement these lower bounds with a coreset construction consisting of at most $\tilde{O}(k\varepsilon^{-2}\cdot \min(\varepsilon^{-z},k))$ points.
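
For reference, the coreset guarantee that this abstract describes in words can be written out as below. The notation is an assumption chosen here for illustration and is not taken from the paper: $P$ is the input point set, $C \subseteq P$ is the coreset with weights $w$, $d$ is the metric, and $S$ ranges over candidate solutions.

```latex
\[
\mathrm{cost}_z(P, S) = \sum_{p \in P} \min_{s \in S} d(p, s)^z,
\qquad
\mathrm{cost}_z(C, w, S) = \sum_{q \in C} w(q) \min_{s \in S} d(q, s)^z,
\]
\[
(1 - \varepsilon)\,\mathrm{cost}_z(P, S)
\;\le\; \mathrm{cost}_z(C, w, S)
\;\le\; (1 + \varepsilon)\,\mathrm{cost}_z(P, S)
\quad \text{for every set } S \text{ of } k \text{ centers.}
\]
```

The lower bounds quoted in the abstract are statements about how small $|C|$ can be while this two-sided inequality still holds for all $S$.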

preprint · 2021 · arXiv

Efficient Similarity Search in Dynamic Data Streams

The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs due to both time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $\frac{\vert A \cap B \vert}{\vert A\cup B\vert}$ of two item sets $A$ and $B$ is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015], who showed that any rational set similarity $S$ admits a locality-sensitive hashing (LSH) scheme if and only if the corresponding distance $1-S$ is a metric, we show that there exists a space-efficient summary maintaining a $(1\pm \varepsilon)$ multiplicative approximation to $1-S$ in dynamic data streams. This in turn also yields an $\varepsilon$-additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply, an LSH scheme in dynamic data streams. Our second and main contribution lies in the design of such an LSH scheme maintainable in dynamic data streams. The scheme is space-efficient, easy to implement and, to the best of our knowledge, the first of its kind able to process deletions.
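
To make the baseline in this abstract concrete, here is a minimal Python sketch of the exact Jaccard index next to a classical min-hash estimate. This is the standard insertion-only technique, not the deletion-tolerant scheme the paper contributes. It assumes items are non-negative integers smaller than the modulus, and it uses random linear hash functions, which are only approximately min-wise independent; all function names are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus

def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def make_hashes(num_hashes, seed=0):
    """Draw (a, b) pairs defining linear hashes x -> (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME))
            for _ in range(num_hashes)]

def minhash_signature(items, hashes):
    """One minimum hash value per hash function; items must be non-empty."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature coordinates on which the two sets agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

# Toy usage (hypothetical data): two overlapping integer ranges.
hashes = make_hashes(256)
A = set(range(0, 100))
B = set(range(50, 150))
exact = jaccard(A, B)
approx = estimate_jaccard(minhash_signature(A, hashes),
                          minhash_signature(B, hashes))
```

Each signature coordinate can only decrease as items arrive, which is why this style of sketch handles additions but has no way to undo a minimum once the item that produced it is deleted.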

preprint · 2021 · arXiv

Fair Clustering with Multiple Colors

A fair clustering instance consists of a data set $A$ in which every point is assigned some color. Colors correspond to various protected attributes such as sex, ethnicity, or age. A fair clustering is a clustering in which the membership of points in a cluster is uncorrelated with the coloring of the points. Of particular interest is the case where all colors are equally represented. If we have exactly two colors, Chierichetti, Kumar, Lattanzi and Vassilvitskii (NIPS 2017) showed that various $k$-clustering objectives admit a constant factor approximation. Since then, a number of follow-up works have attempted to extend this result to a multi-color case, though so far, the only known results either yield non-constant factor approximations, apply only to special clustering objectives such as $k$-center, yield bicriteria approximations, or require $k$ to be constant. In this paper, we present a simple reduction from unconstrained $k$-clustering to fair $k$-clustering for a large range of clustering objectives including $k$-median, $k$-means, and $k$-center. The reduction loses only a constant factor in the approximation guarantee, marking the first true constant factor approximation for many of these problems.

preprint · 2021 · arXiv

Fair Coresets and Streaming Algorithms for Fair k-Means Clustering

We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyd's algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.

preprint · 2021 · arXiv

On Coresets for Logistic Regression

Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear-sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state-of-the-art methods in the area. The experiments are conducted on real-world benchmark data for logistic regression.

preprint · 2021 · arXiv

Principal Fairness: Removing Bias via Projections

Reducing hidden bias in the data and ensuring fairness in algorithmic data analysis has recently received significant attention. We complement several recent papers in this line of research by introducing a general method to reduce bias in the data through random projections in a "fair" subspace. We apply this method to the densest subgraph problem. For densest subgraph, our approach based on fair projections allows us to recover, both theoretically and empirically, an almost optimal, fair, dense subgraph hidden in the input data. We also show that, under the small set expansion hypothesis, approximating this problem beyond a factor of 2 is NP-hard, and we show a polynomial time algorithm with a matching approximation bound.

preprint · 2020 · arXiv

Algorithms for Fair Team Formation in Online Labour Marketplaces

As freelancing work keeps on growing almost everywhere due to a sharp decrease in communication costs and to the widespread adoption of Internet-based labour marketplaces (e.g., guru.com, freelancer.com, mturk.com, upwork.com), many researchers and practitioners have started exploring the benefits of outsourcing and crowdsourcing. Since employers often use these platforms to find a group of workers to complete a specific task, researchers have focused their efforts on the study of team formation and matching algorithms and on the design of effective incentive schemes. Nevertheless, several concerns have recently been raised about possibly unfair biases introduced through the algorithms used to carry out these selection and matching procedures. For this reason, researchers have started studying the fairness of algorithms related to these online marketplaces, looking for intelligent ways to overcome the algorithmic bias that frequently arises. Broadly speaking, the aim is to guarantee that, for example, the process of hiring workers through the use of machine learning and algorithmic data analysis tools does not discriminate, even unintentionally, on grounds of nationality or gender. In this short paper, we define the Fair Team Formation problem in the following way: given an online labour marketplace where each worker possesses one or more skills, and where all workers are divided into two or more non-overlapping classes (for example, men and women), we want to design an algorithm that is able to find a team with all the skills needed to complete a given task, and that has the same number of people from all classes. We provide inapproximability results for the Fair Team Formation problem together with four algorithms for the problem itself. We also test the effectiveness of our algorithmic solutions by performing experiments using real data from an online labour marketplace.
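
The problem definition in this abstract can be sketched as a feasibility check: a team is acceptable if it covers all required skills and contains the same number of workers from every class. The Python below only verifies a given team and does not implement any of the paper's four algorithms; the data layout (a worker as a pair of skill set and class label) and the name `is_fair_team` are assumptions made for illustration.

```python
from collections import Counter

def is_fair_team(team, required_skills, all_classes):
    """Check skill coverage and equal class representation.

    team: list of (skills, cls) pairs, where skills is a set of strings
          and cls is the worker's class label (hypothetical layout).
    """
    # Union of the skills of all team members.
    covered = set().union(*(skills for skills, _ in team))
    if not set(required_skills) <= covered:
        return False
    # Every class must contribute the same number of workers.
    counts = Counter(cls for _, cls in team)
    sizes = {counts.get(c, 0) for c in all_classes}
    return len(sizes) == 1

# Toy usage (hypothetical workers and skills).
team = [({"python"}, "A"), ({"design"}, "B")]
balanced = is_fair_team(team, {"python", "design"}, ["A", "B"])
unbalanced = is_fair_team(team + [({"sql"}, "A")], {"python", "design"}, ["A", "B"])
```

Checking a team is easy; the abstract's inapproximability results concern the harder task of finding a small team that passes this check.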

preprint · 2020 · arXiv

Random Projections for k-Means: Maintaining Coresets Beyond Merge & Reduce

We give a new construction for a small space summary satisfying the coreset guarantee of a data set with respect to the $k$-means objective function. The number of points required in an offline construction is in $\tilde{O}(k \varepsilon^{-2}\min(d,k\varepsilon^{-2}))$, which is minimal among all available constructions. Aside from two constructions with exponential dependence on the dimension, all known coresets are maintained in data streams via the merge and reduce framework, which incurs a large space dependency on $\log n$. Instead, our construction crucially relies on Johnson-Lindenstrauss type embeddings, which, combined with results from online algorithms, give us a new technique for efficiently maintaining coresets in data streams without relying on merge and reduce. The final number of points stored by our algorithm in a data stream is in $\tilde{O}(k^2 \varepsilon^{-2} \log^2 n \min(d,k\varepsilon^{-2}))$.

preprint · 2015 · arXiv

Sublinear Estimation of Weighted Matchings in Dynamic Data Streams

This paper presents an algorithm for estimating the weight of a maximum weighted matching by augmenting any estimation routine for the size of an unweighted matching. The algorithm is implementable in any streaming model, including dynamic graph streams. We also give the first constant-factor estimation for the maximum matching size in a dynamic graph stream for planar graphs (or any graph with bounded arboricity) using $\tilde{O}(n^{4/5})$ space, which also extends to weighted matching. Using previous results by Kapralov, Khanna, and Sudan (2014), we obtain a $\mathrm{polylog}(n)$ approximation for general graphs using $\mathrm{polylog}(n)$ space in random-order streams. In addition, we give a space lower bound of $\Omega(n^{1-\varepsilon})$ for any randomized algorithm estimating the size of a maximum matching up to a $1+O(\varepsilon)$ factor for adversarial streams.
