Source author record

Christian Sohler

Christian Sohler appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Machine Learning Computational Geometry Computation Discrete Mathematics Social and Information Networks

Catalog footprint

What is connected

13works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Constant Approximation for Normalized Modularity and Associations Clustering

We study the problem of graph clustering under a broad class of objectives in which the quality of a cluster is defined based on the ratio between the number of edges in the cluster, and the total weight of vertices in the cluster. We show that our definition is closely related to popular clustering measures, namely normalized associations, which is a dual of the normalized cut objective, and normalized modularity. We give a linear time constant-approximate algorithm for our objective, which implies the first constant-factor approximation algorithms for normalized modularity and normalized associations.

preprint2022arXiv

Motif Cut Sparsifiers

A motif is a frequently occurring subgraph of a given directed or undirected graph $G$. Motifs capture higher order organizational structure of $G$ beyond edge relationships, and, therefore, have found wide applications such as in graph clustering, community detection, and analysis of biological and physical networks to name a few. In these applications, the cut structure of motifs plays a crucial role as vertices are partitioned into clusters by cuts whose conductance is based on the number of instances of a particular motif, as opposed to just the number of edges, crossing the cuts. In this paper, we introduce the concept of a motif cut sparsifier. We show that one can compute in polynomial time a sparse weighted subgraph $G'$ with only $\widetilde{O}(n/ε^2)$ edges such that for every cut, the weighted number of copies of $M$ crossing the cut in $G'$ is within a $1+ε$ factor of the number of copies of $M$ crossing the cut in $G$, for every constant size motif $M$. Our work carefully combines the viewpoints of both graph sparsification and hypergraph sparsification. We sample edges which requires us to extend and strengthen the concept of cut sparsifiers introduced in the seminal work of to the motif setting. We adapt the importance sampling framework through the viewpoint of hypergraph sparsification by deriving the edge sampling probabilities from the strong connectivity values of a hypergraph whose hyperedges represent motif instances. Finally, an iterative sparsification primitive inspired by both viewpoints is used to reduce the number of edges in $G$ to nearly linear. In addition, we present a strong lower bound ruling out a similar result for sparsification with respect to induced occurrences of motifs.

preprint2022arXiv

Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension

We obtain the first strong coresets for the $k$-median and subspace approximation problems with sum of distances objective function, on $n$ points in $d$ dimensions, with a number of weighted points that is independent of both $n$ and $d$; namely, our coresets have size $\text{poly}(k/ε)$. A strong coreset $(1+ε)$-approximates the cost function for all possible sets of centers simultaneously. We also give efficient $\text{nnz}(A) + (n+d)\text{poly}(k/ε) + \exp(\text{poly}(k/ε))$ time algorithms for computing these coresets. We obtain the result by introducing a new dimensionality reduction technique for coresets that significantly generalizes an earlier result of Feldman, Sohler and Schmidt \cite{FSS13} for squared Euclidean distances to sums of $p$-th powers of Euclidean distances for constant $p\ge1$.

preprint2021arXiv

Fair Coresets and Streaming Algorithms for Fair k-Means Clustering

We study fair clustering problems as proposed by Chierichetti et al. (NIPS 2017). Here, points have a sensitive attribute and all clusters in the solution are required to be balanced with respect to it (to counteract any form of data-inherent bias). Previous algorithms for fair clustering do not scale well. We show how to model and compute so-called coresets for fair clustering problems, which can be used to significantly reduce the input data size. We prove that the coresets are composable and show how to compute them in a streaming setting. Furthermore, we propose a variant of Lloyd's algorithm that computes fair clusterings and extend it to a fair k-means++ clustering algorithm. We implement these algorithms and provide empirical evidence that the combination of our approximation algorithms and the coreset construction yields a scalable algorithm for fair k-means clustering.

preprint2021arXiv

On Coresets for Logistic Regression

Coresets are one of the central methods to facilitate the analysis of large data sets. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show a negative result, namely, that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $μ(X)$, which quantifies the hardness of compressing a data set for logistic regression. $μ(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $μ(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\pm\varepsilon)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.

preprint2016arXiv

Theoretical Analysis of the $k$-Means Algorithm - A Survey

The $k$-means algorithm is one of the most widely used clustering heuristics. Despite its simplicity, analyzing its running time and quality of approximation is surprisingly difficult and can lead to deep insights that can be used to improve the algorithm. In this paper we survey the recent results in this direction as well as several extension of the basic $k$-means method.

preprint2015arXiv

Clustering time series under the Fréchet distance

The Fréchet distance is a popular distance measure for curves. We study the problem of clustering time series under the Fréchet distance. In particular, we give $(1+\varepsilon)$-approximation algorithms for variations of the following problem with parameters $k$ and $\ell$. Given $n$ univariate time series $P$, each of complexity at most $m$, we find $k$ time series, not necessarily from $P$, which we call \emph{cluster centers} and which each have complexity at most $\ell$, such that (a) the maximum distance of an element of $P$ to its nearest cluster center or (b) the sum of these distances is minimized. Our algorithms have running time near-linear in the input size for constant $\varepsilon$, $k$ and $\ell$. To the best of our knowledge, our algorithms are the first clustering algorithms for the Fréchet distance which achieve an approximation factor of $(1+\varepsilon)$ or better. Keywords: time series, longitudinal data, functional data, clustering, Fréchet distance, dynamic time warping, approximation algorithms.

preprint2015arXiv

Random projections for Bayesian regression

This article deals with random projections applied as a data reduction technique for Bayesian regression analysis. We show sufficient conditions under which the entire $d$-dimensional distribution is approximately preserved under random projections by reducing the number of data points from $n$ to $k\in O(\operatorname{poly}(d/\varepsilon))$ in the case $n\gg d$. Under mild assumptions, we prove that evaluating a Gaussian likelihood function based on the projected data instead of the original data yields a $(1+O(\varepsilon))$-approximation in terms of the $\ell_2$ Wasserstein distance. Our main result shows that the posterior distribution of Bayesian linear regression is approximated up to a small error depending on only an $\varepsilon$-fraction of its defining parameters. This holds when using arbitrary Gaussian priors or the degenerate case of uniform distributions over $\mathbb{R}^d$ for $β$. Our empirical evaluations involve different simulated settings of Bayesian linear regression. Our experiments underline that the proposed method is able to recover the regression model up to small error while considerably reducing the total running time.

preprint2015arXiv

Testing Cluster Structure of Graphs

We study the problem of recognizing the cluster structure of a graph in the framework of property testing in the bounded degree model. Given a parameter $\varepsilon$, a $d$-bounded degree graph is defined to be $(k, ϕ)$-clusterable, if it can be partitioned into no more than $k$ parts, such that the (inner) conductance of the induced subgraph on each part is at least $ϕ$ and the (outer) conductance of each part is at most $c_{d,k}\varepsilon^4ϕ^2$, where $c_{d,k}$ depends only on $d,k$. Our main result is a sublinear algorithm with the running time $\widetilde{O}(\sqrt{n}\cdot\mathrm{poly}(ϕ,k,1/\varepsilon))$ that takes as input a graph with maximum degree bounded by $d$, parameters $k$, $ϕ$, $\varepsilon$, and with probability at least $\frac23$, accepts the graph if it is $(k,ϕ)$-clusterable and rejects the graph if it is $\varepsilon$-far from $(k, ϕ^*)$-clusterable for $ϕ^* = c'_{d,k}\frac{ϕ^2 \varepsilon^4}{\log n}$, where $c'_{d,k}$ depends only on $d,k$. By the lower bound of $Ω(\sqrt{n})$ on the number of queries needed for testing graph expansion, which corresponds to $k=1$ in our problem, our algorithm is asymptotically optimal up to polylogarithmic factors.

preprint2014arXiv

Analysis of Agglomerative Clustering

The diameter $k$-clustering problem is the problem of partitioning a finite subset of $\mathbb{R}^d$ into $k$ subsets called clusters such that the maximum diameter of the clusters is minimized. One early clustering algorithm that computes a hierarchy of approximate solutions to this problem (for all values of $k$) is the agglomerative clustering algorithm with the complete linkage strategy. For decades, this algorithm has been widely used by practitioners. However, it is not well studied theoretically. In this paper, we analyze the agglomerative complete linkage clustering algorithm. Assuming that the dimension $d$ is a constant, we show that for any $k$ the solution computed by this algorithm is an $O(\log k)$-approximation to the diameter $k$-clustering problem. Our analysis does not only hold for the Euclidean distance but for any metric that is based on a norm. Furthermore, we analyze the closely related $k$-center and discrete $k$-center problem. For the corresponding agglomerative algorithms, we deduce an approximation factor of $O(\log k)$ as well.

preprint2014arXiv

Asymptotically exact streaming algorithms

We introduce a new computational model for data streams: asymptotically exact streaming algorithms. These algorithms have an approximation ratio that tends to one as the length of the stream goes to infinity while the memory used by the algorithm is restricted to polylog(n) size. Thus, the output of the algorithm is optimal in the limit. We show positive results in our model for a series of important problems that have been discussed in the streaming literature. These include computing the frequency moments, clustering problems and least squares regression. Our results also include lower bounds for problems, which have streaming algorithms in the ordinary setting but do not allow for sublinear space algorithms in our model.

preprint2013arXiv

Property-Testing in Sparse Directed Graphs: 3-Star-Freeness and Connectivity

We study property testing in directed graphs in the bounded degree model, where we assume that an algorithm may only query the outgoing edges of a vertex, a model proposed by Bender and Ron in 2002. As our first main result, we we present a property testing algorithm for strong connectivity in this model, having a query complexity of $\mathcal{O}(n^{1-ε/(3+α)})$ for arbitrary $α>0$; it is based on a reduction to estimating the vertex indegree distribution. For subgraph-freeness we give a property testing algorithm with a query complexity of $\mathcal{O}(n^{1-1/k})$, where $k$ is the number of connected componentes in the queried subgraph which have no incoming edge. We furthermore take a look at the problem of testing whether a weakly connected graph contains vertices with a degree of least $3$, which can be viewed as testing for freeness of all orientations of $3$-stars; as our second main result, we show that this property can be tested with a query complexity of $\mathcal{O}(\sqrt{n})$ instead of, what would be expected, $Ω(n^{2/3})$.

preprint2012arXiv

Finding Cycles and Trees in Sublinear Time

We present sublinear-time (randomized) algorithms for finding simple cycles of length at least $k\geq 3$ and tree-minors in bounded-degree graphs. The complexity of these algorithms is related to the distance of the graph from being $C_k$-minor-free (resp., free from having the corresponding tree-minor). In particular, if the graph is far (i.e., $Ω(1)$-far) {from} being cycle-free, i.e. if one has to delete a constant fraction of edges to make it cycle-free, then the algorithm finds a cycle of polylogarithmic length in time $\tildeO(\sqrt{N})$, where $N$ denotes the number of vertices. This time complexity is optimal up to polylogarithmic factors. The foregoing results are the outcome of our study of the complexity of {\em one-sided error} property testing algorithms in the bounded-degree graphs model. For example, we show that cycle-freeness of $N$-vertex graphs can be tested with one-sided error within time complexity $\tildeO(\poly(1/\e)\cdot\sqrt{N})$. This matches the known $Ω(\sqrt{N})$ query lower bound, and contrasts with the fact that any minor-free property admits a {\em two-sided error} tester of query complexity that only depends on the proximity parameter $\e$. For any constant $k\geq3$, we extend this result to testing whether the input graph has a simple cycle of length at least $k$. On the other hand, for any fixed tree $T$, we show that $T$-minor-freeness has a one-sided error tester of query complexity that only depends on the proximity parameter $\e$. Our algorithm for finding cycles in bounded-degree graphs extends to general graphs, where distances are measured with respect to the actual number of edges. Such an extension is not possible with respect to finding tree-minors in $o(\sqrt{N})$ complexity.

Christian Sohler

What is connected

Connect this record

See the researcher in context

Building this map preview

13 published item(s)

Constant Approximation for Normalized Modularity and Associations Clustering

Motif Cut Sparsifiers

Strong Coresets for k-Median and Subspace Approximation: Goodbye Dimension

Fair Coresets and Streaming Algorithms for Fair k-Means Clustering

On Coresets for Logistic Regression

Theoretical Analysis of the $k$-Means Algorithm - A Survey

Clustering time series under the Fréchet distance

Random projections for Bayesian regression

Testing Cluster Structure of Graphs

Analysis of Agglomerative Clustering

Asymptotically exact streaming algorithms

Property-Testing in Sparse Directed Graphs: 3-Star-Freeness and Connectivity

Finding Cycles and Trees in Sublinear Time