Source author record

Sepideh Mahabadi

Sepideh Mahabadi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: Data Structures and Algorithms; Computational Geometry; Machine Learning; Computational Complexity; Databases; Discrete Mathematics

Catalog footprint

10 works, 6 topics, 4 close collaborators


Preprint, 2026, arXiv

Improved Algorithms for Fair Matroid Submodular Maximization

Submodular maximization subject to matroid constraints is a central problem with many applications in machine learning. As algorithms are increasingly used in decision-making over datapoints with sensitive attributes such as gender or race, it is becoming crucial to enforce fairness to avoid bias and discrimination. Recent work has addressed the challenge of developing efficient approximation algorithms for fair matroid submodular maximization. However, the best algorithms known so far are only guaranteed to satisfy a relaxed version of the fairness constraints that loses a factor 2, i.e., the problem may ask for $\ell$ elements with a given attribute, but the algorithm is only guaranteed to find $\lfloor \ell/2 \rfloor$. In particular, there is no provable guarantee when $\ell=1$, which corresponds to a key special case of perfect matching constraints. In this work, we achieve a new trade-off via an algorithm that gets arbitrarily close to full fairness. Namely, for any constant $\varepsilon>0$, we give a constant-factor approximation to fair monotone matroid submodular maximization that in expectation loses only a factor $(1-\varepsilon)$ in the lower-bound fairness constraint. Our empirical evaluation on a standard suite of real-world datasets -- including clustering, recommendation, and coverage tasks -- demonstrates the practical effectiveness of our methods.

Preprint, 2022, arXiv

Adaptive Sketches for Robust Regression with Importance Sampling

We introduce data structures for solving robust regression through stochastic gradient descent (SGD) by sampling gradients with probability proportional to their norm, i.e., importance sampling. Although SGD is widely used for large scale machine learning, it is well-known for possibly experiencing slow convergence rates due to the high variance from uniform sampling. On the other hand, importance sampling can significantly decrease the variance but is usually difficult to implement because computing the sampling probabilities requires additional passes over the data, in which case standard gradient descent (GD) could be used instead. In this paper, we introduce an algorithm that approximately samples $T$ gradients of dimension $d$ from nearly the optimal importance sampling distribution for a robust regression problem over $n$ rows. Thus our algorithm effectively runs $T$ steps of SGD with importance sampling while using sublinear space and just making a single pass over the data. Our techniques also extend to performing importance sampling for second-order optimization.
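As a minimal sketch of the sampling rule described above (not the paper's sketching data structure), the snippet below draws one gradient with probability proportional to its Euclidean norm and reweights it so that its expectation equals the mean gradient. It assumes all gradients are already available in memory, which is exactly the extra pass the abstract's algorithm is designed to avoid. The function names are hypothetical.

```python
import math
import random

def importance_distribution(grads):
    """Return p_i proportional to the Euclidean norm of gradient i."""
    norms = [math.sqrt(sum(c * c for c in g)) for g in grads]
    total = sum(norms)
    if total == 0.0:
        # All gradients vanish: any distribution works, so use uniform.
        return [1.0 / len(grads)] * len(grads)
    return [v / total for v in norms]

def sample_gradient(grads, rng):
    """Draw one gradient and reweight it to be unbiased for the mean gradient."""
    probs = importance_distribution(grads)
    i = rng.choices(range(len(grads)), weights=probs, k=1)[0]
    scale = 1.0 / (len(grads) * probs[i])  # inverse-probability weight
    return i, tuple(scale * c for c in grads[i])

def expected_estimate(grads):
    """Exact expectation of the reweighted sample (for checking unbiasedness)."""
    probs = importance_distribution(grads)
    n, d = len(grads), len(grads[0])
    return tuple(
        sum(p * g[j] / (n * p) for p, g in zip(probs, grads) if p > 0.0)
        for j in range(d)
    )

if __name__ == "__main__":
    grads = [(3.0, 4.0), (0.0, 1.0), (0.0, 0.0)]
    print(importance_distribution(grads))
    print(sample_gradient(grads, random.Random(0)))
    print(expected_estimate(grads))
```

Gradients with zero norm are never sampled, which does not bias the estimate because they contribute nothing to the mean.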

Preprint, 2021, arXiv

Sampling a Near Neighbor in High Dimensions -- Who is the Fairest of Them All?

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points $S$ and a radius parameter $r>0$, the $r$-near neighbor ($r$-NN) problem asks for a data structure that, given any query point $q$, returns a point $p$ within distance at most $r$ from $q$. In this paper, we study the $r$-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance $r$ from the query should have the same probability to be returned. In the low-dimensional case, this problem was first studied by Hu, Qiao, and Tao (PODS 2014). Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also develop a data structure for fair similarity search under inner product that requires nearly-linear space and exploits locality sensitive filters. The paper concludes with an experimental evaluation that highlights the inherent unfairness of NN data structures and shows the performance of our algorithms on real-world datasets.
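The abstract mentions sampling uniformly from the union of a sub-collection of sets. One textbook way to do this is rejection sampling, sketched below: pick a set with probability proportional to its size, pick a uniform element of it, and accept with probability one over the number of sets containing that element. This is an illustrative baseline, not the data structure from the paper. Reading the sets as, for example, the hash buckets a query falls into is an assumption made here for intuition, and the function name is hypothetical.

```python
import random

def sample_uniform_from_union(sets, rng):
    """Return an element of the union of `sets`, each with equal probability."""
    # Sorting only makes runs reproducible; elements are assumed orderable.
    lists = [sorted(s) for s in sets if s]
    if not lists:
        raise ValueError("the union is empty")
    weights = [len(members) for members in lists]
    while True:
        chosen = rng.choices(lists, weights=weights, k=1)[0]
        x = rng.choice(chosen)
        # x was reachable through every set that contains it; cancel that bias.
        degree = sum(1 for s in sets if x in s)
        if rng.random() < 1.0 / degree:
            return x

if __name__ == "__main__":
    rng = random.Random(1)
    buckets = [{1, 2}, {2, 3}, {2}]
    print([sample_uniform_from_union(buckets, rng) for _ in range(10)])
```

In one round, an element is proposed and accepted with probability exactly one over the total size of all sets, regardless of how many sets contain it, so the accepted output is uniform over the union. Counting `degree` by scanning every set is the slow part that a real data structure would have to avoid.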

Preprint, 2021, arXiv

Towards Better Approximation of Graph Crossing Number

Graph Crossing Number is a fundamental problem with various applications. In this problem, the goal is to draw an input graph $G$ in the plane so as to minimize the number of crossings between the images of its edges. Despite extensive work, non-trivial approximation algorithms are only known for bounded-degree graphs. Even for this special case, the best current algorithm achieves a $\tilde O(\sqrt n)$-approximation, while the best current negative result is APX-hardness. All current approximation algorithms for the problem build on the same paradigm: compute a set $E'$ of edges (called a planarizing set) such that $G\setminus E'$ is planar; compute a planar drawing of $G\setminus E'$; then add the drawings of the edges of $E'$ to the resulting drawing. Unfortunately, there are examples of graphs in which any implementation of this method must incur $\Omega(\text{OPT}^2)$ crossings, where $\text{OPT}$ is the value of the optimal solution. This barrier seems to doom the only known approach to designing approximation algorithms for the problem, and to prevent it from yielding a better than $O(\sqrt n)$-approximation. In this paper we propose a new paradigm that allows us to overcome this barrier. We show an algorithm that, given a bounded-degree graph $G$ and a planarizing set $E'$ of its edges, computes another set $E''$ with $E'\subseteq E''$, such that $|E''|$ is relatively small, and there exists a near-optimal drawing of $G$ in which only edges of $E''$ participate in crossings. This allows us to reduce the Crossing Number problem to Crossing Number with Rotation System, a variant in which the ordering of the edges incident to every vertex is fixed as part of the input. We show a randomized algorithm for this new problem that allows us to obtain an $O(n^{1/2-\epsilon})$-approximation for Crossing Number on bounded-degree graphs, for some constant $\epsilon>0$.

Preprint, 2020, arXiv

Individual Fairness for $k$-Clustering

We give a local search based algorithm for $k$-median and $k$-means (and more generally for any $k$-clustering with $\ell_p$ norm cost function) from the perspective of individual fairness. More precisely, for a point $x$ in a point set $P$ of size $n$, let $r(x)$ be the minimum radius such that the ball of radius $r(x)$ centered at $x$ has at least $n/k$ points from $P$. Intuitively, if a set of $k$ random points are chosen from $P$ as centers, every point $x\in P$ expects to have a center within radius $r(x)$. An individually fair clustering provides such a guarantee for every point $x\in P$. This notion of fairness was introduced in [Jung et al., 2019] where they showed how to get an approximately feasible $k$-clustering with respect to this fairness condition. In this work, we show how to get a bicriteria approximation for fair $k$-clustering: The $k$-median ($k$-means) cost of our solution is within a constant factor of the cost of an optimal fair $k$-clustering, and our solution approximately satisfies the fairness condition (also within a constant factor). Further, we complement our theoretical bounds with empirical evaluation.
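The fairness radius $r(x)$ defined above can be computed directly by brute force. The sketch below assumes closed balls, that $x$ counts toward its own ball, and that "at least $n/k$ points" means $\lceil n/k \rceil$ points. It also reports the smallest factor $\alpha$ for which a given set of centers puts a center within $\alpha \cdot r(x)$ of every point. Function names are hypothetical, and this is a checker, not the local search algorithm from the abstract.

```python
import math

def fair_radii(points, k):
    """r(x): smallest radius whose closed ball around x holds ceil(n/k) points."""
    need = math.ceil(len(points) / k)
    return [sorted(math.dist(x, y) for y in points)[need - 1] for x in points]

def fairness_ratio(points, centers, k):
    """Smallest alpha such that every x has a center within alpha * r(x)."""
    worst = 0.0
    for x, r in zip(points, fair_radii(points, k)):
        d = min(math.dist(x, c) for c in centers)
        if d == 0.0:
            continue            # x is itself a center
        if r == 0.0:
            return math.inf     # duplicates force r(x) = 0 but no center sits on x
        worst = max(worst, d / r)
    return worst

if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
    print(fair_radii(pts, 2))
    print(fairness_ratio(pts, [(1.0,), (10.0,)], 2))
    print(fairness_ratio(pts, [(0.0,), (1.0,)], 2))
```

A ratio of at most 1 means the centers are individually fair in the exact sense; a constant ratio above 1 corresponds to the approximate fairness guarantee the abstract describes.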

Preprint, 2020, arXiv

Non-Adaptive Adaptive Sampling on Turnstile Streams

Adaptive sampling is a useful algorithmic tool for data summarization problems in the classical centralized setting, where the entire dataset is available to the single processor performing the computation. Adaptive sampling repeatedly selects rows of an underlying matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$, where $n\gg d$, with probabilities proportional to their distances to the subspace of the previously selected rows. Intuitively, adaptive sampling seems to be limited to trivial multi-pass algorithms in the streaming model of computation due to its inherently sequential nature of assigning sampling probabilities to each row only after the previous iteration is completed. Surprisingly, we show this is not the case by giving the first one-pass algorithms for adaptive sampling on turnstile streams and using space $\text{poly}(d,k,\log n)$, where $k$ is the number of adaptive sampling rounds to be performed. Our adaptive sampling procedure has a number of applications to various data summarization problems that either improve state-of-the-art or have only been previously studied in the more relaxed row-arrival model. We give the first relative-error algorithms for column subset selection, subspace approximation, projective clustering, and volume maximization on turnstile streams that use space sublinear in $n$. We complement our volume maximization algorithmic results with lower bounds that are tight up to lower order terms, even for multi-pass algorithms. By a similar construction, we also obtain lower bounds for volume maximization in the row-arrival model, which we match with competitive upper bounds. See paper for full abstract.
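For orientation, here is a sketch of the classical centralized (multi-round) adaptive sampling loop that the abstract starts from: each round picks a row with probability proportional to its distance to the span of the rows chosen so far. It is not the one-pass turnstile algorithm. The function names and the tolerance `tol` are assumptions, and the projection uses plain Gram-Schmidt in pure Python.

```python
import math
import random

def residual(row, basis):
    """Component of `row` orthogonal to the span of an orthonormal `basis`."""
    r = list(row)
    for b in basis:
        coeff = sum(ri * bi for ri, bi in zip(r, b))
        r = [ri - coeff * bi for ri, bi in zip(r, b)]
    return r

def adaptive_sample(rows, k, rng, tol=1e-12):
    """Run up to k rounds; return the indices of the selected rows in order."""
    basis, chosen = [], []
    for _ in range(k):
        res = [residual(row, basis) for row in rows]
        dist = [math.sqrt(sum(c * c for c in r)) for r in res]
        dist = [d if d > tol else 0.0 for d in dist]  # treat tiny residuals as 0
        if sum(dist) == 0.0:
            break  # every row already lies in the current span
        i = rng.choices(range(len(rows)), weights=dist, k=1)[0]
        chosen.append(i)
        basis.append([c / dist[i] for c in res[i]])  # extend orthonormal basis
    return chosen

if __name__ == "__main__":
    rows = [(1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
    print(adaptive_sample(rows, 2, random.Random(0)))
```

Each round depends on the rows picked in earlier rounds, which is the sequential dependence the abstract describes as the apparent obstacle to a single-pass algorithm.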

Preprint, 2020, arXiv

Streaming Complexity of SVMs

We study the space complexity of solving the bias-regularized SVM problem in the streaming model. This is a classic supervised learning problem that has drawn lots of attention, including for developing fast algorithms for solving the problem approximately. One of the most widely used algorithms for approximately optimizing the SVM objective is Stochastic Gradient Descent (SGD), which requires only $O(\frac{1}{\lambda\epsilon})$ random samples, and which immediately yields a streaming algorithm that uses $O(\frac{d}{\lambda\epsilon})$ space. For related problems, better streaming algorithms are only known for smooth functions, unlike the SVM objective that we focus on in this work. We initiate an investigation of the space complexity for both finding an approximate optimum of this objective, and for the related "point estimation" problem of sketching the data set to evaluate the function value $F_\lambda$ on any query $(\theta, b)$. We show that, for both problems, for dimensions $d=1,2$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$, which is the complexity of SGD for strongly convex functions like the bias-regularized SVM, and which is known to be tight in general, even for $d=1$. We also prove polynomial lower bounds for both point estimation and optimization. In particular, for point estimation we obtain a tight bound of $\Theta(1/\sqrt{\epsilon})$ for $d=1$ and a nearly tight lower bound of $\widetilde{\Omega}(d/\epsilon^2)$ for $d = \Omega(\log(1/\epsilon))$. Finally, for optimization, we prove a $\Omega(1/\sqrt{\epsilon})$ lower bound for $d = \Omega(\log(1/\epsilon))$, and show similar bounds when $d$ is constant.

Preprint, 2016, arXiv

Simultaneous Nearest Neighbor Search

Motivated by applications in computer vision and databases, we introduce and study the Simultaneous Nearest Neighbor Search (SNN) problem. Given a set of data points, the goal of SNN is to design a data structure that, given a collection of queries, finds a collection of close points that are compatible with each other. Formally, we are given $k$ query points $Q=q_1,\cdots,q_k$, and a compatibility graph $G$ with vertices in $Q$, and the goal is to return data points $p_1,\cdots,p_k$ that minimize (i) the weighted sum of the distances from $q_i$ to $p_i$ and (ii) the weighted sum, over all edges $(i,j)$ in the compatibility graph $G$, of the distances between $p_i$ and $p_j$. The problem has several applications, where one wants to return a set of consistent answers to multiple related queries. This generalizes well-studied computational problems, including NN, Aggregate NN and the 0-extension problem. In this paper we propose and analyze the following general two-step method for designing efficient data structures for SNN. In the first step, for each query point $q_i$ we find its (approximate) nearest neighbor point $\hat{p}_i$; this can be done efficiently using existing approximate nearest neighbor structures. In the second step, we solve an off-line optimization problem over sets $q_1,\cdots,q_k$ and $\hat{p}_1,\cdots,\hat{p}_k$; this can be done efficiently given that $k$ is much smaller than $n$. Even though $\hat{p}_1,\cdots,\hat{p}_k$ might not constitute the optimal answers to queries $q_1,\cdots,q_k$, we show that, for the unweighted case, the resulting algorithm is an $O(\log k/\log \log k)$-approximation. Also, we show that the approximation factor can in fact be reduced to a constant for compatibility graphs frequently occurring in practice. Finally, we show that the "empirical approximation factor" provided by the above approach is very close to 1.
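The two-step method above can be illustrated on a toy instance. The sketch below makes several simplifying assumptions that are not from the paper: an exact linear scan stands in for the approximate nearest neighbor structure, query-to-answer distances have unit weight, the second step is read as choosing each answer from the candidate set $\hat{p}_1,\cdots,\hat{p}_k$, and that step is solved by brute force over all $k^k$ assignments, which is only viable for very small $k$. All names are hypothetical.

```python
import itertools
import math

def nearest(q, data):
    """Step 1 stand-in: exact nearest neighbor by linear scan."""
    return min(data, key=lambda p: math.dist(q, p))

def snn_cost(queries, answers, edges):
    """Query-to-answer distances plus weighted distances along graph edges."""
    cost = sum(math.dist(q, p) for q, p in zip(queries, answers))
    cost += sum(w * math.dist(answers[i], answers[j]) for i, j, w in edges)
    return cost

def two_step_snn(queries, data, edges):
    """edges: list of (i, j, weight) over query indices."""
    candidates = [nearest(q, data) for q in queries]           # step 1
    best = min(                                                # step 2
        itertools.product(candidates, repeat=len(queries)),
        key=lambda ans: snn_cost(queries, ans, edges),
    )
    return list(best), snn_cost(queries, best, edges)

if __name__ == "__main__":
    data = [(0.0,), (10.0,), (11.0,)]
    queries = [(4.0,), (9.0,)]
    print(two_step_snn(queries, data, []))
    print(two_step_snn(queries, data, [(0, 1, 1.0)]))
```

With no compatibility edges each query simply keeps its own nearest neighbor; adding one edge between the two queries makes it cheaper for both to share the same answer.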

Preprint, 2016, arXiv

Towards Tight Bounds for the Streaming Set Cover Problem

We consider the classic Set Cover problem in the data stream model. For $n$ elements and $m$ sets ($m\geq n$) we give an $O(1/\delta)$-pass algorithm with strongly sub-linear $\tilde{O}(mn^{\delta})$ space and a logarithmic approximation factor. This yields a significant improvement over the earlier algorithm of Demaine et al. [DIMV14], which uses an exponentially larger number of passes. We complement this result by showing that the tradeoff between the number of passes and space exhibited by our algorithm is tight, at least when the approximation factor is equal to $1$. Specifically, we show that any algorithm that computes set cover exactly using $(\frac{1}{2\delta}-1)$ passes must use $\tilde{\Omega}(mn^{\delta})$ space in the regime of $m=O(n)$. Furthermore, we consider the problem in the geometric setting where the elements are points in $\mathbb{R}^2$ and sets are either discs, axis-parallel rectangles, or fat triangles in the plane, and show that our algorithm (with a slight modification) uses the optimal $\tilde{O}(n)$ space to find a logarithmic approximation in $O(1/\delta)$ passes. Finally, we show that any randomized one-pass algorithm that distinguishes between covers of size 2 and 3 must use a linear (i.e., $\Omega(mn)$) amount of space. This is the first result showing that a randomized, approximate algorithm cannot achieve a space bound that is sublinear in the input size. This indicates that using multiple passes might be necessary in order to achieve sub-linear space bounds for this problem while guaranteeing small approximation factors.

Preprint, 2015, arXiv

Proximity in the Age of Distraction: Robust Approximate Nearest Neighbor Search

We introduce a new variant of the nearest neighbor search problem, which allows for some coordinates of the dataset to be arbitrarily corrupted or unknown. Formally, given a dataset of $n$ points $P=\{ x_1,\ldots, x_n\}$ in high dimensions, and a parameter $k$, the goal is to preprocess the dataset such that, given a query point $q$, one can quickly compute a point $x \in P$ such that the distance of the query to the point $x$ is minimized when ignoring the "optimal" $k$ coordinates. Note that the coordinates being ignored are a function of both the query point and the point returned. We present a general reduction from this problem to answering ANN queries, which is similar in spirit to LSH (locality sensitive hashing) [IM98]. Specifically, we give a sampling technique which achieves a bi-criterion approximation for this problem. If the distance to the nearest neighbor after ignoring $k$ coordinates is $r$, the data structure returns a point that is within a distance of $O(r)$ after ignoring $O(k)$ coordinates. We also present other applications and further extensions and refinements of the above result. The new data structures are simple and (arguably) elegant, and should be practical: specifically, all bounds are polynomial in all relevant parameters (including the dimension of the space, and the robustness parameter $k$).
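To make the problem definition concrete, the following brute-force sketch computes the robust distance between a query and a point, then scans the whole dataset. It assumes an $\ell_p$ distance, under which the best $k$ coordinates to ignore for a fixed pair are the $k$ with the largest absolute differences. This is a linear-scan baseline for illustration, not the sampling-based reduction from the abstract, and the function names are hypothetical.

```python
def robust_distance(q, x, k, p=2):
    """l_p distance between q and x after dropping the k largest differences."""
    diffs = sorted(abs(a - b) for a, b in zip(q, x))
    kept = diffs[: max(len(diffs) - k, 0)]
    return sum(d ** p for d in kept) ** (1.0 / p)

def robust_nearest(q, points, k, p=2):
    """Index of the point minimizing the robust distance to q (linear scan)."""
    return min(range(len(points)),
               key=lambda i: robust_distance(q, points[i], k, p))

if __name__ == "__main__":
    points = [(1.0, 100.0, 0.0), (5.0, 5.0, 5.0)]
    q = (0.0, 0.0, 0.0)
    print(robust_nearest(q, points, k=0, p=1))  # no coordinate ignored
    print(robust_nearest(q, points, k=1, p=1))  # one corrupted coordinate ignored
```

The ignored coordinates are chosen separately for each candidate point, matching the remark that they depend on both the query and the point returned.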
