Source author record

Samir Khuller

Samir Khuller appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Data Structures and Algorithms; Discrete Mathematics; Machine Learning; Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Computational Complexity; Computational Geometry; Databases; math.CO; Robotics

Catalog footprint

What is connected

21 works

10 topics

4 close collaborators


preprint, 2022, arXiv

Balancing Flow Time and Energy Consumption

In this paper, we study the following batch scheduling model: find a schedule that minimizes total flow time for $n$ uniform length jobs, with release times and deadlines, where the machine is only actively processing jobs in at most $k$ synchronized batches of size at most $B$. Prior work on such batch scheduling models has considered only feasibility with no regard to the flow time of the schedule. However, algorithms that minimize the cost from the scheduler's perspective -- such as ones that minimize the active time of the processor -- can result in schedules where the total flow time is arbitrarily high \cite{ChangGabowKhuller}. Such schedules are not valuable from the perspective of the client. In response, our work provides dynamic programs which minimize flow time subject to active time constraints. Our main contribution focuses on jobs with agreeable deadlines; for such job instances, we introduce dynamic programs that achieve runtimes of $O(B \cdot k \cdot n)$ for unit jobs and $O(B \cdot k \cdot n^5)$ for uniform length jobs. These results improve upon our modification of a different, classical dynamic programming approach by Baptiste. While the modified DP works when deadlines are non-agreeable, this solution is more expensive, with runtime $O(B \cdot k^2 \cdot n^7)$ \cite{Baptiste00}.

preprint, 2022, arXiv

Constant factor Approximation Algorithms for Uniform Hard Capacitated Facility Location Problems: Natural LP is not too bad

In this paper, we give the first constant-factor approximation for the capacitated knapsack median problem (CKM) with hard uniform capacities, violating the budget only by an additive factor of $f_{max}$, where $f_{max}$ is the maximum cost of a facility opened by the optimal solution, and violating capacities by a $(2+ε)$ factor. The natural LP for the problem is known to have an unbounded integrality gap when any one of the two constraints is allowed to be violated by a factor less than $2$. Thus, we present a result which is very close to the best achievable from the natural LP. To the best of our knowledge, the problem has not been studied earlier. For the capacitated facility location problem with uniform capacities, a constant-factor approximation algorithm is presented that violates the capacities by a small factor ($1 + ε$). Though constant-factor results are known for the problem without violating the capacities, the result is interesting as it is obtained by rounding the solution to the natural LP, which is known to have an unbounded integrality gap without violating the capacities. Thus, we achieve the best possible from the natural LP for the problem. The result shows that the natural LP is not too bad. Finally, we raise some issues with the proofs of the results presented in \cite{capkmByrkaFRS2013} for the capacitated $k$-facility location problem (C$k$FLP). \cite{capkmByrkaFRS2013} presents an $O(1/ε^2)$ approximation violating the capacities by a factor of $(2 + ε)$ using dependent rounding. We first fix these issues using our techniques. Also, it can be argued that (deterministic) pipage rounding cannot be used to open the facilities instead of dependent rounding. Our techniques for CKM provide a constant-factor approximation for C$k$FLP violating the capacities by $(2 + ε)$.

preprint, 2022, arXiv

Correlated Stochastic Knapsack with a Submodular Objective

We study the correlated stochastic knapsack problem with a submodular target function, with optional additional constraints. We utilize the multilinear extension of the submodular function, and bundle it with an adaptation of the relaxed linear constraints from Ma [Mathematics of Operations Research, Volume 43(3), 2018] on the correlated stochastic knapsack problem. The relaxation is then solved by the stochastic continuous greedy algorithm, and rounded by a novel method to fit the contention resolution scheme (Feldman et al. [FOCS 2011]). We obtain a pseudo-polynomial time $(1 - 1/\sqrt{e})/2 \simeq 0.1967$ approximation algorithm with or without those additional constraints, eliminating the need for a key assumption and improving on the $(1 - 1/\sqrt[4]{e})/2 \simeq 0.1106$ approximation by Fukunaga et al. [AAAI 2019].

preprint, 2022, arXiv

Individual Preference Stability for Clustering

In this paper, we propose a natural notion of individual preference (IP) stability for clustering, which asks that every data point, on average, is closer to the points in its own cluster than to the points in any other cluster. Our notion can be motivated from several perspectives, including game theory and algorithmic fairness. We study several questions related to our proposed notion. We first show that deciding whether a given data set allows for an IP-stable clustering in general is NP-hard. As a result, we explore the design of efficient algorithms for finding IP-stable clusterings in some restricted metric spaces. We present a polytime algorithm to find a clustering satisfying exact IP-stability on the real line, and an efficient algorithm to find an IP-stable 2-clustering for a tree metric. We also consider relaxing the stability constraint, i.e., every data point should not be too far from its own cluster compared to any other cluster. For this case, we provide polytime algorithms with different guarantees. We evaluate some of our algorithms and several standard clustering approaches on real data sets.
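The stability condition stated in this abstract can be checked directly for a given clustering. The following is a minimal sketch, not the paper's algorithm: the function name `is_ip_stable`, the use of Euclidean distance, and the choice to treat singleton clusters as stable are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance between two coordinate tuples (Python 3.8+)


def is_ip_stable(points, labels):
    """Check the IP-stability condition for a fixed clustering.

    points: list of coordinate tuples, e.g. [(0.0,), (1.0,)] for the real line.
    labels: list of cluster ids, one per point.
    Returns True if, for every point, the average distance to the other points
    of its own cluster is at most the average distance to each other cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # assumption: a point alone in its cluster counts as stable
        own_avg = sum(dist(points[i], points[j]) for j in own) / len(own)
        for other_lab, members in clusters.items():
            if other_lab == lab:
                continue
            other_avg = sum(dist(points[i], points[j]) for j in members) / len(members)
            if own_avg > other_avg:
                return False
    return True


if __name__ == "__main__":
    pts = [(0,), (1,), (10,), (11,)]
    print(is_ip_stable(pts, [0, 0, 1, 1]))  # natural split of the line
    print(is_ip_stable(pts, [0, 1, 0, 1]))  # interleaved split
```

On the four collinear points above, the natural split passes the check while the interleaved split fails it, since the point at 0 is then on average farther from its own cluster than from the other one.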

preprint, 2021, arXiv

Multi-transversals for Triangles and the Tuza's Conjecture

In this paper, we study a primal and dual relationship about triangles: For any graph $G$, let $ν(G)$ be the maximum number of edge-disjoint triangles in $G$, and $τ(G)$ be the minimum size of a subset $F$ of edges such that $G \setminus F$ is triangle-free. It is easy to see that $ν(G) \leq τ(G) \leq 3 ν(G)$, and in fact, this rather obvious inequality holds for a much more general primal-dual relation between $k$-hyper matching and covering in hypergraphs. Tuza conjectured in $1981$ that $τ(G) \leq 2 ν(G)$, and this question has received attention from various groups of researchers in discrete mathematics, settling various special cases such as planar graphs and generalized to bounded maximum average degree graphs, some cases of minor-free graphs, and very dense graphs. Despite these efforts, the conjecture in general graphs has remained wide open for almost four decades. In this paper, we provide a proof of a non-trivial consequence of the conjecture; that is, for every $k \geq 2$, there exists a (multi)-set $F \subseteq E(G): |F| \leq 2k ν(G)$ such that each triangle in $G$ overlaps at least $k$ elements in $F$. Our result can be seen as a strengthened statement of Krivelevich's result on the fractional version of Tuza's conjecture (and we give some examples illustrating this). The main technical ingredient of our result is a charging argument that locally identifies edges in $F$ based on a local view of the packing solution. This idea might be useful in further studying the primal-dual relations in general and Tuza's conjecture in particular.
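The inequality described in the abstract as easy to see can be spelled out in two steps. The sketch below uses the standard argument; the names $P$ (a maximum edge-disjoint triangle packing) and $F^*$ (a minimum triangle cover) are introduced here for illustration and do not appear in the abstract.

```latex
% Assumed notation: P is a maximum family of pairwise edge-disjoint triangles,
% so |P| = \nu(G); F^* is a minimum edge set meeting every triangle, so |F^*| = \tau(G).
\begin{align*}
\nu(G) &= |P| \;\le\; |F^*| = \tau(G), \\
\tau(G) &\le \Bigl|\bigcup_{T \in P} E(T)\Bigr| = 3\,|P| = 3\,\nu(G).
\end{align*}
% First line: F^* must contain an edge of every T in P, and these edges are
% distinct because the triangles in P share no edges.
% Second line: P is maximal, so every triangle of G shares an edge with some
% T in P; deleting all 3|P| edges of the packing leaves G triangle-free.
```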

preprint, 2020, arXiv

A Pairwise Fair and Community-preserving Approach to k-Center Clustering

Clustering is a foundational problem in machine learning with numerous applications. As machine learning increases in ubiquity as a backend for automated systems, concerns about fairness arise. Much of the current literature on fairness deals with discrimination against protected classes in supervised learning (group fairness). We define a different notion of fair clustering wherein the probability that two points (or a community of points) become separated is bounded by an increasing function of their pairwise distance (or community diameter). We capture the situation where data points represent people who gain some benefit from being clustered together. Unfairness arises when certain points are deterministically separated, either arbitrarily or by someone who intends to harm them as in the case of gerrymandering election districts. In response, we formally define two new types of fairness in the clustering setting, pairwise fairness and community preservation. To explore the practicality of our fairness goals, we devise an approach for extending existing $k$-center algorithms to satisfy these fairness constraints. Analysis of this approach proves that reasonable approximations can be achieved while maintaining fairness. In experiments, we compare the effectiveness of our approach to classical $k$-center algorithms/heuristics and explore the tradeoff between optimal clustering and fairness.

preprint, 2020, arXiv

An Algorithm for Multi-Attribute Diverse Matching

Bipartite b-matching, where agents on one side of a market are matched to one or more agents or items on the other, is a classical model that is used in myriad application areas such as healthcare, advertising, education, and general resource allocation. Traditionally, the primary goal of such models is to maximize a linear function of the constituent matches (e.g., linear social welfare maximization) subject to some constraints. Recent work has studied a new goal of balancing whole-match diversity and economic efficiency, where the objective is instead a monotone submodular function over the matching. Basic versions of this problem are solvable in polynomial time. In this work, we prove that the problem of simultaneously maximizing diversity along several features (e.g., country of citizenship, gender, skills) is NP-hard. To address this problem, we develop the first combinatorial algorithm that constructs provably-optimal diverse b-matchings in pseudo-polynomial time. We also provide a Mixed-Integer Quadratic formulation for the same problem and show that our method guarantees optimal solutions and takes less computation time for a reviewer assignment application.

preprint, 2016, arXiv

LP Rounding and Combinatorial Algorithms for Minimizing Active and Busy Time

We consider fundamental scheduling problems motivated by energy issues. In this framework, we are given a set of jobs, each with a release time, deadline and required processing length. The jobs need to be scheduled on a machine so that at most g jobs are active at any given time. The duration for which a machine is active (i.e., "on") is referred to as its active time. The goal is to find a feasible schedule for all jobs, minimizing the total active time. When preemption is allowed at integer time points, we show that a minimal feasible schedule already yields a 3-approximation (and this bound is tight) and we further improve this to a 2-approximation via LP rounding techniques. Our second contribution is for the non-preemptive version of this problem. However, since even asking if a feasible schedule on one machine exists is NP-hard, we allow for an unbounded number of virtual machines, each having capacity of g. This problem is known as the busy time problem in the literature and a 4-approximation is known for this problem. We develop a new combinatorial algorithm that gives a 3-approximation. Furthermore, we consider the preemptive busy time problem, giving a simple and exact greedy algorithm when unbounded parallelism is allowed, i.e., g is unbounded. For arbitrary g, this yields an algorithm that is 2-approximate.

preprint, 2016, arXiv

Minimizing Uncertainty through Sensor Placement with Angle Constraints

We study the problem of sensor placement in environments in which localization is a necessity, such as ad-hoc wireless sensor networks that allow the placement of a few anchors that know their location or sensor arrays that are tracking a target. In most of these situations, the quality of localization depends on the relative angle between the target and the pair of sensors observing it. In this paper, we consider placing a small number of sensors which ensure good angular $α$-coverage: given $α$ in $[0,π/2]$, for each target location $t$, there must be at least two sensors $s_1$ and $s_2$ such that the $\angle(s_1 t s_2)$ is in the interval $[α, π-α]$. One of the main difficulties encountered in such problems is that since the constraints depend on at least two sensors, building a solution must account for the inherent dependency between selected sensors, a feature that generic Set Cover techniques do not account for. We introduce a general framework that guarantees an angular coverage that is arbitrarily close to $α$ for any $α \leq π/3$ and apply it to a variety of problems to get bi-criteria approximations. When the angular coverage is required to be at least a constant fraction of $α$, we obtain results that are strictly better than what standard geometric Set Cover methods give. When the angular coverage is required to be at least $(1-1/δ)\cdot α$, we obtain a $\mathcal{O}(\log δ)$-approximation for sensor placement with $α$-coverage on the plane. In the presence of additional distance or visibility constraints, the framework gives a $\mathcal{O}(\log δ \cdot \log k_{OPT})$-approximation, where $k_{OPT}$ is the size of the optimal solution. We also use our framework to give a $\mathcal{O}(\log δ)$-approximation that ensures $(1-1/δ)\cdot α$-coverage and covers every target within distance $3R$.
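The $α$-coverage condition defined in this abstract is easy to verify for a single target in the plane. The sketch below is only a checker for that definition, not the paper's placement framework; the function names `angle_at` and `is_alpha_covered` are hypothetical, and sensors located exactly at the target are skipped by assumption.

```python
import math
from itertools import combinations


def angle_at(t, s1, s2):
    """Return the angle s1-t-s2 in radians, or None if a sensor coincides with t."""
    v1 = (s1[0] - t[0], s1[1] - t[1])
    v2 = (s2[0] - t[0], s2[1] - t[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return None  # assumption: a sensor sitting on the target defines no angle
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, cos_theta)))


def is_alpha_covered(target, sensors, alpha):
    """True if some pair of sensors sees the target at an angle in [alpha, pi - alpha]."""
    for s1, s2 in combinations(sensors, 2):
        theta = angle_at(target, s1, s2)
        if theta is not None and alpha <= theta <= math.pi - alpha:
            return True
    return False


if __name__ == "__main__":
    t = (0.0, 0.0)
    print(is_alpha_covered(t, [(1, 0), (0, 1)], math.pi / 3))   # perpendicular sensors
    print(is_alpha_covered(t, [(1, 0), (2, 0)], math.pi / 6))   # collinear, same side
    print(is_alpha_covered(t, [(1, 0), (-1, 0)], math.pi / 6))  # collinear, opposite sides
```

The two collinear cases fail for opposite reasons: one angle is too close to 0 and the other too close to π, which is why the interval is bounded on both sides.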

preprint, 2016, arXiv

Scheduling Distributed Clusters of Parallel Machines: Primal-Dual and LP-based Approximation Algorithms [Full Version]

The Map-Reduce computing framework rose to prominence with datasets of such size that dozens of machines on a single cluster were needed for individual jobs. As datasets approach the exabyte scale, a single job may need distributed processing not only on multiple machines, but on multiple clusters. We consider a scheduling problem to minimize weighted average completion time of N jobs on M distributed clusters of parallel machines. In keeping with the scale of the problems motivating this work, we assume that (1) each job is divided into M "subjobs" and (2) distinct subjobs of a given job may be processed concurrently. When each cluster is a single machine, this is the NP-Hard concurrent open shop problem. A clear limitation of such a model is that a serial processing assumption sidesteps the issue of how different tasks of a given subjob might be processed in parallel. Our algorithms explicitly model clusters as pools of resources and effectively overcome this issue. Under a variety of parameter settings, we develop two constant factor approximation algorithms for this problem. The first algorithm uses an LP relaxation tailored to this problem from prior work. This LP-based algorithm provides strong performance guarantees. Our second algorithm exploits a surprisingly simple mapping to the special case of one machine per cluster. This mapping-based algorithm is combinatorial and extremely fast. These are the first constant factor approximations for this problem.

preprint, 2015, arXiv

On Correcting Inputs: Inverse Optimization for Online Structured Prediction

Algorithm designers typically assume that the input data is correct, and then proceed to find "optimal" or "sub-optimal" solutions using this input data. However, this assumption of correct data does not always hold in practice, especially in the context of online learning systems where the objective is to learn appropriate feature weights given some training samples. Such scenarios necessitate the study of inverse optimization problems where one is given an input instance as well as a desired output and the task is to adjust the input data so that the given output is indeed optimal. Motivated by learning structured prediction models, in this paper we consider inverse optimization with a margin, i.e., we require the given output to be better than all other feasible outputs by a desired margin. We consider such inverse optimization problems for maximum weight matroid basis, matroid intersection, perfect matchings, minimum cost maximum flows, and shortest paths and derive the first known results for such problems with a non-zero margin. The effectiveness of these algorithmic approaches to online learning for structured prediction is also discussed.

preprint, 2013, arXiv

Analyzing the Optimal Neighborhood: Algorithms for Budgeted and Partial Connected Dominating Set Problems

We study partial and budgeted versions of the well-studied connected dominating set problem. In the partial connected dominating set problem, we are given an undirected graph G = (V,E) and an integer n', and the goal is to find a minimum subset of vertices that induces a connected subgraph of G and dominates at least n' vertices. We obtain the first polynomial time algorithm with an O(\ln Δ) approximation factor for this problem, thereby significantly extending the results of Guha and Khuller (Algorithmica, Vol. 20(4), Pages 374-387, 1998) for the connected dominating set problem. We note that none of the methods developed earlier can be applied directly to solve this problem. In the budgeted connected dominating set problem, there is a budget on the number of vertices we can select, and the goal is to dominate as many vertices as possible. We obtain a (1/13)(1 - 1/e) approximation algorithm for this problem. Finally, we show that our techniques extend to a more general setting where the profit function associated with a subset of vertices is a monotone "special" submodular function. This generalization captures the connected dominating set problem with capacities and/or weighted profits as special cases. This implies an O(\ln q) approximation (where q denotes the quota) and an O(1) approximation for the partial and budgeted versions of these problems, respectively. While the algorithms are simple, the results make a surprising use of the greedy set cover framework in defining a useful profit function.

preprint, 2013, arXiv

Data Placement and Replica Selection for Improving Co-location in Distributed Environments

Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases, and systems that support alternative frameworks like MapReduce. There is thus an increasing contention on scarce data center resources like network bandwidth; further, the energy requirements for powering the computing equipment are also growing dramatically. As we show empirically, increasing the execution parallelism by spreading out data across a large number of machines may achieve the intended goal of decreasing query latencies, but in most cases, may increase the total resource and energy consumption significantly. For many analytical workloads, however, minimizing query latencies is often not critical; in such scenarios, we argue that we should instead focus on minimizing the average query span, i.e., the average number of machines that are involved in processing of a query, through colocation of data items that are frequently accessed together. In this work, we exploit the fact that most distributed environments need to use replication for fault tolerance, and we devise workload-driven replica selection and placement algorithms that attempt to minimize the average query span. We model a historical query workload trace as a hypergraph over a set of data items, and formulate and analyze the problem of replica placement by drawing connections to several well-studied graph theoretic concepts. We develop a series of algorithms to decide which data items to replicate, and where to place the replicas. We show effectiveness of our proposed approach by presenting results on a collection of synthetic and real workloads. Our experiments show that careful data placement and replication can dramatically reduce the average query spans resulting in significant reductions in the resource consumption.

preprint, 2012, arXiv

A Model for Minimizing Active Processor Time

We introduce the following elementary scheduling problem. We are given a collection of n jobs, where each job has an integer length as well as a set Ti of time intervals in which it can be feasibly scheduled. Given a parameter B, the processor can schedule up to B jobs at a timeslot t so long as it is "active" at t. The goal is to schedule all the jobs in the fewest number of active timeslots. The machine consumes a fixed amount of energy per active timeslot, regardless of the number of jobs scheduled in that slot (as long as the number of jobs is non-zero). In other words, subject to all units of each job being scheduled in its feasible region and at each slot at most B jobs being scheduled, we are interested in minimizing the total time during which the machine is active. We present a linear time algorithm for the case where jobs are unit length and each Ti is a single interval. For general Ti, we show that the problem is NP-complete even for B = 3. However, when B = 2, we show that it can be efficiently solved. In addition, we consider a version of the problem where jobs have arbitrary lengths and can be preempted at any point in time. For general B, the problem can be solved by linear programming. For B = 2, the problem amounts to finding a triangle-free 2-matching on a special graph. We extend the algorithm of Babenko et al. to handle our variant, and also to handle non-unit length jobs. This yields an O(sqrt(L)m) time algorithm to solve the preemptive scheduling problem for B = 2, where L is the sum of the job lengths. We also show that for B = 2 and unit length jobs, the optimal non-preemptive schedule has at most 4/3 times the active time of the optimal preemptive schedule; this bound extends to several versions of the problem when jobs have arbitrary length.

preprint, 2012, arXiv

LP Rounding for k-Centers with Non-uniform Hard Capacities

In this paper we consider a generalization of the classical k-center problem with capacities. Our goal is to select k centers in a graph, and assign each node to a nearby center, so that we respect the capacity constraints on centers. The objective is to minimize the maximum distance a node has to travel to get to its assigned center. This problem is NP-hard even when centers have no capacity restrictions, and optimal factor 2 approximation algorithms are known for that case. With capacities, when all centers have identical capacities, a 6-approximation is known with no better lower bounds than for the infinite capacity version. While many generalizations and variations of this problem have been studied extensively, no progress was made on the capacitated version for a general capacity function. We develop the first constant factor approximation algorithm for this problem. Our algorithm uses an LP rounding approach to solve this problem, and works for the case of non-uniform hard capacities, when multiple copies of a node may not be chosen and can be extended to the case when there is a hard bound on the number of copies of a node that may be selected. In addition, we establish a lower bound on the integrality gap of 7 (5) for non-uniform (uniform) hard capacities. We also prove that if there is a (3-eps)-factor approximation for this problem then P=NP. Finally, for non-uniform soft capacities we present a much simpler 11-approximation algorithm, which we take as further evidence that hard capacities are much harder to deal with.
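For context on the uncapacitated baseline this abstract mentions, one classical factor 2 method is the farthest-first traversal usually attributed to Gonzalez. The sketch below shows that baseline only; it is not the paper's LP rounding algorithm and ignores capacities entirely. The function name, the choice of the first point as the initial center, and the caller-supplied `dist` function are assumptions for illustration.

```python
def farthest_first_centers(points, k, dist):
    """Greedy farthest-first traversal for uncapacitated k-center.

    points: list of items; dist: symmetric distance function obeying the
    triangle inequality. Returns (center_indices, radius), where radius is the
    largest distance from any point to its nearest chosen center.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centers = [0]  # assumption: start from the first point; any start works
    # nearest[i] = distance from point i to its closest chosen center so far
    nearest = [dist(points[0], p) for p in points]

    while len(centers) < k:
        # Pick the point currently farthest from all chosen centers.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        centers.append(nxt)
        nearest = [min(nearest[i], dist(points[nxt], points[i]))
                   for i in range(len(points))]

    return centers, max(nearest)


if __name__ == "__main__":
    pts = [0, 1, 10, 11]
    print(farthest_first_centers(pts, 2, lambda a, b: abs(a - b)))
```

Because each new center is chosen without regard to how many nodes it must serve, this baseline cannot by itself respect the hard capacities that the abstract is concerned with.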

preprint, 2002, arXiv

A Primal-Dual Parallel Approximation Technique Applied to Weighted Set and Vertex Cover

The paper describes a simple deterministic parallel/distributed (2+epsilon)-approximation algorithm for the minimum-weight vertex-cover problem and its dual (edge/element packing).
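As background for the primal-dual idea named in the title, the classical sequential scheme for weighted vertex cover raises a dual value on each uncovered edge until one endpoint becomes tight. The sketch below shows that sequential 2-approximation only; it is not the parallel/distributed (2+epsilon) algorithm the paper describes, and the function name and input format are assumptions for illustration.

```python
def primal_dual_vertex_cover(weights, edges):
    """Sequential primal-dual 2-approximation for minimum-weight vertex cover.

    weights: dict mapping vertex -> non-negative weight (integers avoid
             floating-point issues in the tightness test).
    edges:   iterable of (u, v) pairs.
    Returns the set of vertices whose weight has been fully paid for (tight).
    """
    residual = dict(weights)  # weight not yet charged to any edge
    cover = set()

    for u, v in edges:
        if u in cover or v in cover:
            continue  # edge already covered by a tight vertex
        # Raise this edge's dual value until one endpoint becomes tight.
        delta = min(residual[u], residual[v])
        residual[u] -= delta
        residual[v] -= delta
        if residual[u] == 0:
            cover.add(u)
        if residual[v] == 0:
            cover.add(v)

    return cover


if __name__ == "__main__":
    w = {"a": 1, "b": 3, "c": 1}
    e = [("a", "b"), ("b", "c")]
    print(sorted(primal_dual_vertex_cover(w, e)))
```

Each unit of dual value is charged to at most two vertices, which is where the factor 2 comes from. The loop above handles edges one at a time, so it does not illustrate how a parallel version would process many edges at once.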

preprint, 2002, arXiv

Approximating the Minimum Equivalent Digraph

The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard. This paper gives an approximation algorithm with performance guarantee of pi^2/6 ~ 1.64. The algorithm and its analysis are based on the simple idea of contracting long cycles. (This result is strengthened slightly in "On strongly connected digraphs with bounded cycle length" (1996).) The analysis applies directly to 2-Exchange, a simple "local improvement" algorithm, showing that its performance guarantee is 1.75.

preprint, 2002, arXiv

Balancing Minimum Spanning and Shortest Path Trees

This paper gives a simple linear-time algorithm that, given a weighted digraph, finds a spanning tree that simultaneously approximates a shortest-path tree and a minimum spanning tree. The algorithm provides a continuous trade-off: given the two trees and epsilon > 0, the algorithm returns a spanning tree in which the distance between any vertex and the root of the shortest-path tree is at most 1+epsilon times the shortest-path distance, and yet the total weight of the tree is at most 1+2/epsilon times the weight of a minimum spanning tree. This is the best tradeoff possible. The paper also describes a fast parallel implementation.

preprint, 2002, arXiv

Designing Multi-Commodity Flow Trees

The traditional multi-commodity flow problem assumes a given flow network in which multiple commodities are to be maximally routed in response to given demands. This paper considers the multi-commodity flow network-design problem: given a set of multi-commodity flow demands, find a network subject to certain constraints such that the commodities can be maximally routed. This paper focuses on the case when the network is required to be a tree. The main result is an approximation algorithm for the case when the tree is required to be of constant degree. The algorithm reduces the problem to the minimum-weight balanced-separator problem; the performance guarantee of the algorithm is within a factor of 4 of the performance guarantee of the balanced-separator procedure. If Leighton and Rao's balanced-separator procedure is used, the performance guarantee is O(log n). This improves the O(log^2 n) approximation factor that is trivial to obtain by a direct application of the balanced-separator method.

preprint, 2002, arXiv

Low-Degree Spanning Trees of Small Weight

The degree-d spanning tree problem asks for a minimum-weight spanning tree in which the degree of each vertex is at most d. When d=2 the problem is TSP, and in this case, the well-known Christofides algorithm provides a 1.5-approximation algorithm (assuming the edge weights satisfy the triangle inequality). In 1984, Christos Papadimitriou and Umesh Vazirani posed the challenge of finding an algorithm with performance guarantee less than 2 for Euclidean graphs (points in R^n) and d > 2. This paper gives the first answer to that challenge, presenting an algorithm to compute a degree-3 spanning tree of cost at most 5/3 times the MST. For points in the plane, the ratio improves to 3/2 and the algorithm can also find a degree-4 spanning tree of cost at most 5/4 times the MST.

preprint, 2002, arXiv

On Strongly Connected Digraphs with Bounded Cycle Length

The MEG (minimum equivalent graph) problem is, given a directed graph, to find a small subset of the edges that maintains all reachability relations between nodes. The problem is NP-hard. This paper gives a proof that, for graphs where each directed cycle has at most three edges, the MEG problem is equivalent to maximum bipartite matching, and therefore solvable in polynomial time. This leads to an improvement in the performance guarantee of the previously best approximation algorithm for the general problem in "Approximating the Minimum Equivalent Digraph" (1995).
