Source author record

Sungjin Im

Sungjin Im appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Data Structures and Algorithms; Machine Learning; Databases; Distributed, Parallel, and Cluster Computing; Performance

Catalog footprint

What is connected

13 works

5 topics

4 close collaborators

preprint, 2022, arXiv

Parsimonious Learning-Augmented Caching

Learning-augmented algorithms -- in which traditional algorithms are augmented with machine-learned predictions -- have emerged as a framework to go beyond worst-case analysis. The overarching goal is to design algorithms that perform near-optimally when the predictions are accurate yet retain certain worst-case guarantees irrespective of the accuracy of the predictions. This framework has been successfully applied to online problems such as caching, where the predictions can be used to alleviate uncertainties. In this paper we introduce and study the setting in which the learning-augmented algorithm can utilize the predictions parsimoniously. We consider the caching problem -- which has been extensively studied in the learning-augmented setting -- and show that one can achieve quantitatively similar results using only a sublinear number of predictions.
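To make the setting concrete, here is a minimal sketch of prediction-driven caching. It is not the paper's algorithm and it does not achieve a sublinear number of predictions. It only shows a cache that evicts the page whose predicted next request is furthest away and counts how many predictions it consumes. The names `simulate_cache`, `perfect_oracle` and the `predict_next_use` callback are hypothetical.

```python
from typing import Callable, Hashable, Sequence, Tuple


def simulate_cache(
    requests: Sequence[Hashable],
    k: int,
    predict_next_use: Callable[[int, Hashable], int],
) -> Tuple[int, int]:
    """Return (misses, prediction_queries) for a cache of size k.

    Assumption: the predictor is queried only when an eviction is needed,
    once per cached page.
    """
    cache = set()
    misses = 0
    queries = 0
    for t, page in enumerate(requests):
        if page in cache:
            continue  # hit: no prediction is consumed
        misses += 1
        if len(cache) >= k:
            predicted = {}
            for q in sorted(cache):  # sorted for deterministic tie-breaking
                predicted[q] = predict_next_use(t, q)
                queries += 1
            # Evict the page predicted to be requested furthest in the future.
            victim = max(predicted, key=predicted.get)
            cache.remove(victim)
        cache.add(page)
    return misses, queries


def perfect_oracle(requests: Sequence[Hashable]) -> Callable[[int, Hashable], int]:
    """Hypothetical predictor that knows the true next request time."""
    def predict(t: int, page: Hashable) -> int:
        for s in range(t + 1, len(requests)):
            if requests[s] == page:
                return s
        return len(requests)  # never requested again
    return predict


reqs = ["a", "b", "c", "a", "b", "d", "a"]
misses, queries = simulate_cache(reqs, 2, perfect_oracle(reqs))
print(misses, queries)
```

With a perfect predictor this eviction rule coincides with the classic furthest-in-future rule. A parsimonious algorithm would aim for comparable miss counts while keeping the number of prediction queries small.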

preprint, 2020, arXiv

A Relational Gradient Descent Algorithm For Support Vector Machine Training

We consider gradient descent like algorithms for Support Vector Machine (SVM) training when the data is in relational form. The gradient of the SVM objective cannot be efficiently computed by known techniques as it suffers from the ``subtraction problem''. We first show that the subtraction problem cannot be surmounted by showing that computing any constant approximation of the gradient of the SVM objective function is $\#P$-hard, even for acyclic joins. We, however, circumvent the subtraction problem by restricting our attention to stable instances, which intuitively are instances where a nearly optimal solution remains nearly optimal if the points are perturbed slightly. We give an efficient algorithm that computes a ``pseudo-gradient'' that guarantees convergence for stable instances at a rate comparable to that achieved by using the actual gradient. We believe that our results suggest that this sort of stability analysis would likely yield useful insight in the context of designing algorithms on relational data for other learning problems in which the subtraction problem arises.

preprint, 2020, arXiv

Approximate Aggregate Queries Under Additive Inequalities

We consider the problem of evaluating certain types of functional aggregation queries on relational data subject to additive inequalities. Such aggregation queries, with a smallish number of additive inequalities, arise naturally/commonly in many applications, particularly in learning applications. We give a relatively complete categorization of the computational complexity of such problems. We first show that the problem is NP-hard, even in the case of one additive inequality. Thus we turn to approximating the query. Our main result is an efficient algorithm for approximating, with arbitrarily small relative error, many natural aggregation queries with one additive inequality. We give examples of natural queries that can be efficiently solved using this algorithm. In contrast, we show that the situation with two additive inequalities is quite different, by showing that it is NP-hard to evaluate simple aggregation queries, with two additive inequalities, with any bounded relative error.

preprint, 2020, arXiv

Dynamic Weighted Fairness with Minimal Disruptions

In this paper, we consider the following dynamic fair allocation problem: Given a sequence of job arrivals and departures, the goal is to maintain an approximately fair allocation of the resource against a target fair allocation policy, while minimizing the total number of disruptions, which is the number of times the allocation of any job is changed. We consider a rich class of fair allocation policies that significantly generalize those considered in previous work. We first consider the models where jobs only arrive, or jobs only depart. We present tight upper and lower bounds for the number of disruptions required to maintain a constant approximate fair allocation every time step. In particular, for the canonical case where jobs have weights and the resource allocation is proportional to the job's weight, we show that maintaining a constant approximate fair allocation requires $\Theta(\log^* n)$ disruptions per job, almost matching the bounds in prior work for the unit weight case. For the more general setting where the allocation policy only decreases the allocation to a job when new jobs arrive, we show that maintaining a constant approximate fair allocation requires $\Theta(\log n)$ disruptions per job. We then consider the model where jobs can both arrive and depart. We first show strong lower bounds on the number of disruptions required to maintain constant approximate fairness for arbitrary instances. In contrast, we then show that there is an algorithm that can maintain constant approximate fairness with $O(1)$ expected disruptions per job if the weights of the jobs are independent of the jobs' arrival and departure order. We finally show how our results can be extended to the setting with multiple resources.
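The trade-off between fairness and disruptions can be illustrated with a small sketch for the weighted proportional case. It is not the paper's algorithm. The helper names are hypothetical, and "c-approximately fair" is assumed here to mean that every job receives at least a 1/c fraction of its proportional share, which may differ from the paper's exact definition.

```python
from typing import Dict


def proportional_shares(weights: Dict[str, float]) -> Dict[str, float]:
    """Target policy: each job's share is proportional to its weight."""
    total = sum(weights.values())
    return {job: w / total for job, w in weights.items()}


def is_approx_fair(alloc: Dict[str, float], weights: Dict[str, float], c: float) -> bool:
    """Assumed notion: every job gets at least 1/c of its proportional share."""
    fair = proportional_shares(weights)
    return all(alloc.get(job, 0.0) >= fair[job] / c for job in weights)


def count_disruptions(old: Dict[str, float], new: Dict[str, float]) -> int:
    """Number of already-present jobs whose allocation changed."""
    return sum(1 for job in old if job in new and old[job] != new[job])


# Two unit-weight jobs share the resource equally.
old = proportional_shares({"A": 1, "B": 1})

# A job C of weight 2 arrives.
weights = {"A": 1, "B": 1, "C": 2}
exact = proportional_shares(weights)          # changes both A and B
lazy = {"A": 0.25, "B": 0.5, "C": 0.25}       # changes only A, still 2-approximately fair

print(count_disruptions(old, exact), count_disruptions(old, lazy))
print(is_approx_fair(lazy, weights, c=2))
```

Recomputing the exact proportional allocation on every arrival disrupts every job, while tolerating a constant factor of unfairness lets some allocations stay untouched.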

preprint, 2020, arXiv

Fast Noise Removal for $k$-Means Clustering

This paper considers $k$-means clustering in the presence of noise. It is known that $k$-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called $k$-means clustering with outliers. The goal of $k$-means clustering with outliers is to discard up to a specified number $z$ of points as noise/outliers and then find a $k$-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or inherent loss in the solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any $k$-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from $k$-means with outliers to $k$-means without outliers. Secondly, we show how to construct a coreset of size $O(k \log n)$. When combined with our greedy algorithm, we obtain a scalable, near linear time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.
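The objective of $k$-means with outliers can be written down directly. The sketch below is only the cost function, not the paper's greedy preprocessing or coreset construction. For a fixed set of centers, discarding the $z$ points farthest from their nearest center minimizes the remaining cost, so that is what the hypothetical helper `kmeans_cost_with_outliers` does.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost_with_outliers(
    points: Sequence[Point], centers: Sequence[Point], z: int
) -> float:
    """Sum of squared distances to the nearest center after
    discarding the z points that are farthest from their nearest center."""
    nearest = sorted(
        min(squared_distance(p, c) for c in centers) for p in points
    )
    keep = max(len(nearest) - z, 0)
    return sum(nearest[:keep])


points = [(0, 0), (1, 0), (0, 1), (10, 10)]
centers = [(0, 0)]
print(kmeans_cost_with_outliers(points, centers, z=0))
print(kmeans_cost_with_outliers(points, centers, z=1))
```

In this toy input the single far-away point dominates the cost when `z=0`, which is the sensitivity to noise that the abstract refers to.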

preprint, 2020, arXiv

Weighted Completion Time Minimization for Unrelated Machines via Iterative Fair Contention Resolution

We give a 1.488-approximation for the classic scheduling problem of minimizing total weighted completion time on unrelated machines. This is a considerable improvement on the recent breakthrough of $(1.5 - 10^{-7})$-approximation (STOC 2016, Bansal-Srinivasan-Svensson) and the follow-up result of $(1.5 - 1/6000)$-approximation (FOCS 2017, Li). Bansal et al. introduced a novel rounding scheme yielding strong negative correlations for the first time and applied it to the scheduling problem to obtain their breakthrough, which resolved the open problem of whether one can beat the long-standing $1.5$-approximation barrier based on independent rounding. Our key technical contribution is in achieving significantly stronger negative correlations via iterative fair contention resolution, which is of independent interest. Previously, Bansal et al. obtained strong negative correlations via a variant of pipage type rounding and Li used it as a black box.
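For readers unfamiliar with the objective, the sketch below evaluates total weighted completion time for a given assignment of jobs to unrelated machines, without release times. It does not implement any rounding scheme from the paper. It relies on the standard fact that, once the assignment is fixed, ordering each machine's jobs by non-decreasing $p_{ij}/w_j$ (Smith's rule) is optimal for that machine. The function name and data layout are hypothetical.

```python
from typing import Dict, Hashable, List


def weighted_completion_time(
    assignment: Dict[Hashable, List[Hashable]],
    p: Dict[Hashable, Dict[Hashable, float]],
    w: Dict[Hashable, float],
) -> float:
    """Sum of w[j] * C[j], where each machine runs its jobs in Smith's order.

    assignment[i] lists the jobs on machine i,
    p[i][j] is the processing time of job j on machine i,
    w[j] is the (positive) weight of job j.
    """
    total = 0
    for machine, jobs in assignment.items():
        finish = 0
        for job in sorted(jobs, key=lambda j: p[machine][j] / w[j]):
            finish += p[machine][job]
            total += w[job] * finish
    return total


p = {0: {"a": 2, "b": 1, "c": 4}, 1: {"a": 3, "b": 3, "c": 1}}
w = {"a": 1, "b": 2, "c": 1}
print(weighted_completion_time({0: ["a", "b"], 1: ["c"]}, p, w))
print(weighted_completion_time({0: ["a", "b", "c"], 1: []}, p, w))
```

The hard part of the problem, and the subject of the approximation results above, is choosing the assignment. The per-machine ordering is the easy part.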

preprint, 2016, arXiv

Better Unrelated Machine Scheduling for Weighted Completion Time via Random Offsets from Non-Uniform Distributions

In this paper we consider the classic scheduling problem of minimizing total weighted completion time on unrelated machines when jobs have release times, i.e., $R | r_{ij} | \sum_j w_j C_j$ using the three-field notation. For this problem, a 2-approximation is known based on a novel convex programming relaxation (J. ACM 2001 by Skutella). It has been a long-standing open problem whether one can improve upon this 2-approximation (Open Problem 8 in J. of Sched. 1999 by Schuurman and Woeginger). We answer this question in the affirmative by giving a 1.8786-approximation. We achieve this via a surprisingly simple linear program, but a novel rounding algorithm and analysis. A key ingredient of our algorithm is the use of random offsets sampled from non-uniform distributions. We also consider the preemptive version of the problem, i.e., $R | r_{ij},pmtn | \sum_j w_j C_j$. We again use the idea of sampling offsets from non-uniform distributions to give the first better than 2-approximation for this problem. This improvement also requires use of a configuration LP with variables for each job's complete schedules along with more careful analysis. For both non-preemptive and preemptive versions, we break the approximation barrier of 2 for the first time.

preprint, 2015, arXiv

Tight Bounds for Online Vector Scheduling

Modern data centers face a key challenge of effectively serving user requests that arrive online. Such requests are inherently multi-dimensional and characterized by demand vectors over multiple resources such as processor cycles, storage space, and network bandwidth. Typically, different resources require different objectives to be optimized, and $L_r$ norms of loads are among the most popular objectives considered. To address these problems, we consider the online vector scheduling problem in this paper. Introduced by Chekuri and Khanna (SIAM J of Comp. 2006), vector scheduling is a generalization of classical load balancing, where every job has a vector load instead of a scalar load. In this paper, we resolve the online complexity of the vector scheduling problem and its important generalizations. Our main results are:
- For identical machines, we show that the optimal competitive ratio is $\Theta(\log d / \log \log d)$ by giving an online lower bound and an algorithm with an asymptotically matching competitive ratio. The lower bound is technically challenging, and is obtained via an online lower bound for the minimum mono-chromatic clique problem using a novel online coloring game and randomized coding scheme.
- For unrelated machines, we show that the optimal competitive ratio is $\Theta(\log m + \log d)$ by giving an online lower bound that matches a previously known upper bound. Unlike identical machines, however, extending these results, particularly the upper bound, to general $L_r$ norms requires new ideas. In particular, we use a carefully constructed potential function that balances the individual $L_r$ objectives with the overall (convexified) min-max objective to guide the online algorithm and track the changes in potential to bound the competitive ratio.
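The vector scheduling model on identical machines can be sketched with a simple online greedy rule. This is a baseline heuristic chosen for illustration, not the algorithm analyzed in the paper, and it carries none of the stated competitive guarantees. Each arriving job is placed on the machine that minimizes the resulting maximum load over all dimensions. The function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def greedy_vector_schedule(
    jobs: Sequence[Sequence[float]], m: int
) -> Tuple[List[int], float]:
    """Assign d-dimensional jobs online to m identical machines.

    Returns (assignment, makespan), where makespan is the largest load
    on any machine in any dimension.
    """
    d = len(jobs[0])
    loads = [[0] * d for _ in range(m)]
    assignment = []
    for vec in jobs:
        # Pick the machine whose worst dimension after adding the job is smallest;
        # ties go to the lowest-indexed machine.
        best = min(
            range(m),
            key=lambda i: max(loads[i][k] + vec[k] for k in range(d)),
        )
        for k in range(d):
            loads[best][k] += vec[k]
        assignment.append(best)
    makespan = max(max(row) for row in loads)
    return assignment, makespan


jobs = [(1, 0), (0, 1), (1, 1), (1, 0)]
print(greedy_vector_schedule(jobs, m=2))
```

With `d = 1` this reduces to classical greedy load balancing, which matches the abstract's description of vector scheduling as its generalization.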

preprint, 2014, arXiv

Competitive Algorithms from Competitive Equilibria: Non-Clairvoyant Scheduling under Polyhedral Constraints

We introduce and study a general scheduling problem that we term the Packing Scheduling Problem (PSP). In this problem, jobs can have different arrival times and sizes; a scheduler can process job $j$ at rate $x_j$, subject to arbitrary packing constraints over the set of rates ($\vec{x}$) of the outstanding jobs. The PSP framework captures a variety of scheduling problems, including the classical problems of unrelated machines scheduling, broadcast scheduling, and scheduling jobs of different parallelizability. It also captures scheduling constraints arising in diverse modern environments ranging from individual computer architectures to data centers. More concretely, PSP models multidimensional resource requirements and parallelizability, as well as network bandwidth requirements found in data center scheduling. In this paper, we design non-clairvoyant online algorithms for PSP and its special cases -- in this setting, the scheduler is unaware of the sizes of jobs. Our two main results are: 1) a constant competitive algorithm for minimizing total weighted completion time for PSP and 2) a scalable algorithm for minimizing the total flow-time on unrelated machines, which is a special case of PSP.

preprint, 2014, arXiv

SELFISHMIGRATE: A Scalable Algorithm for Non-clairvoyantly Scheduling Heterogeneous Processors

We consider the classical problem of minimizing the total weighted flow-time for unrelated machines in the online non-clairvoyant setting. In this problem, a set of jobs $J$ arrive over time to be scheduled on a set of $M$ machines. Each job $j$ has processing length $p_j$, weight $w_j$, and is processed at a rate of $\ell_{ij}$ when scheduled on machine $i$. The online scheduler knows the values of $w_j$ and $\ell_{ij}$ upon arrival of the job, but is not aware of the quantity $p_j$. We present the first online algorithm that is scalable ($(1+\epsilon)$-speed $O(\frac{1}{\epsilon^2})$-competitive for any constant $\epsilon > 0$) for the total weighted flow-time objective. No non-trivial results were known for this setting, except for the most basic case of identical machines. Our result resolves a major open problem in online scheduling theory. Moreover, we also show that no job needs more than a logarithmic number of migrations. We further extend our result and give a scalable algorithm for the objective of minimizing total weighted flow-time plus energy cost for the case of unrelated machines. The key algorithmic idea is to let jobs migrate selfishly until they converge to an equilibrium. Towards this end, we define a game where each job's utility is closely tied to the instantaneous increase in the objective the job is responsible for, and each machine declares a policy that assigns priorities to jobs based on when they migrate to it, and the execution speeds. This has a spirit similar to coordination mechanisms that attempt to achieve near optimum welfare in the presence of selfish agents (jobs). To the best of our knowledge, this is the first work that demonstrates the usefulness of ideas from coordination mechanisms and Nash equilibria for designing and analyzing online algorithms.

preprint, 2013, arXiv

Minimum Latency Submodular Cover

We study the Minimum Latency Submodular Cover problem (MLSC), which consists of a metric $(V,d)$ with source $r\in V$ and $m$ monotone submodular functions $f_1, f_2, ..., f_m: 2^V \rightarrow [0,1]$. The goal is to find a path originating at $r$ that minimizes the total cover time of all functions. This generalizes well-studied problems, such as Submodular Ranking [AzarG11] and Group Steiner Tree [GKR00]. We give a polynomial time $O(\log \frac{1}{\epsilon} \cdot \log^{2+\delta} |V|)$-approximation algorithm for MLSC, where $\epsilon>0$ is the smallest non-zero marginal increase of any $\{f_i\}_{i=1}^m$ and $\delta>0$ is any constant. We also consider the Latency Covering Steiner Tree problem (LCST), which is the special case of MLSC where the $f_i$s are multi-coverage functions. This is a common generalization of the Latency Group Steiner Tree [GuptaNR10a,ChakrabartyS11] and Generalized Min-sum Set Cover [AzarGY09, BansalGK10] problems. We obtain an $O(\log^2|V|)$-approximation algorithm for LCST. Finally we study a natural stochastic extension of the Submodular Ranking problem, and obtain an adaptive algorithm with an $O(\log 1/\epsilon)$ approximation ratio, which is best possible. This result also generalizes some previously studied stochastic optimization problems, such as Stochastic Set Cover [GoemansV06] and Shared Filter Evaluation [MunagalaSW07, LiuPRY08].

preprint, 2013, arXiv

Optimizing Maximum Flow Time and Maximum Throughput in Broadcast Scheduling

We consider the pull-based broadcast scheduling model. In this model, there are $n$ unit-sized pages of information available at the server. Requests arrive over time at the server asking for a specific page. When the server transmits a page, all outstanding requests for the page are simultaneously satisfied, and this is what distinguishes broadcast scheduling from the standard scheduling setting where each job must be processed separately by the server. Broadcast scheduling has received a considerable amount of attention due to the algorithmic challenges that it poses in addition to its applications in multicast systems and wireless and LAN networks. In this paper, we give the following new approximation results for two popular objectives:
- For the objective of minimizing the maximum flow time, we give the first PTAS. Previously, it was known that the algorithm First-In-First-Out (FIFO) is a 2-approximation, and it is tight. It has been suggested as an open problem to obtain a better approximation.
- For the objective of maximizing the throughput, we give a 0.7759-approximation which improves upon the previous best known 0.75-approximation.
Our improved results are enabled by our novel rounding schemes and linear programming, which can effectively reduce congestion in the schedule, which is often the main bottleneck in designing scheduling algorithms based on linear programming. We believe that our algorithmic ideas and techniques could be of potential use for other scheduling problems.
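The broadcast model and the FIFO baseline mentioned above can be simulated in a few lines. This is a sketch of the model only, not of the paper's PTAS or rounding schemes. It assumes integer time slots, one page broadcast per slot, and that a request arriving at time $t$ can be served in the slot starting at $t$ and then completes at $t+1$. The function name is hypothetical.

```python
from typing import Hashable, List, Sequence, Tuple


def fifo_broadcast_max_flow(requests: Sequence[Tuple[int, Hashable]]) -> int:
    """Simulate FIFO broadcast and return the maximum flow time.

    requests: (arrival_time, page) pairs with integer arrival times.
    In each slot, FIFO broadcasts the page of the earliest outstanding
    request; every outstanding request for that page is satisfied at once.
    """
    pending = sorted(requests, key=lambda r: r[0])
    outstanding: List[Tuple[int, Hashable]] = []
    t = 0
    idx = 0
    max_flow = 0
    while idx < len(pending) or outstanding:
        # Admit all requests that have arrived by time t.
        while idx < len(pending) and pending[idx][0] <= t:
            outstanding.append(pending[idx])
            idx += 1
        if outstanding:
            page = outstanding[0][1]
            served = [r for r in outstanding if r[1] == page]
            outstanding = [r for r in outstanding if r[1] != page]
            max_flow = max(max_flow, max(t + 1 - arrival for arrival, _ in served))
        t += 1
    return max_flow


print(fifo_broadcast_max_flow([(0, "A"), (0, "B"), (1, "A"), (1, "C")]))
print(fifo_broadcast_max_flow([(0, "A"), (0, "A"), (0, "A")]))
```

The second call shows the feature that separates broadcast scheduling from standard scheduling: three requests for the same page are all satisfied by a single transmission.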

preprint, 2011, arXiv

Fast Clustering using MapReduce

Clustering problems have numerous applications and are becoming more challenging as the size of the data increases. In this paper, we consider designing clustering algorithms that can be used in MapReduce, the most popular programming environment for processing large datasets. We focus on the practical and popular clustering problems, $k$-center and $k$-median. We develop fast clustering algorithms with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis that shows several clustering algorithms are in $\mathcal{MRC}^0$, a theoretical MapReduce class introduced by Karloff et al. [KarloffSV10]. Our algorithms use sampling to decrease the data size and they run a time-consuming clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms have sufficient flexibility to be used in practice since they run in a constant number of MapReduce rounds. We complement these results by performing experiments using our algorithms. We compare the empirical performance of our algorithms to several sequential and parallel algorithms for the $k$-median problem. The experiments show that our algorithms' solutions are similar to or better than the other algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than the other parallel algorithms that we tested.
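The "sample, then cluster the sample" idea can be sketched on a single machine. This is a simplification under stated assumptions, not the paper's MapReduce algorithm: it draws one uniform sample instead of the paper's sampling procedure, and it swaps in the standard farthest-first traversal for $k$-center as the clustering step. The names `farthest_first` and `sample_then_cluster` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points: Sequence[Point], k: int) -> List[Point]:
    """Farthest-first traversal for k-center: start from the first point,
    then repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        nxt = max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        centers.append(nxt)
    return centers


def sample_then_cluster(
    points: Sequence[Point], k: int, sample_size: int, seed: int = 0
) -> List[Point]:
    """Shrink the input by uniform sampling, then cluster only the sample."""
    rng = random.Random(seed)
    sample = rng.sample(list(points), min(sample_size, len(points)))
    return farthest_first(sample, k)


points = [(0, 0), (1, 0), (10, 0), (11, 0)]
print(farthest_first(points, 2))
print(sample_then_cluster(points, k=2, sample_size=3))
```

The expensive step only ever sees the sample, which is what makes this pattern attractive when the full data set is spread across many machines.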
