Source author record

Dorit S. Hochbaum

Dorit S. Hochbaum appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Discrete Mathematics, math.OC, Data Structures and Algorithms, math.CO, Computational Complexity, Computer Vision, Machine Learning, Applications, Artificial Intelligence, Computer Science and Game Theory, Information Theory, math.IT, math.ST, Networking and Internet Architecture, Statistics Theory

Catalog footprint

What is connected

12 works

15 topics

4 close collaborators

preprint 2026 arXiv

A Fast and Effective Method for Euclidean Anticlustering: The Assignment-Based-Anticlustering Algorithm

The anticlustering problem is to partition a set of objects into K equal-sized anticlusters such that the sum of distances within anticlusters is maximized. The anticlustering problem is NP-hard. We focus on anticlustering in Euclidean spaces, where the input data is tabular and each object is represented as a D-dimensional feature vector. Distances are measured as squared Euclidean distances between the respective vectors. Applications of Euclidean anticlustering include social studies, particularly in psychology, K-fold cross-validation in which each fold should be a good representative of the entire dataset, the creation of mini-batches for gradient descent in neural network training, and balanced K-cut partitioning. In particular, machine-learning applications involve million-scale datasets and very large values of K, making scalable anticlustering algorithms essential. Existing algorithms are either exact methods that can solve only small instances or heuristic methods, among which the most scalable is the exchange-based heuristic fast_anticlustering. We propose a new algorithm, the Assignment-Based Anticlustering algorithm (ABA), which scales to very large instances. A computational study shows that ABA outperforms fast_anticlustering in both solution quality and running time. Moreover, ABA scales to instances with millions of objects and hundreds of thousands of anticlusters within short running times, beyond what fast_anticlustering can handle. As a balanced K-cut partitioning method for tabular data, ABA is superior to the well-known METIS method in both solution quality and running time. The code of the ABA algorithm is available on GitHub.
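
To make the objective concrete, the following minimal Python sketch evaluates the quantity described in the abstract: the sum of squared Euclidean distances over all pairs of objects that share an anticluster. It is not the ABA algorithm itself, only an evaluator for a given assignment. The function name `within_anticluster_distance` and the toy data are illustrative assumptions, not taken from the paper or its code.

```python
from itertools import combinations


def within_anticluster_distance(points, labels):
    """Sum of squared Euclidean distances over all pairs with the same label.

    points: list of equal-length tuples (feature vectors).
    labels: list of anticluster ids, one per point.
    Brute-force O(N^2 * D); meant for illustration, not for large inputs.
    """
    total = 0.0
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        if lp == lq:
            total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total


# Hypothetical toy data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
clustered = [0, 0, 1, 1]      # nearby points share a group
anticlustered = [0, 1, 1, 0]  # each group mixes far-apart points

print(within_anticluster_distance(points, clustered))      # 2.0
print(within_anticluster_distance(points, anticlustered))  # 34.0
```

On this toy input the mixed assignment scores 34.0 against 2.0 for the clustered one, so it is the better anticlustering under this objective.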

preprint 2021 arXiv

A Graph-Theoretic Approach for Spatial Filtering and Its Impact on Mixed-type Spatial Pattern Recognition in Wafer Bin Maps

Statistical quality control in semiconductor manufacturing hinges on effective diagnostics of wafer bin maps, wherein a key challenge is to detect how defective chips tend to spatially cluster on a wafer--a problem known as spatial pattern recognition. Recently, there has been a growing interest in mixed-type spatial pattern recognition--when multiple defect patterns, of different shapes, co-exist on the same wafer. Mixed-type spatial pattern recognition entails two central tasks: (1) spatial filtering, to distinguish systematic patterns from random noise; and (2) spatial clustering, to group filtered patterns into distinct defect types. Observing that spatial filtering is instrumental to high-quality mixed-type pattern recognition, we propose to use a graph-theoretic method, called adjacency-clustering, which leverages spatial dependence among adjacent defective chips to effectively filter the raw wafer maps. Tested on real-world data and compared against a state-of-the-art approach, our proposed method achieves a gain of at least 46% in terms of internal cluster validation quality (i.e., validation without external class labels), and a gain of about 5% in terms of Normalized Mutual Information--an external cluster validation metric based on external class labels. Interestingly, the margin of improvement appears to be a function of the pattern complexity, with larger gains achieved for more complex-shaped patterns.

preprint 2021 arXiv

Joint aggregation of cardinal and ordinal evaluations with an application to a student paper competition

An important problem in decision theory concerns the aggregation of individual rankings/ratings into a collective evaluation. We illustrate a new aggregation method in the context of the 2007 MSOM's student paper competition. The aggregation problem in this competition poses two challenges. Firstly, each paper was reviewed only by a very small fraction of the judges; thus the aggregate evaluation is highly sensitive to the subjective scales chosen by the judges. Secondly, the judges provided both cardinal and ordinal evaluations (ratings and rankings) of the papers they reviewed. The contribution here is a new robust methodology that jointly aggregates ordinal and cardinal evaluations into a collective evaluation. This methodology is particularly suitable in cases of incomplete evaluations -- i.e., when the individuals evaluate only a strict subset of the objects. This approach is potentially useful in managerial decision making problems by a committee selecting projects from a large set or capital budgeting involving multiple priorities.

preprint 2020 arXiv

Algorithms and Complexity for Variants of Covariates Fine Balance

We study here several variants of the covariates fine balance problem where we generalize some of these problems and introduce a number of others. We present here a comprehensive complexity study of the covariates problems providing polynomial time algorithms, or a proof of NP-hardness. The polynomial time algorithms described are mostly combinatorial and rely on network flow techniques. In addition we present several fixed-parameter tractable results for problems where the number of covariates and the number of levels of each covariate are seen as a parameter.

preprint 2020 arXiv

Network Flow Methods for the Minimum Covariates Imbalance Problem

The problem of balancing covariates arises in observational studies where one is given a group of control samples and another group, disjoint from the control group, of treatment samples. Each sample, in either group, has several observed nominal covariates. The values, or categories, of each covariate partition the treatment and control samples into a number of subsets referred to as levels, where the samples at every level share the same covariate value. We address here the min-imbalance problem: selecting a subset of the control group so as to balance, to the best extent possible, the sizes of the levels between the treatment group and the selected subset of the control group. It is proved here that the min-imbalance problem, on two covariates, is solved efficiently with network flow techniques. We present an integer programming formulation of the problem where the constraint matrix is totally unimodular, implying that the linear programming relaxation of the problem has all basic solutions, and in particular the optimal solution, integral. This integer programming formulation is linked to a minimum cost network flow problem which is solvable in $O(n\cdot (n' + n\log n))$ steps, for $n$ the size of the treatment group and $n'$ the size of the control group. A more efficient algorithm is further devised based on an alternative, maximum flow, formulation of the two-covariate min-imbalance problem, that runs in $O(n'^{3/2}\log^2n)$ steps.
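
As a rough illustration of what is being balanced, the sketch below scores a candidate control selection by comparing level sizes with the treatment group. It assumes that imbalance is measured as the sum, over covariates and levels, of the absolute difference in level sizes; the abstract does not spell out the measure, so this is an assumption. The sketch only evaluates a given selection and does not implement the network flow algorithms. The name `imbalance` and the sample data are hypothetical.

```python
from collections import Counter


def imbalance(treatment, selected_controls):
    """Assumed imbalance measure: for every covariate and every level, add
    |#treatment samples at that level - #selected controls at that level|.

    Each sample is a tuple of nominal covariate values, for example
    ("F", "young"). All samples must have the same number of covariates.
    """
    n_covariates = len(treatment[0])
    total = 0
    for p in range(n_covariates):
        t_levels = Counter(sample[p] for sample in treatment)
        c_levels = Counter(sample[p] for sample in selected_controls)
        # Counter returns 0 for levels that are absent from one group.
        for level in set(t_levels) | set(c_levels):
            total += abs(t_levels[level] - c_levels[level])
    return total


# Hypothetical data with two covariates (sex, age group).
treatment = [("F", "young"), ("M", "old")]
selection_a = [("F", "old"), ("M", "young")]
selection_b = [("F", "young"), ("F", "old")]

print(imbalance(treatment, selection_a))  # 0
print(imbalance(treatment, selection_b))  # 2
```

Selection A matches every level size even though no selected control is identical to a treatment sample, which shows that balance is a property of the level counts, not of individual pairs.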

preprint 2020 arXiv

The Max-Cut Decision Tree: Improving on the Accuracy and Running Time of Decision Trees

Decision trees are a widely used method for classification, both by themselves and as the building blocks of multiple different ensemble learning methods. The Max-Cut decision tree involves novel modifications to a standard, baseline model of classification decision tree construction, namely CART Gini. One modification involves an alternative splitting metric, maximum cut, based on maximizing the distance between all pairs of observations belonging to separate classes and separate sides of the threshold value. The other modification is to select the decision feature from a linear combination of the input features constructed using Principal Component Analysis (PCA) locally at each node. Our experiments show that this node-based localized PCA with the novel splitting modification can dramatically improve classification, while also significantly decreasing computational time compared to the baseline decision tree. Moreover, our results are most significant when evaluated on data sets with higher dimensions or more classes; for example, on the CIFAR-100 data set the modifications yield a 49% improvement in accuracy while reducing CPU time by 94%. These modifications dramatically advance the capabilities of decision trees for difficult classification tasks.
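
The splitting metric can be sketched on a single feature as follows. This is a brute-force illustration under stated assumptions: the distance between two observations is taken to be the absolute difference of their feature values, and candidate thresholds are the observed values themselves. The PCA step is omitted, and the function name `max_cut_threshold` is hypothetical; the paper's implementation may compute the cut differently and more efficiently.

```python
def max_cut_threshold(values, classes):
    """Return (threshold, cut) maximizing the summed distance between pairs
    that differ in class AND fall on opposite sides of the threshold.

    A split sends x <= threshold to the left and x > threshold to the right.
    Distance is assumed to be |x_i - x_j|. Returns (None, -1.0) when all
    values are identical and no split exists.
    """
    best_threshold, best_cut = None, -1.0
    n = len(values)
    # The largest value is skipped: it would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        cut = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                opposite_sides = (values[i] <= t) != (values[j] <= t)
                if opposite_sides and classes[i] != classes[j]:
                    cut += abs(values[i] - values[j])
        if cut > best_cut:
            best_threshold, best_cut = t, cut
    return best_threshold, best_cut


# Hypothetical one-dimensional data with two well-separated classes.
values = [1.0, 2.0, 8.0, 9.0]
classes = ["a", "a", "b", "b"]
print(max_cut_threshold(values, classes))  # (2.0, 28.0)
```

The threshold 2.0 separates the classes cleanly, so all four cross-class pairs contribute to the cut (7 + 8 + 6 + 7 = 28), whereas the other two candidate thresholds each score 15.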

preprint 2012 arXiv

Multiflow Transmission in Delay Constrained Cooperative Wireless Networks

This paper considers the problem of energy-efficient transmission in multi-flow multihop cooperative wireless networks. Although the performance gains of cooperative approaches are well known, the combinatorial nature of these schemes makes it difficult to design efficient polynomial-time algorithms for joint routing, scheduling and power control. This becomes more so when there is more than one flow in the network. It has been conjectured by many authors in the literature that the multiflow problem in cooperative networks is NP-hard. In this paper, we formulate the problem as a combinatorial optimization problem for a general setting of $k$ flows, and formally prove that the problem is not only NP-hard but also $o(n^{1/7-ε})$ inapproximable. To our knowledge, these results provide the first such inapproximability proof in the context of multiflow cooperative wireless networks. We further prove that for the special case of $k = 1$ the solution is a simple path, and devise a polynomial time algorithm for jointly optimizing routing, scheduling and power control. We then use this algorithm to establish analytical upper and lower bounds for the optimal performance for the general case of $k$ flows. Furthermore, we propose a polynomial time heuristic for calculating the solution for the general case and evaluate the performance of this heuristic under different channel conditions and against the analytical upper and lower bounds.

preprint 2011 arXiv

Benchmark Problems for Totally Unimodular Set System Auction

We consider a generalization of the $k$-flow set system auction where the set to be procured by a customer corresponds to a feasible solution to a linear programming problem where the coefficient matrix and right-hand-side together constitute a totally unimodular matrix. Our results generalize and strengthen bounds identified for several benchmarks, which form a crucial component in the study of frugality ratios of truthful auction mechanisms.

preprint 2011 arXiv

Practical and theoretical improvements for bipartite matching using the pseudoflow algorithm

We show that the pseudoflow algorithm for maximum flow is particularly efficient for the bipartite matching problem both in theory and in practice. We develop several implementations of the pseudoflow algorithm for bipartite matching, and compare them over a wide set of benchmark instances to state-of-the-art implementations of push-relabel and augmenting path algorithms that are specifically designed to solve these problems. The experiments show that the pseudoflow variants are in most cases faster than the other algorithms. We also show that one particular implementation---the matching pseudoflow algorithm---is theoretically efficient. For a graph with $n$ nodes, $m$ arcs, $n_1$ the size of the smaller set in the bipartition, and the maximum matching value $κ \leq n_1$, the algorithm's complexity given input in the form of adjacency lists is $O(\min\{n_1κ, m\} + \sqrt{κ}\min\{κ^2, m\})$. Similar algorithmic ideas are shown to work for an adaptation of Hopcroft and Karp's bipartite matching algorithm with the same complexity. Using boolean operations on words of size $λ$, the complexity of the pseudoflow algorithm is further improved to $O(\min\{n_1κ, \frac{n_1 n_2}{λ}, m\} + κ^2 + \frac{κ^{2.5}}{λ})$. This run time is faster than for previous algorithms such as Cheriyan and Mehlhorn's algorithm of complexity $O(\frac{n^{2.5}}{λ})$.
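
For readers unfamiliar with the problem, here is a minimal sketch of a plain augmenting-path method for maximum bipartite matching, the simplest member of the family of baselines the abstract compares against. It is not the pseudoflow algorithm and does not achieve the complexity bounds stated above; it runs in O(n_1 m) time. The function name `max_bipartite_matching` and the example graph are illustrative assumptions.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum matching via repeated augmenting-path search.

    adj[u] lists the right-side nodes adjacent to left-side node u.
    Returns (matching size, match_right), where match_right[v] is the left
    node matched to right node v, or -1 if v is unmatched.
    Uses recursion, so very deep graphs may hit Python's recursion limit.
    """
    match_right = [-1] * n_right

    def try_augment(u, seen):
        # Look for a free right node reachable from u by an alternating path.
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                if match_right[v] == -1 or try_augment(match_right[v], seen):
                    match_right[v] = u
                    return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, [False] * n_right):
            size += 1
    return size, match_right


# Hypothetical graph: left node 1 can only use right node 0, so left node 0
# must be rerouted to right node 1 for all three left nodes to be matched.
adj = [[0, 1], [0], [1, 2]]
size, match_right = max_bipartite_matching(adj, n_right=3)
print(size)  # 3
```

The rerouting of left node 0 in this example is exactly the augmenting step: an existing match is undone along an alternating path so that one more node can be matched.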

preprint 2010 arXiv

Competitive Analysis of Minimum-Cut Maximum Flow Algorithms in Vision Problems

Rapid advances in image acquisition and storage technology underline the need for algorithms that are capable of solving large scale image processing and computer-vision problems. The minimum cut problem plays an important role in processing many of these imaging problems, such as image and video segmentation, stereo vision, multi-view reconstruction and surface fitting. While several min-cut/max-flow algorithms can be found in the literature, their performance in practice has been studied primarily outside the scope of computer vision. We present here the results of a comprehensive computational study, in terms of execution times and memory utilization, of four recently published algorithms, which optimally solve the s-t cut and maximum flow problems: (i) Goldberg's and Tarjan's Push-Relabel; (ii) Hochbaum's pseudoflow; (iii) Boykov's and Kolmogorov's augmenting paths; and (iv) Goldberg's partial augment-relabel. Our results demonstrate that Hochbaum's pseudoflow algorithm is faster and utilizes less memory than the other algorithms on all problem instances investigated.

preprint 2010 arXiv

Replacing spectral techniques for expander ratio, normalized cut and conductance by combinatorial flow algorithms

Several challenging problems in clustering, partitioning and imaging have traditionally been solved using the "spectral technique". These problems include the normalized cut problem, the graph expander ratio problem, the Cheeger constant problem and the conductance problem. These problems share several common features: all seek a bipartition of a set of elements; the problems are formulated as a form of ratio cut; the formulation as discrete optimization is shown here to be equivalent to a quadratic ratio, sometimes referred to as the Rayleigh ratio, on discrete variables and a single sum constraint which we call the balance or orthogonality constraint; when the discrete nature of the variables is disregarded, the continuous relaxation is solved by the spectral method. Indeed the spectral relaxation technique is a dominant method providing an approximate solution to these problems. We propose an algorithm for these problems which involves a relaxation of the orthogonality constraint only. This relaxation is shown here to be solved optimally, and in strongly polynomial time, in $O(mn \log(n^2/m))$ for a graph on $n$ nodes and $m$ edges. The algorithm, using HPF (Hochbaum's Pseudo-Flow) as a subroutine, is efficient enough to be used to solve these bi-partitioning problems on millions of elements and more than 300 million edges within less than 10 minutes. It is also demonstrated, via a preliminary experimental study, that the results of the combinatorial algorithm proposed often improve dramatically on the quality of the results of the spectral method.
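
To show what a ratio-cut objective looks like for a given bipartition, the sketch below evaluates conductance using the common textbook definition: the weight of edges crossing the cut divided by the smaller of the two sides' volumes, where volume is the sum of weighted degrees. The paper's exact formulations of its ratio problems may differ, and this evaluator is neither the proposed combinatorial algorithm nor the spectral method. The function name `conductance` and the example graph are assumptions.

```python
def conductance(edges, side):
    """Conductance of the bipartition (side, complement of side).

    edges: list of (u, v, weight) for an undirected graph.
    side: set of nodes on one side of the cut.
    Assumes both sides have positive volume; otherwise the division fails.
    """
    cut = 0.0
    vol_in = 0.0   # sum of weighted degrees of nodes inside `side`
    vol_out = 0.0  # sum of weighted degrees of nodes outside `side`
    for u, v, w in edges:
        for node in (u, v):
            if node in side:
                vol_in += w
            else:
                vol_out += w
        if (u in side) != (v in side):
            cut += w
    return cut / min(vol_in, vol_out)


# Hypothetical graph: two triangles joined by a single bridge edge (2, 3).
edges = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
    (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
    (2, 3, 1.0),
]
print(conductance(edges, {0, 1, 2}))  # 1/7, cutting only the bridge
print(conductance(edges, {0}))        # 1.0, isolating a single node
```

Cutting the bridge gives the much smaller ratio, matching the intuition that a good bipartition separates the two triangles.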

preprint 2008 arXiv

Polynomial time algorithms for bi-criteria, multi-objective and ratio problems in clustering and imaging. Part I: Normalized cut and ratio regions

Partitioning and grouping of similar objects plays a fundamental role in image segmentation and in clustering problems. In such problems a typical goal is to group together similar objects, or pixels in the case of image processing. At the same time another goal is to have each group distinctly dissimilar from the rest and possibly to have the group size fairly large. These goals are often combined as a ratio optimization problem. One example of such a problem is the normalized cut problem; another is the ratio regions problem. We devise here the first polynomial time algorithms solving these problems optimally. The algorithms are efficient and combinatorial. This contrasts with the heuristic approaches used in the image segmentation literature that formulate those problems as nonlinear optimization problems, which are then relaxed and solved with spectral techniques in real numbers. These approaches not only fail to deliver an optimal solution, but they are also computationally expensive. The algorithms presented here use as a subroutine a minimum $s,t$-cut procedure on a related graph which is of polynomial size. The output consists of the optimal solution to the respective ratio problem, as well as a sequence of nested solutions with respect to any relative weighting of the objectives of the numerator and denominator. An extension of the results here to bi-criteria and multi-criteria objective functions is presented in Part II.
