Researcher profile

Dorit S. Hochbaum

Dorit S. Hochbaum contributes to research discovery and scholarly infrastructure.

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
9 works
0 followers
13 topics
4 close collaborators

Published work

9 published item(s)

preprint, 2026, arXiv

A Fast and Effective Method for Euclidean Anticlustering: The Assignment-Based-Anticlustering Algorithm

The anticlustering problem is to partition a set of objects into K equal-sized anticlusters such that the sum of distances within anticlusters is maximized. The anticlustering problem is NP-hard. We focus on anticlustering in Euclidean spaces, where the input data is tabular and each object is represented as a D-dimensional feature vector. Distances are measured as squared Euclidean distances between the respective vectors. Applications of Euclidean anticlustering include social studies, particularly in psychology, K-fold cross-validation in which each fold should be a good representative of the entire dataset, the creation of mini-batches for gradient descent in neural network training, and balanced K-cut partitioning. In particular, machine-learning applications involve million-scale datasets and very large values of K, making scalable anticlustering algorithms essential. Existing algorithms are either exact methods that can solve only small instances or heuristic methods, among which the most scalable is the exchange-based heuristic fast_anticlustering. We propose a new algorithm, the Assignment-Based Anticlustering algorithm (ABA), which scales to very large instances. A computational study shows that ABA outperforms fast_anticlustering in both solution quality and running time. Moreover, ABA scales to instances with millions of objects and hundreds of thousands of anticlusters within short running times, beyond what fast_anticlustering can handle. As a balanced K-cut partitioning method for tabular data, ABA is superior to the well-known METIS method in both solution quality and running time. The code of the ABA algorithm is available on GitHub.
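To make the objective in this abstract concrete, the sketch below evaluates it for a given partition: the sum of squared Euclidean distances between all pairs of objects that share an anticluster. It is not the ABA algorithm, only a way to score a candidate assignment; the function names and the toy data are assumptions made for illustration.

```python
from itertools import combinations


def squared_distance(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def anticlustering_objective(points, labels):
    """Sum of squared distances over all pairs sharing an anticluster label."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return sum(
        squared_distance(x, y)
        for members in groups.values()
        for x, y in combinations(members, 2)
    )


# Toy data (hypothetical): four objects on a line, K = 2 anticlusters of size 2.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
mixed = [0, 1, 0, 1]    # each anticluster spans both ends of the line
clumped = [0, 0, 1, 1]  # each anticluster holds only near neighbours

print(anticlustering_objective(points, mixed))    # 200.0
print(anticlustering_objective(points, clumped))  # 2.0
```

The mixed assignment scores higher, which matches the goal of making every anticluster a good representative of the whole dataset.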

preprint, 2021, arXiv

A Graph-Theoretic Approach for Spatial Filtering and Its Impact on Mixed-type Spatial Pattern Recognition in Wafer Bin Maps

Statistical quality control in semiconductor manufacturing hinges on effective diagnostics of wafer bin maps, wherein a key challenge is to detect how defective chips tend to spatially cluster on a wafer--a problem known as spatial pattern recognition. Recently, there has been a growing interest in mixed-type spatial pattern recognition--when multiple defect patterns, of different shapes, co-exist on the same wafer. Mixed-type spatial pattern recognition entails two central tasks: (1) spatial filtering, to distinguish systematic patterns from random noise; and (2) spatial clustering, to group filtered patterns into distinct defect types. Observing that spatial filtering is instrumental to high-quality mixed-type pattern recognition, we propose to use a graph-theoretic method, called adjacency-clustering, which leverages spatial dependence among adjacent defective chips to effectively filter the raw wafer maps. Tested on real-world data and compared against a state-of-the-art approach, our proposed method achieves a gain of at least 46% in terms of internal cluster validation quality (i.e., validation without external class labels), and a gain of about 5% in terms of Normalized Mutual Information--an external cluster validation metric based on external class labels. Interestingly, the margin of improvement appears to be a function of the pattern complexity, with larger gains achieved for more complex-shaped patterns.

preprint, 2021, arXiv

Joint aggregation of cardinal and ordinal evaluations with an application to a student paper competition

An important problem in decision theory concerns the aggregation of individual rankings/ratings into a collective evaluation. We illustrate a new aggregation method in the context of the 2007 MSOM's student paper competition. The aggregation problem in this competition poses two challenges. Firstly, each paper was reviewed only by a very small fraction of the judges; thus the aggregate evaluation is highly sensitive to the subjective scales chosen by the judges. Secondly, the judges provided both cardinal and ordinal evaluations (ratings and rankings) of the papers they reviewed. The contribution here is a new robust methodology that jointly aggregates ordinal and cardinal evaluations into a collective evaluation. This methodology is particularly suitable in cases of incomplete evaluations -- i.e., when the individuals evaluate only a strict subset of the objects. This approach is potentially useful in managerial decision making problems by a committee selecting projects from a large set or capital budgeting involving multiple priorities.

preprint, 2020, arXiv

Algorithms and Complexity for Variants of Covariates Fine Balance

We study here several variants of the covariates fine balance problem where we generalize some of these problems and introduce a number of others. We present here a comprehensive complexity study of the covariates problems providing polynomial time algorithms, or a proof of NP-hardness. The polynomial time algorithms described are mostly combinatorial and rely on network flow techniques. In addition we present several fixed-parameter tractable results for problems where the number of covariates and the number of levels of each covariate are seen as a parameter.

preprint, 2020, arXiv

Network Flow Methods for the Minimum Covariates Imbalance Problem

The problem of balancing covariates arises in observational studies where one is given a group of control samples and another group, disjoint from the control group, of treatment samples. Each sample, in either group, has several observed nominal covariates. The values, or categories, of each covariate partition the treatment and control samples into a number of subsets referred to as levels, where the samples at every level share the same covariate value. We address here the problem of selecting a subset of the control group so as to balance, to the best extent possible, the sizes of the levels between the treatment group and the selected subset of the control group, the min-imbalance problem. It is proved here that the min-imbalance problem, on two covariates, is solved efficiently with network flow techniques. We present an integer programming formulation of the problem where the constraint matrix is totally unimodular, implying that the linear programming relaxation of the problem has all basic solutions, and in particular the optimal solution, integral. This integer programming formulation is linked to a minimum cost network flow problem which is solvable in $O(n\cdot (n' + n\log n))$ steps, for $n$ the size of the treatment group and $n'$ the size of the control group. A more efficient algorithm is further devised based on an alternative, maximum flow, formulation of the two-covariate min-imbalance problem, that runs in $O(n'^{3/2}\log^2n)$ steps.
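As a rough illustration of what is being balanced, the sketch below scores a candidate selection of controls by summing, over every covariate and every level, the absolute difference between the treatment count and the selected-control count. The abstract does not spell out the exact imbalance measure, so this absolute-difference form is an assumption, as are the function name and the toy samples. The code only evaluates a selection; it does not implement the network flow algorithms described above.

```python
from collections import Counter


def imbalance(treatment, selected_controls):
    """Total level-size mismatch between treatment and selected controls.

    Each sample is a tuple of nominal covariate values. The measure used
    here (sum of absolute count differences) is an illustrative assumption.
    """
    num_covariates = len(treatment[0])
    total = 0
    for p in range(num_covariates):
        t_counts = Counter(sample[p] for sample in treatment)
        c_counts = Counter(sample[p] for sample in selected_controls)
        # Counter returns 0 for levels absent from one of the groups.
        total += sum(
            abs(t_counts[level] - c_counts[level])
            for level in set(t_counts) | set(c_counts)
        )
    return total


# Hypothetical samples with two covariates: (sex, age band).
treatment = [("F", "young"), ("M", "old")]
balanced_selection = [("F", "old"), ("M", "young")]
skewed_selection = [("F", "young"), ("F", "old")]

print(imbalance(treatment, balanced_selection))  # 0
print(imbalance(treatment, skewed_selection))    # 2
```

Note that the balanced selection matches the treatment group level by level on each covariate without matching any individual sample exactly, which is the sense in which level sizes, rather than samples, are balanced.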

preprint, 2020, arXiv

The Max-Cut Decision Tree: Improving on the Accuracy and Running Time of Decision Trees

Decision trees are a widely used method for classification, both by themselves and as the building blocks of multiple different ensemble learning methods. The Max-Cut decision tree involves novel modifications to a standard, baseline model of classification decision tree construction, namely CART Gini. One modification involves an alternative splitting metric, maximum cut, based on maximizing the distance between all pairs of observations belonging to separate classes and separate sides of the threshold value. The other modification is to select the decision feature from a linear combination of the input features constructed using Principal Component Analysis (PCA) locally at each node. Our experiments show that this node-based localized PCA with the novel splitting modification can dramatically improve classification, while also significantly decreasing computational time compared to the baseline decision tree. Moreover, our results are most significant when evaluated on data sets with higher dimensions or more classes; for the example data set CIFAR-100, they enable a 49% improvement in accuracy while reducing CPU time by 94%. These introduced modifications dramatically advance the capabilities of decision trees for difficult classification tasks.

preprint, 2011, arXiv

Benchmark Problems for Totally Unimodular Set System Auction

We consider a generalization of the $k$-flow set system auction where the set to be procured by a customer corresponds to a feasible solution to a linear programming problem where the coefficient matrix and right-hand-side together constitute a totally unimodular matrix. Our results generalize and strengthen bounds identified for several benchmarks, which form a crucial component in the study of frugality ratios of truthful auction mechanisms.

preprint, 2011, arXiv

Practical and theoretical improvements for bipartite matching using the pseudoflow algorithm

We show that the pseudoflow algorithm for maximum flow is particularly efficient for the bipartite matching problem both in theory and in practice. We develop several implementations of the pseudoflow algorithm for bipartite matching, and compare them over a wide set of benchmark instances to state-of-the-art implementations of push-relabel and augmenting path algorithms that are specifically designed to solve these problems. The experiments show that the pseudoflow variants are in most cases faster than the other algorithms. We also show that one particular implementation, the matching pseudoflow algorithm, is theoretically efficient. For a graph with $n$ nodes, $m$ arcs, $n_1$ the size of the smaller set in the bipartition, and the maximum matching value $\kappa \leq n_1$, the algorithm's complexity given input in the form of adjacency lists is $O(\min\{n_1\kappa, m\} + \sqrt{\kappa}\min\{\kappa^2, m\})$. Similar algorithmic ideas are shown to work for an adaptation of Hopcroft and Karp's bipartite matching algorithm with the same complexity. Using boolean operations on words of size $\lambda$, the complexity of the pseudoflow algorithm is further improved to $O(\min\{n_1\kappa, \frac{n_1 n_2}{\lambda}, m\} + \kappa^2 + \frac{\kappa^{2.5}}{\lambda})$. This run time is faster than for previous algorithms such as Cheriyan and Mehlhorn's algorithm of complexity $O(\frac{n^{2.5}}{\lambda})$.
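For readers unfamiliar with the problem being solved, here is a minimal sketch of maximum bipartite matching using the textbook augmenting-path method. This is the simple baseline family the abstract compares against, not the pseudoflow algorithm, and it does not achieve the complexity bounds stated above. The function name, the adjacency-list layout and the toy graph are assumptions for illustration.

```python
def maximum_bipartite_matching(adjacency, num_right):
    """Augmenting-path matching on a bipartite graph.

    adjacency[u] lists the right-side nodes adjacent to left-side node u.
    Returns the matching size and, for each right node, its matched left
    node (or None if unmatched).
    """
    match_of_right = [None] * num_right

    def try_augment(u, visited):
        # Depth-first search for an augmenting path starting at left node u.
        for v in adjacency[u]:
            if v in visited:
                continue
            visited.add(v)
            if match_of_right[v] is None or try_augment(match_of_right[v], visited):
                match_of_right[v] = u
                return True
        return False

    size = sum(try_augment(u, set()) for u in range(len(adjacency)))
    return size, match_of_right


# Hypothetical graph: 3 left nodes, 3 right nodes, given as adjacency lists.
adjacency = [[0, 1], [0], [1, 2]]
size, match_of_right = maximum_bipartite_matching(adjacency, num_right=3)
print(size)            # 3
print(match_of_right)  # [1, 0, 2]
```

In the example, left node 1 can only use right node 0, so the search reroutes left node 0 to right node 1, which is the augmenting step in miniature.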

preprint, 2008, arXiv

Polynomial time algorithms for bi-criteria, multi-objective and ratio problems in clustering and imaging. Part I: Normalized cut and ratio regions

Partitioning and grouping of similar objects play a fundamental role in image segmentation and in clustering problems. In such problems a typical goal is to group together similar objects, or pixels in the case of image processing. At the same time another goal is to have each group distinctly dissimilar from the rest and possibly to have the group size fairly large. These goals are often combined as a ratio optimization problem. One example of such a problem is the normalized cut problem, another is the ratio regions problem. We devise here the first polynomial time algorithms solving these problems optimally. The algorithms are efficient and combinatorial. This contrasts with the heuristic approaches used in the image segmentation literature that formulate those problems as nonlinear optimization problems, which are then relaxed and solved with spectral techniques in real numbers. These approaches not only fail to deliver an optimal solution, but they are also computationally expensive. The algorithms presented here use as a subroutine a minimum $s,t$-cut procedure on a related graph which is of polynomial size. The output consists of the optimal solution to the respective ratio problem, as well as a sequence of nested solutions with respect to any relative weighting of the objectives of the numerator and denominator. An extension of the results here to bi-criteria and multi-criteria objective functions is presented in part II.