Source author record

Suresh Venkatasubramanian

Suresh Venkatasubramanian appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Machine Learning; Computational Geometry; cs.CY; Data Structures and Algorithms; Computational Complexity; Computer Vision; Databases; Neural and Evolutionary Computing; Applications; Artificial Intelligence; Discrete Mathematics; Distributed, Parallel, and Cluster Computing; General Literature; Information Theory; math.IT; Networking and Internet Architecture; Social and Information Networks

Catalog footprint

What is connected

31 works

17 topics

4 close collaborators

preprint 2026 arXiv

Bridging Prediction and Intervention Problems in Social Systems

Many automated decision systems (ADS) are designed to solve prediction problems -- where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an intervention-oriented paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.

preprint 2026 arXiv

The Commodification of AI Sovereignty: Lessons from the Fight for Sovereign Oil

"Sovereignty" is increasingly a part of national AI policies and strategies. At the same time that "sovereignty" is invoked as a priority for global AI policy, it is also being commodified along the AI stack. Companies now sell "sovereign" AI factories, clouds, and language models to governments, enterprises, and communities -- turning a contested value into a commercial commodity. This shift risks allowing private technology providers to define sovereignty on their own terms. By analyzing the history of sovereignty and parallels in global oil production, this paper aims to open avenues to interrogate the implications of this value's commercialization. The contributions of this paper lie in a disentangling of the facets of sovereignty being appealed to through the AI stack and a case for how analogizing oil and AI can be generative in thinking through what is achieved and what can be achieved through the commodification of AI sovereignty.

preprint 2022 arXiv

It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks

Risk assessment instrument (RAI) datasets, particularly ProPublica's COMPAS dataset, are commonly used in algorithmic fairness papers due to benchmarking practices of comparing algorithms on datasets used in prior work. In many cases, this data is used as a benchmark to demonstrate good performance without accounting for the complexities of criminal justice (CJ) processes. However, we show that pretrial RAI datasets can contain numerous measurement biases and errors, and due to disparities in discretion and deployment, algorithmic fairness applied to RAI datasets is limited in making claims about real-world outcomes. These reasons make the datasets a poor fit for benchmarking under assumptions of ground truth and real-world impact. Furthermore, conventional practices of simply replicating previous data experiments may implicitly inherit or edify normative positions without explicitly interrogating value-laden assumptions. Without context of how interdisciplinary fields have engaged in CJ research and context of how RAIs operate upstream and downstream, algorithmic fairness practices are misaligned for meaningful contribution in the context of CJ, and would benefit from transparent engagement with normative considerations and values related to fairness, justice, and equality. These factors prompt questions about whether benchmarks for intrinsically socio-technical systems like the CJ system can exist in a beneficial and ethical way.

preprint 2022 arXiv

Measuring and mitigating voting access disparities: a study of race and polling locations in Florida and North Carolina

Voter suppression and associated racial disparities in access to voting are long-standing civil rights concerns in the United States. Barriers to voting have taken many forms over the decades. A history of violent explicit discouragement has shifted to more subtle access limitations that can include long lines and wait times, long travel times to reach a polling station, and other logistical barriers to voting. Our focus in this work is on quantifying disparities in voting access pertaining to the overall time-to-vote, and how they could be remedied via a better choice of polling location or provisioning more sites where voters can cast ballots. However, appropriately calibrating access disparities is difficult because of the need to account for factors such as population density and different community expectations for reasonable travel times. In this paper, we quantify access to polling locations, developing a methodology for the calibrated measurement of racial disparities in polling location "load" and distance to polling locations. We apply this methodology to a study of real-world data from Florida and North Carolina to identify disparities in voting access from the 2020 election. We also introduce algorithms, with modifications to handle scale, that can reduce these disparities by suggesting new polling locations from a given list of identified public locations (including schools and libraries). Applying these algorithms on the 2020 election location data also helps to expose and explore tradeoffs between the cost of allocating more polling locations and the potential impact on access disparities. The developed voting access measurement methodology and algorithmic remediation technique is a first step in better polling location assignment.

preprint 2021 arXiv

A Research Ecosystem for Secure Computing

Computing devices are vital to all areas of modern life and permeate every aspect of our society. The ubiquity of computing and our reliance on it has been accelerated and amplified by the COVID-19 pandemic. From education to work environments to healthcare to defense to entertainment - it is hard to imagine a segment of modern life that is not touched by computing. The security of computers, systems, and applications has been an active area of research in computer science for decades. However, with the confluence of both the scale of interconnected systems and increased adoption of artificial intelligence, there are many research challenges the community must face so that our society can continue to benefit and risks are minimized, not multiplied. Those challenges range from security and trust of the information ecosystem to adversarial artificial intelligence and machine learning. Along with basic research challenges, more often than not, securing a system happens after the design or even deployment, meaning the security community is routinely playing catch-up and attempting to patch vulnerabilities that could be exploited any minute. While security measures such as encryption and authentication have been widely adopted, questions of security tend to be secondary to application capability. There needs to be a sea-change in the way we approach this critically important aspect of the problem: new incentives and education are at the core of this change. Now is the time to refocus research community efforts on developing interconnected technologies with security "baked in by design" and creating an ecosystem that ensures adoption of promising research developments. To realize this vision, two additional elements of the ecosystem are necessary - proper incentive structures for adoption and an educated citizenry that is well versed in vulnerabilities and risks.

preprint 2021 arXiv

Fair clustering via equitable group representations

What does it mean for a clustering to be fair? One popular approach seeks to ensure that each cluster contains groups in (roughly) the same proportion in which they exist in the population. The normative principle at play is balance: any cluster might act as a representative of the data, and thus should reflect its diversity. But clustering also captures a different form of representativeness. A core principle in most clustering problems is that a cluster center should be representative of the cluster it represents, by being "close" to the points associated with it. This is so that we can effectively replace the points by their cluster centers without significant loss in fidelity, and indeed is a common "use case" for clustering. For such a clustering to be fair, the centers should "represent" different groups equally well. We call such a clustering a group-representative clustering. In this paper, we study the structure and computation of group-representative clusterings. We show that this notion naturally parallels the development of fairness notions in classification, with direct analogs of ideas like demographic parity and equal opportunity. We demonstrate how these notions are distinct from and cannot be captured by balance-based notions of fairness. We present approximation algorithms for group representative $k$-median clustering and couple this with an empirical evaluation on various real-world data sets.
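One way to make "centers should represent different groups equally well" concrete is to compare, per group, the average distance from each point to its nearest center. The sketch below is only an illustration of that reading: the function name `group_costs`, the toy points, and the choice of average Euclidean distance are assumptions, not the objective or algorithm from the paper.

```python
from collections import defaultdict
from math import dist


def group_costs(points, groups, centers):
    """Average distance from each group's points to their nearest center.

    Hypothetical helper: the cost definition is an assumption made for
    illustration, not the paper's formal objective.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for point, group in zip(points, groups):
        sums[group] += min(dist(point, c) for c in centers)
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}


# Toy data: two points from group A near the origin, one point from group B far away.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
groups = ["A", "A", "B"]

# A single center placed between the A points serves group A well and group B poorly.
one_center = group_costs(points, groups, centers=[(1.0, 0.0)])

# Adding a center at the B point closes the gap for group B.
two_centers = group_costs(points, groups, centers=[(1.0, 0.0), (10.0, 0.0)])
```

With one center the per-group costs differ widely; comparing such per-group costs is one simple diagnostic for how evenly the centers represent the groups.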

preprint 2020 arXiv

Evolving Methods for Evaluating and Disseminating Computing Research

Social and technical trends have significantly changed methods for evaluating and disseminating computing research. Traditional venues for reviewing and publishing, such as conferences and journals, worked effectively in the past. Recently, trends have created new opportunities but also put new pressures on the process of review and dissemination. For example, many conferences have seen large increases in the number of submissions. Likewise, dissemination of research ideas has changed dramatically through publication venues such as arXiv.org and social media networks. While these trends predate COVID-19, the pandemic could accelerate longer term changes. Based on interviews with leading academics in computing research, our findings include: (1) Trends impacting computing research are largely positive and have increased the participation, scope, accessibility, and speed of the research process. (2) Challenges remain in securing the integrity of the process, including addressing ways to scale the review process, avoiding attempts to misinform or confuse the dissemination of results, and ensuring fairness and broad participation in the process itself. Based on these findings, we recommend: (1) Regularly polling members of the computing research community, including program and general conference chairs, journal editors, authors, reviewers, etc., to identify specific challenges they face to better understand these issues. (2) An influential body, such as the Computing Research Association, regularly issues a "State of the Computing Research Enterprise" report to update the community on trends, both positive and negative, impacting the computing research enterprise. (3) A deeper investigation, specifically to better understand the influence that social media and preprint archives have on computing research, is conducted.

preprint 2020 arXiv

Problems with Shapley-value-based explanations as feature importance measures

Game-theoretic formulations of feature importance have become popular as a way to "explain" machine learning models. These methods define a cooperative game between the features of a model and distribute influence among these input elements using some form of the game's unique Shapley values. Justification for these methods rests on two pillars: their desirable mathematical properties, and their applicability to specific motivations for explanations. We show that mathematical problems arise when Shapley values are used for feature importance and that the solutions to mitigate these necessarily induce further complexity, such as the need for causal reasoning. We also draw on additional literature to argue that Shapley values do not provide explanations which suit human-centric goals of explainability.
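For readers unfamiliar with the underlying quantity, the following minimal sketch computes exact Shapley values for a tiny cooperative game by averaging each player's marginal contribution over all orderings. The function name `shapley_values` and the two-player payoff table are hypothetical and chosen only for illustration; they do not come from the paper, and this brute-force enumeration is practical only for a handful of players.

```python
from fractions import Fraction
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by enumerating every ordering of the players.

    `value` maps a frozenset of players to the payoff of that coalition.
    """
    players = list(players)
    totals = {p: Fraction(0) for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += Fraction(value(with_p)) - Fraction(value(coalition))
            coalition = with_p
    return {p: totals[p] / len(orders) for p in players}


# Hypothetical payoff table: the pair is worth more than the sum of its parts.
PAYOFF = {
    frozenset(): 0,
    frozenset({"a"}): 1,
    frozenset({"b"}): 2,
    frozenset({"a", "b"}): 4,
}

phi = shapley_values(["a", "b"], PAYOFF.__getitem__)
```

In this toy game the extra unit of value created by cooperation is split evenly, and the values sum to the payoff of the full coalition. How to define the payoff function for a machine learning model is exactly where the difficulties discussed in the abstract arise.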

preprint 2016 arXiv

A Unified View of Localized Kernel Learning

Multiple Kernel Learning, or MKL, extends (kernelized) SVM by attempting to learn not only a classifier/regressor but also the best kernel for the training task, usually from a combination of existing kernel functions. Most MKL methods seek the combined kernel that performs best over every training example, sacrificing performance in some areas to seek a global optimum. Localized kernel learning (LKL) overcomes this limitation by allowing the training algorithm to match a component kernel to the examples that can exploit it best. Several approaches to the localized kernel learning problem have been explored in the last several years. We unify many of these approaches under one simple system and design a new algorithm with improved performance. We also develop enhanced versions of existing algorithms, with an eye on scalability and performance.

preprint 2016 arXiv

Auditing Black-box Models for Indirect Influence

Data-trained predictive models see widespread use, but for the most part they are used as black boxes which output a prediction or score. It is therefore hard to acquire a deeper understanding of model behavior, and in particular how different features influence the model prediction. This is important when interpreting the behavior of complex models, or asserting that certain problematic attributes (like race or gender) are not unduly influencing decisions. In this paper, we present a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the dataset, without knowing how the models work. Our work focuses on the problem of indirect influence: how some features might indirectly influence outcomes via other, related features. As a result, we can find attribute influences even in cases where, upon further direct examination of the model, the attribute is not referred to by the model at all. Our approach does not require the black-box model to be retrained. This is important if (for example) the model is only accessible via an API, and contrasts our work with other methods that investigate feature influence like feature selection. We present experimental evidence for the effectiveness of our procedure using a variety of publicly available datasets and models. We also validate our procedure using techniques from interpretable learning and feature selection, as well as against other black-box auditing procedures.

preprint 2016 arXiv

On the (im)possibility of fairness

What does it mean for an algorithm to be fair? Different papers use different notions of algorithmic fairness, and although these appear internally consistent, they also seem mutually incompatible. We present a mathematical setting in which the distinctions in previous papers can be made formal. In addition to characterizing the spaces of inputs (the "observed" space) and outputs (the "decision" space), we introduce the notion of a construct space: a space that captures unobservable, but meaningful variables for the prediction. We show that in order to prove desirable properties of the entire decision-making process, different mechanisms for fairness require different assumptions about the nature of the mapping from construct space to decision space. The results in this paper imply that future treatments of algorithmic fairness should more explicitly state assumptions about the relationship between constructs and observations.

preprint 2015 arXiv

A directed isoperimetric inequality with application to Bregman near neighbor lower bounds

Bregman divergences $D_\phi$ are a class of divergences parametrized by a convex function $\phi$ and include well known distance functions like $\ell_2^2$ and the Kullback-Leibler divergence. There has been extensive research on algorithms for problems like clustering and near neighbor search with respect to Bregman divergences. In all cases, the algorithms depend not just on the data size $n$ and dimensionality $d$, but also on a structure constant $\mu \ge 1$ that depends solely on $\phi$ and can grow without bound independently. In this paper, we provide the first evidence that this dependence on $\mu$ might be intrinsic. We focus on the problem of approximate near neighbor search for Bregman divergences. We show that under the cell probe model, any non-adaptive data structure (like locality-sensitive hashing) for $c$-approximate near-neighbor search that admits $r$ probes must use space $\Omega(n^{1 + \frac{\mu}{cr}})$. In contrast, for LSH under $\ell_1$ the best bound is $\Omega(n^{1 + \frac{1}{cr}})$. Our new tool is a directed variant of the standard boolean noise operator. We show that a generalization of the Bonami-Beckner hypercontractivity inequality exists "in expectation" or upon restriction to certain subsets of the Hamming cube, and that this is sufficient to prove the desired isoperimetric inequality that we use in our data structure lower bound. We also present a structural result reducing the Hamming cube to a Bregman cube. This structure allows us to obtain lower bounds for problems under Bregman divergences from their $\ell_1$ analog. In particular, we get a (weaker) lower bound for approximate near neighbor search of the form $\Omega(n^{1 + \frac{1}{cr}})$ for an $r$-query non-adaptive data structure, and new cell probe lower bounds for a number of other near neighbor questions in Bregman space.
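As background, the standard definition of a Bregman divergence is $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla\phi(y), x - y \rangle$. The sketch below evaluates it for the two examples named in the abstract, squared Euclidean distance and the Kullback-Leibler divergence. The helper names are hypothetical, and the code illustrates only the definition, not the lower-bound techniques of the paper.

```python
import math


def bregman(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    inner = sum(g * (a - b) for g, a, b in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


def sq_norm(x):
    return sum(a * a for a in x)


def sq_norm_grad(x):
    return [2.0 * a for a in x]


def neg_entropy(p):
    # Requires strictly positive entries.
    return sum(a * math.log(a) for a in p)


def neg_entropy_grad(p):
    return [math.log(a) + 1.0 for a in p]


# phi(x) = ||x||^2 recovers the squared Euclidean distance.
d_sq = bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 5.0])

# phi(p) = sum p_i log p_i recovers KL divergence on probability vectors,
# which is not symmetric in its arguments.
p, q = [0.5, 0.5], [0.9, 0.1]
d_pq = bregman(neg_entropy, neg_entropy_grad, p, q)
d_qp = bregman(neg_entropy, neg_entropy_grad, q, p)
```

The two KL values differ, which illustrates why algorithms for Bregman divergences cannot assume symmetry or the triangle inequality.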

preprint 2015 arXiv

A Group Theoretic Perspective on Unsupervised Deep Learning

Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pretraining: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.

preprint 2015 arXiv

Certifying and removing disparate impact

What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender, religious practice) and an explicit description of the process. When the process is implemented using computers, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the algorithm, we propose making inferences based on the data the algorithm uses. We make four contributions to this problem. First, we link the legal notion of disparate impact to a measure of classification accuracy that while known, has received relatively little attention. Second, we propose a test for disparate impact based on analyzing the information leakage of the protected class from the other data attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.
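A common way to quantify "widely different outcomes for different groups" is the ratio of selection rates between a protected group and everyone else. The sketch below computes that ratio on made-up outcomes. The function names, the toy data, and the 0.8 cutoff (the widely cited "four-fifths" guideline) are assumptions introduced here for illustration; the abstract does not state which threshold or test the paper uses.

```python
def selection_rate(outcomes):
    """Fraction of positive (1) outcomes in a list of 0/1 decisions."""
    return sum(outcomes) / len(outcomes)


def disparate_impact_ratio(protected_outcomes, other_outcomes):
    """Selection rate of the protected group divided by that of the others.

    Undefined (raises ZeroDivisionError) if nobody in the other group
    is selected.
    """
    return selection_rate(protected_outcomes) / selection_rate(other_outcomes)


# Hypothetical decisions: 1 = selected, 0 = not selected.
protected = [1, 0, 0, 0]
others = [1, 1, 0, 0]

ratio = disparate_impact_ratio(protected, others)

# Assumed threshold for illustration only.
FOUR_FIFTHS = 0.8
flagged = ratio < FOUR_FIFTHS
```

Note that this ratio needs only the decisions and group labels, which matches the abstract's emphasis on reasoning from data rather than from access to the algorithm.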

preprint 2015 arXiv

Sketching, Embedding, and Dimensionality Reduction for Information Spaces

Information distances like the Hellinger distance and the Jensen-Shannon divergence have deep roots in information theory and machine learning. They are used extensively in data analysis, especially when the objects being compared are high dimensional empirical probability distributions built from data. However, we lack common tools needed to actually use information distances in applications efficiently and at scale with any kind of provable guarantees. We can't sketch these distances easily, or embed them in better behaved spaces, or even reduce the dimensionality of the space while maintaining the probability structure of the data. In this paper, we build these tools for information distances, both for the Hellinger distance and Jensen-Shannon divergence, as well as related measures like the $\chi^2$ divergence. We first show that they can be sketched efficiently (i.e. up to multiplicative error in sublinear space) in the aggregate streaming model. This result is exponentially stronger than known upper bounds for sketching these distances in the strict turnstile streaming model. Second, we show a finite dimensionality embedding result for the Jensen-Shannon and $\chi^2$ divergences that preserves pairwise distances. Finally, we prove a dimensionality reduction result for the Hellinger, Jensen-Shannon, and $\chi^2$ divergences that preserves the information geometry of the distributions (specifically, by retaining the simplex structure of the space). While our second result above already implies that these divergences can be explicitly embedded in Euclidean space, retaining the simplex structure is important because it allows us to continue doing inference in the reduced space. In essence, we preserve not just the distance structure but the underlying geometry of the space.
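For reference, the two main distances named above have short closed forms on discrete distributions. The sketch below computes them exactly, using natural logarithms and the common $1/\sqrt{2}$ normalization for the Hellinger distance (both are conventions assumed here). It illustrates only the definitions, not the sketching or embedding results of the paper.

```python
import math


def hellinger(p, q):
    """Hellinger distance, normalized to lie in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2)


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q), with 0 * log 0 taken as 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Distributions with disjoint support attain the maximum of both measures.
p, q = [1.0, 0.0], [0.0, 1.0]
h_max = hellinger(p, q)
js_max = jensen_shannon(p, q)
```

Unlike the Kullback-Leibler divergence, both quantities are symmetric and bounded, which is part of what makes them attractive for comparing empirical distributions.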

preprint 2015 arXiv

Streaming Verification in Data Analysis

Streaming interactive proofs (SIPs) are a framework to reason about outsourced computation, where a data owner (the verifier) outsources a computation to the cloud (the prover), but wishes to verify the correctness of the solution provided by the cloud service. In this paper we present streaming interactive proofs for problems in data analysis. We present protocols for clustering and shape fitting problems, as well as an improved protocol for rectangular matrix multiplication. The latter can in turn be used to verify $k$ eigenvectors of a (streamed) $n \times n$ matrix. In general our solutions use polylogarithmic rounds of communication and polylogarithmic total communication and verifier space. For special cases (when optimality certificates can be verified easily), we present constant round protocols with similar costs. For rectangular matrix multiplication and eigenvector verification, our protocols work in the more restricted annotated data streaming model, and use sublinear (but not polylogarithmic) communication.

preprint 2015 arXiv

Why does Deep Learning work? - A perspective from Group Theory

Why does Deep Learning work? What representations does it capture? How do higher-order representations emerge? We study these questions from the perspective of group theory, thereby opening a new approach towards a theory of Deep learning. One factor behind the recent resurgence of the subject is a key algorithmic step called pre-training: first search for a good generative model for the input samples, and repeat the process one layer at a time. We show deeper implications of this simple principle, by establishing a connection with the interplay of orbits and stabilizers of group actions. Although the neural networks themselves may not form groups, we show the existence of shadow groups whose elements serve as close approximations. Over the shadow groups, the pre-training step, originally introduced as a mechanism to better initialize a network, becomes equivalent to a search for features with minimal orbits. Intuitively, these features are in a way the simplest, which explains why a deep learning network learns simple features first. Next, we show how the same principle, when repeated in the deeper layers, can capture higher order representations, and why representation complexity increases as the layers get deeper.

preprint 2014 arXiv

A Geometric Algorithm for Scalable Multiple Kernel Learning

We present a geometric formulation of the Multiple Kernel Learning (MKL) problem. To do so, we reinterpret the problem of learning kernel weights as searching for a kernel that maximizes the minimum (kernel) distance between two convex polytopes. This interpretation combined with novel structural insights from our geometric formulation allows us to reduce the MKL problem to a simple optimization routine that yields provable convergence as well as quality guarantees. As a result our method scales efficiently to much larger data sets than most prior methods can handle. Empirical evaluation on eleven datasets shows that we are significantly faster and even compare favorably with a uniform unweighted combination of kernels.

preprint 2013 arXiv

Approximate Bregman near neighbors in sublinear time: Beyond the triangle inequality

In this paper we present the first provable approximate nearest-neighbor (ANN) algorithms for Bregman divergences. Our first algorithm processes queries in O(log^d n) time using O(n log^d n) space and only uses general properties of the underlying distance function (which includes Bregman divergences as a special case). The second algorithm processes queries in O(log n) time using O(n) space and exploits structural constants associated specifically with Bregman divergences. An interesting feature of our algorithms is that they extend the ring-tree + quad-tree paradigm for ANN searching beyond Euclidean distances and metrics of bounded doubling dimension to distances that might not even be symmetric or satisfy a triangle inequality.

preprint 2013 arXiv

Multiple Target Tracking with RF Sensor Networks

RF sensor networks are wireless networks that can localize and track people (or targets) without needing them to carry or wear any electronic device. They use the change in the received signal strength (RSS) of the links due to the movements of people to infer their locations. In this paper, we consider real-time multiple target tracking with RF sensor networks. We perform radio tomographic imaging (RTI), which generates images of the change in the propagation field, as if they were frames of a video. Our RTI method uses RSS measurements on multiple frequency channels on each link, combining them with a fade level-based weighted average. We describe methods to adapt machine vision methods to the peculiarities of RTI to enable real time multiple target tracking. Several tests are performed in an open environment, a one-bedroom apartment, and a cluttered office environment. The results demonstrate that the system is capable of accurately tracking in real-time up to 4 targets in cluttered indoor environments, even when their trajectories intersect multiple times, without mis-estimating the number of targets found in the monitored area. The highest average tracking error measured in the tests is 0.45 m with two targets, 0.46 m with three targets, and 0.55 m with four targets.

preprint 2013 arXiv

Power to the Points: Validating Data Memberships in Clusterings

A clustering is an implicit assignment of labels to points, based on proximity to other points. It is these labels that are then used for downstream analysis (either focusing on individual clusters, or identifying representatives of clusters and so on). Thus, in order to trust a clustering as a first step in exploratory data analysis, we must trust the labels assigned to individual data. Without supervision, how can we validate this assignment? In this paper, we present a method to attach affinity scores to the implicit labels of individual points in a clustering. The affinity scores capture the confidence level of the cluster that claims to "own" the point. This method is very general: it can be used with clusterings derived from Euclidean data, kernelized data, or even data derived from information spaces. It smoothly incorporates importance functions on clusters, allowing us to weight different clusters differently. It is also efficient: assigning an affinity score to a point depends only polynomially on the number of clusters and is independent of the number of points in the data. The dimensionality of the underlying space only appears in preprocessing. We demonstrate the value of our approach with an experimental study that illustrates the use of these scores in different data analysis tasks, as well as the efficiency and flexibility of the method. We also demonstrate useful visualizations of these scores; these might prove useful within an interactive analytics framework.

preprint 2013 arXiv

Rethinking Abstractions for Big Data: Why, Where, How, and What

Big data refers to large and complex data sets that, under existing approaches, exceed the capacity and capability of current compute platforms, systems software, analytical tools and human understanding. Numerous lessons on the scalability of big data can already be found in asymptotic analysis of algorithms and from the high-performance computing (HPC) and applications communities. However, scale is only one aspect of current big data trends; fundamentally, current and emerging problems in big data are a result of unprecedented complexity -- in the structure of the data and how to analyze it, in dealing with unreliability and redundancy, in addressing the human factors of comprehending complex data sets, in formulating meaningful analyses, and in managing the dense, power-hungry data centers that house big data. The computer science solution to complexity is finding the right abstractions, those that hide as much triviality as possible while revealing the essence of the problem that is being addressed. The "big data challenge" has disrupted computer science by stressing to the very limits the familiar abstractions which define the relevant subfields in data analysis, data management and the underlying parallel systems. As a result, not enough of these challenges are revealed by isolating abstractions in a traditional software stack or standard algorithmic and analytical techniques, and attempts to address complexity either oversimplify or require low-level management of details. The authors believe that the abstractions for big data need to be rethought, and this reorganization needs to evolve and be sustained through continued cross-disciplinary collaboration.

preprint2012arXiv

Efficient Protocols for Distributed Classification and Optimization

In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daume III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for $\epsilon$ training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses $O(d^2 \log(1/\epsilon))$ words of communication to classify distributed data in arbitrary dimension $d$, $\epsilon$-optimally. This readily extends to classification over $k$ nodes with $O(kd^2 \log(1/\epsilon))$ words of communication. Our proposed protocol is simple to implement and is considerably more efficient than the baselines it is compared against, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.

preprint2012arXiv

Protocols for Learning Classifiers on Distributed Data

We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

preprint2011arXiv

A Gentle Introduction to the Kernel Distance

This document reviews the definition of the kernel distance, providing a gentle introduction tailored to a reader with background in theoretical computer science, but limited exposure to technology more common to machine learning, functional analysis and geometric measure theory. The key aspect of the kernel distance developed here is its interpretation as an L_2 distance between probability measures or various shapes (e.g. point sets, curves, surfaces) embedded in a vector space (specifically an RKHS). This structure enables several elegant and efficient solutions to data analysis problems. We conclude with a glimpse into the mathematical underpinnings of this measure, highlighting its recent independent evolution in two separate fields.
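The L_2 interpretation above can be made concrete with a small sketch. It assumes a Gaussian kernel with bandwidth `sigma` and treats each point set as a uniform probability measure; both choices, and the function names, are illustrative assumptions rather than anything fixed by the abstract. Under those assumptions the squared kernel distance is the self-similarity of P plus the self-similarity of Q minus twice their cross-similarity.

```python
import math

def gaussian_kernel(p, q, sigma=1.0):
    """Similarity of two points: exp(-||p - q||^2 / (2 sigma^2)). Assumed kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def cross_similarity(P, Q, sigma=1.0):
    """Average kernel value over all pairs (p, q), i.e. the RKHS inner
    product of the two uniform measures supported on P and Q."""
    total = sum(gaussian_kernel(p, q, sigma) for p in P for q in Q)
    return total / (len(P) * len(Q))

def kernel_distance(P, Q, sigma=1.0):
    """Kernel distance: sqrt(k(P,P) + k(Q,Q) - 2 k(P,Q))."""
    sq = (cross_similarity(P, P, sigma)
          + cross_similarity(Q, Q, sigma)
          - 2.0 * cross_similarity(P, Q, sigma))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(sq, 0.0))
```

This direct evaluation touches every pair of points, so it takes time quadratic in the input size; it is meant only to illustrate the definition.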

preprint2011arXiv

Approximation Analysis of Influence Spread in Social Networks

In the context of influence propagation in a social graph, we can identify three orthogonal dimensions - the number of seed nodes activated at the beginning (known as budget), the expected number of activated nodes at the end of the propagation (known as expected spread or coverage), and the time taken for the propagation. We can constrain one or two of these and try to optimize the third. In their seminal paper, Kempe et al. constrained the budget, left time unconstrained, and maximized the coverage: this problem is known as Influence Maximization. In this paper, we study alternative optimization problems which are naturally motivated by resource and time constraints on viral marketing campaigns. In the first problem, termed Minimum Target Set Selection (or MINTSS for short), a coverage threshold n is given and the task is to find the minimum size seed set such that by activating it, at least n nodes are eventually activated in the expected sense. In the second problem, termed MINTIME, a coverage threshold n and a budget threshold k are given, and the task is to find a seed set of size at most k such that by activating it, at least n nodes are activated, in the minimum possible time. Both these problems are NP-hard, which motivates our interest in their approximation. For MINTSS, we develop a simple greedy algorithm and show that it provides a bicriteria approximation. We also establish a generic hardness result suggesting that improving it is likely to be hard. For MINTIME, we show that even bicriteria and tricriteria approximations are hard under several conditions. However, if we allow the budget to be boosted by a logarithmic factor and allow the coverage to fall short, then the problem can be solved exactly in PTIME. Finally, we show the value of the approximation algorithms, by comparing them against various heuristics.
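The greedy idea for MINTSS can be sketched on a simplified stand-in. The sketch below replaces expected spread under a propagation model with a deterministic coverage function: each candidate seed is assumed to activate a fixed, known set of nodes. That simplification, the `reach` dictionary, and the function name are assumptions made for illustration; the paper's algorithm and its bicriteria guarantee concern expected spread, which this toy version does not estimate.

```python
def greedy_min_seed_set(reach, threshold):
    """Pick seeds greedily until at least `threshold` nodes are covered.

    reach: dict mapping each candidate seed to the set of nodes it activates
           (a deterministic stand-in for expected spread).
    Returns the chosen seeds in the order they were picked.
    Raises ValueError if no seed set can reach the threshold.
    """
    covered = set()
    seeds = []
    while len(covered) < threshold:
        candidates = [s for s in reach if s not in seeds]
        # Choose the seed with the largest marginal gain in coverage.
        best = max(candidates, key=lambda s: len(reach[s] - covered), default=None)
        if best is None or not (reach[best] - covered):
            raise ValueError("coverage threshold is not reachable")
        seeds.append(best)
        covered |= reach[best]
    return seeds
```

For example, with `reach = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1}}` and a threshold of 6, the sketch picks "a" first (gain 4) and then "c" (gain 2).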

preprint2011arXiv

Comparing Distributions and Shapes using the Kernel Distance

Starting with a similarity function between objects, it is possible to define a distance metric on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis, measure theory and geometric measure theory, and have a rich structure that includes an isometric embedding into a (possibly infinite dimensional) Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis. In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: (i) We present fast approximation algorithms for computing the kernel distance between two point sets $P$ and $Q$ that run in near-linear time in the size of $P \cup Q$ (note that an explicit calculation would take quadratic time). (ii) We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time $O(n + \mathrm{poly}(1/\epsilon, \log n))$. (iii) We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of point sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets. These take advantage of the reproducing kernel Hilbert space and a new relation linking binary range spaces to continuous range spaces with bounded fat-shattering dimension.

preprint2011arXiv

Generating a Diverse Set of High-Quality Clusterings

We provide a new framework for generating multiple good quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components, generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.
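The second component, grouping many partitions down to k representatives, can be sketched as follows. The abstract does not say which distance between partitions or which grouping procedure is used, so this sketch substitutes two common stand-ins, both assumptions: the Rand distance (the fraction of point pairs on which two labelings disagree) and a farthest-point traversal. Partitions are given as label lists over the same points, with at least two points.

```python
from itertools import combinations

def rand_distance(a, b):
    """Fraction of point pairs grouped together in one labeling but
    separated in the other. Label names themselves do not matter."""
    pairs = list(combinations(range(len(a)), 2))
    disagree = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagree / len(pairs)

def pick_representatives(partitions, k):
    """Farthest-point traversal: start from partitions[0], then repeatedly
    add the partition farthest from everything chosen so far.
    Returns the indices of the chosen partitions."""
    chosen = [0]
    # dist[i] = distance from partition i to its nearest chosen partition.
    dist = [rand_distance(partitions[0], p) for p in partitions]
    while len(chosen) < min(k, len(partitions)):
        nxt = max(range(len(partitions)), key=lambda i: dist[i])
        chosen.append(nxt)
        dist = [min(d, rand_distance(partitions[nxt], p))
                for d, p in zip(dist, partitions)]
    return chosen
```

Because the two steps are decoupled, a different partition distance or selection rule can be swapped in without touching the code that generates the candidate partitions.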

preprint2011arXiv

Spatially-Aware Comparison and Consensus for Clusterings

This paper proposes a new distance metric between clusterings that incorporates information about the spatial distribution of points and clusters. Our approach builds on the idea of a Hilbert space-based representation of clusters as a combination of the representations of their constituent points. We use this representation and the underlying metric to design a spatially-aware consensus clustering procedure. This consensus procedure is implemented via a novel reduction to Euclidean clustering, and is both simple and efficient. All of our results apply to both soft and hard clusterings. We accompany these algorithms with a detailed experimental evaluation that demonstrates the efficiency and quality of our techniques.

preprint2010arXiv

A Unified Algorithmic Framework for Multi-Dimensional Scaling

In this paper, we propose a unified algorithmic framework for solving many known variants of MDS. Our algorithm is a simple iterative scheme with guaranteed convergence, and is modular; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods, in comparable time. We expect that this framework will be useful for a number of MDS variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower dimensional sphere, preserving geodesic distances. As a complement to this result, we also extend the Johnson-Lindenstrauss Lemma to this spherical setting, where projecting to a random $O((1/\epsilon^2) \log n)$-dimensional sphere causes $\epsilon$-distortion.

preprint2005arXiv

The Hunting of the Bump: On Maximizing Statistical Discrepancy

Anomaly detection has important applications in biosurveillance and environmental monitoring. When comparing measured data to data drawn from a baseline distribution, merely finding clusters in the measured data may not reveal true anomalies: these clusters may simply be the clusters of the baseline distribution. Hence, a discrepancy function is often used to examine how different measured data is from baseline data within a region. An anomalous region is thus defined to be one with high discrepancy. In this paper, we present algorithms for maximizing statistical discrepancy functions over the space of axis-parallel rectangles. We give provable approximation guarantees, both additive and relative, and our methods apply to any convex discrepancy function. Our algorithms work by connecting statistical discrepancy to combinatorial discrepancy; roughly speaking, we show that in order to maximize a convex discrepancy function over a class of shapes, one needs only maximize a linear discrepancy function over the same set of shapes. We derive general discrepancy functions for data generated from a one-parameter exponential family. This generalizes the widely-used Kulldorff scan statistic for data from a Poisson distribution. We present an algorithm running in time $O(\frac{1}{\epsilon} n^2 \log^2 n)$ that computes the maximum discrepancy rectangle to within additive error $\epsilon$, for the Kulldorff scan statistic. Similar results hold for relative error and for discrepancy functions for data coming from Gaussian, Bernoulli, and gamma distributions. Prior to our work, the best known algorithms were exact and ran in time $O(n^4)$.
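To make the problem concrete, here is a sketch of the exhaustive baseline, not the approximation algorithm described above. It assumes the data has been binned into a grid of counts (`measured` and `baseline`, both hypothetical inputs with a nonzero total) and uses the Poisson form of the Kulldorff statistic, d(m, b) = m ln(m/b) + (1 - m) ln((1 - m)/(1 - b)) when m > b and 0 otherwise, where m and b are the fractions of measured and baseline mass inside the rectangle. Every axis-parallel rectangle of grid cells is scored, using 2D prefix sums so that each rectangle costs constant time.

```python
import math

def kulldorff(m, b):
    """Kulldorff discrepancy for measured fraction m and baseline fraction b."""
    if m <= b:
        return 0.0
    if b <= 0.0:
        return math.inf
    value = m * math.log(m / b)
    if m < 1.0:  # the second term vanishes in the limit m -> 1
        value += (1.0 - m) * math.log((1.0 - m) / (1.0 - b))
    return value

def prefix_sums(grid):
    """S[i][j] = sum of grid[r][c] over r < i, c < j."""
    rows, cols = len(grid), len(grid[0])
    S = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            S[i + 1][j + 1] = grid[i][j] + S[i][j + 1] + S[i + 1][j] - S[i][j]
    return S

def rect_sum(S, r1, c1, r2, c2):
    """Sum over the inclusive cell range [r1..r2] x [c1..c2]."""
    return S[r2 + 1][c2 + 1] - S[r1][c2 + 1] - S[r2 + 1][c1] + S[r1][c1]

def max_discrepancy_rectangle(measured, baseline):
    """Return (best score, (r1, c1, r2, c2)) over all grid rectangles."""
    M, B = prefix_sums(measured), prefix_sums(baseline)
    rows, cols = len(measured), len(measured[0])
    total_m, total_b = M[rows][cols], B[rows][cols]
    best = (0.0, None)
    for r1 in range(rows):
        for r2 in range(r1, rows):
            for c1 in range(cols):
                for c2 in range(c1, cols):
                    m = rect_sum(M, r1, c1, r2, c2) / total_m
                    b = rect_sum(B, r1, c1, r2, c2) / total_b
                    score = kulldorff(m, b)
                    if score > best[0]:
                        best = (score, (r1, c1, r2, c2))
    return best
```

On a g-by-g grid this enumerates on the order of g^4 rectangles, which is the kind of cost the approximation algorithms are designed to avoid.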

31 published item(s)

Bridging Prediction and Intervention Problems in Social Systems

The Commodification of AI Sovereignty: Lessons from the Fight for Sovereign Oil

It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks

Measuring and mitigating voting access disparities: a study of race and polling locations in Florida and North Carolina

A Research Ecosystem for Secure Computing

Fair clustering via equitable group representations

Evolving Methods for Evaluating and Disseminating Computing Research

Problems with Shapley-value-based explanations as feature importance measures

A Unified View of Localized Kernel Learning

Auditing Black-box Models for Indirect Influence

On the (im)possibility of fairness

A directed isoperimetric inequality with application to Bregman near neighbor lower bounds

A Group Theoretic Perspective on Unsupervised Deep Learning

Certifying and removing disparate impact

Sketching, Embedding, and Dimensionality Reduction for Information Spaces

Streaming Verification in Data Analysis

Why does Deep Learning work? - A perspective from Group Theory

A Geometric Algorithm for Scalable Multiple Kernel Learning

Approximate Bregman near neighbors in sublinear time: Beyond the triangle inequality

Multiple Target Tracking with RF Sensor Networks

Power to the Points: Validating Data Memberships in Clusterings

Rethinking Abstractions for Big Data: Why, Where, How, and What

Efficient Protocols for Distributed Classification and Optimization

Protocols for Learning Classifiers on Distributed Data

A Gentle Introduction to the Kernel Distance

Approximation Analysis of Influence Spread in Social Networks

Comparing Distributions and Shapes using the Kernel Distance

Generating a Diverse Set of High-Quality Clusterings

Spatially-Aware Comparison and Consensus for Clusterings

A Unified Algorithmic Framework for Multi-Dimensional Scaling

The Hunting of the Bump: On Maximizing Statistical Discrepancy