Researcher profile

Suresh Venkatasubramanian

Suresh Venkatasubramanian contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
15works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

15 published item(s)

preprint2026arXiv

Bridging Prediction and Intervention Problems in Social Systems

Many automated decision systems (ADS) are designed to solve prediction problems -- where the goal is to learn patterns from a sample of the population and apply them to individuals from the same population. In reality, these prediction systems operationalize holistic policy interventions in deployment. Once deployed, ADS can shape impacted population outcomes through an effective policy change in how decision-makers operate, while also being defined by past and present interactions between stakeholders and the limitations of existing organizational, as well as societal, infrastructure and context. In this work, we consider the ways in which we must shift from a prediction-focused paradigm to an intervention-oriented paradigm when considering the impact of ADS within social systems. We argue this requires a new default problem setup for ADS beyond prediction, to instead consider predictions as decision support, final decisions, and outcomes. We highlight how this perspective unifies modern statistical frameworks and other tools to study the design, implementation, and evaluation of ADS systems, and point to the research directions necessary to operationalize this paradigm shift. Using these tools, we characterize the limitations of focusing on isolated prediction tasks, and lay the foundation for a more intervention-oriented approach to developing and deploying ADS.

preprint2026arXiv

The Commodification of AI Sovereignty: Lessons from the Fight for Sovereign Oil

"Sovereignty" is increasingly a part of national AI policies and strategies. At the same time that "sovereignty" is invoked as a priority for global AI policy, it is also being commodified along the AI stack. Companies now sell "sovereign" AI factories, clouds, and language models to governments, enterprises, and communities -- turning a contested value into a commercial commodity. This shift risks allowing private technology providers to define sovereignty on their own terms. By analyzing the history of sovereignty and parallels in global oil production, this paper aims to open avenues to interrogate the implications of this value's commercialization. The contributions of this paper lie in a disentangling of the facets of sovereignty being appealed to through the AI stack and a case for how analogizing oil and AI can be generative in thinking through what is achieved and what can be achieved through the commodification of AI sovereignty.

preprint2022arXiv

It's COMPASlicated: The Messy Relationship between RAI Datasets and Algorithmic Fairness Benchmarks

Risk assessment instrument (RAI) datasets, particularly ProPublica's COMPAS dataset, are commonly used in algorithmic fairness papers due to benchmarking practices of comparing algorithms on datasets used in prior work. In many cases, this data is used as a benchmark to demonstrate good performance without accounting for the complexities of criminal justice (CJ) processes. However, we show that pretrial RAI datasets can contain numerous measurement biases and errors, and due to disparities in discretion and deployment, algorithmic fairness applied to RAI datasets is limited in making claims about real-world outcomes. These reasons make the datasets a poor fit for benchmarking under assumptions of ground truth and real-world impact. Furthermore, conventional practices of simply replicating previous data experiments may implicitly inherit or edify normative positions without explicitly interrogating value-laden assumptions. Without context of how interdisciplinary fields have engaged in CJ research and context of how RAIs operate upstream and downstream, algorithmic fairness practices are misaligned for meaningful contribution in the context of CJ, and would benefit from transparent engagement with normative considerations and values related to fairness, justice, and equality. These factors prompt questions about whether benchmarks for intrinsically socio-technical systems like the CJ system can exist in a beneficial and ethical way.

preprint2022arXiv

Measuring and mitigating voting access disparities: a study of race and polling locations in Florida and North Carolina

Voter suppression and associated racial disparities in access to voting are long-standing civil rights concerns in the United States. Barriers to voting have taken many forms over the decades. A history of violent explicit discouragement has shifted to more subtle access limitations that can include long lines and wait times, long travel times to reach a polling station, and other logistical barriers to voting. Our focus in this work is on quantifying disparities in voting access pertaining to the overall time-to-vote, and how they could be remedied via a better choice of polling location or provisioning more sites where voters can cast ballots. However, appropriately calibrating access disparities is difficult because of the need to account for factors such as population density and different community expectations for reasonable travel times. In this paper, we quantify access to polling locations, developing a methodology for the calibrated measurement of racial disparities in polling location "load" and distance to polling locations. We apply this methodology to a study of real-world data from Florida and North Carolina to identify disparities in voting access from the 2020 election. We also introduce algorithms, with modifications to handle scale, that can reduce these disparities by suggesting new polling locations from a given list of identified public locations (including schools and libraries). Applying these algorithms on the 2020 election location data also helps to expose and explore tradeoffs between the cost of allocating more polling locations and the potential impact on access disparities. The developed voting access measurement methodology and algorithmic remediation technique is a first step in better polling location assignment.

preprint2021arXiv

A Research Ecosystem for Secure Computing

Computing devices are vital to all areas of modern life and permeate every aspect of our society. The ubiquity of computing and our reliance on it has been accelerated and amplified by the COVID-19 pandemic. From education to work environments to healthcare to defense to entertainment - it is hard to imagine a segment of modern life that is not touched by computing. The security of computers, systems, and applications has been an active area of research in computer science for decades. However, with the confluence of both the scale of interconnected systems and increased adoption of artificial intelligence, there are many research challenges the community must face so that our society can continue to benefit and risks are minimized, not multiplied. Those challenges range from security and trust of the information ecosystem to adversarial artificial intelligence and machine learning. Along with basic research challenges, more often than not, securing a system happens after the design or even deployment, meaning the security community is routinely playing catch-up and attempting to patch vulnerabilities that could be exploited any minute. While security measures such as encryption and authentication have been widely adopted, questions of security tend to be secondary to application capability. There needs to be a sea-change in the way we approach this critically important aspect of the problem: new incentives and education are at the core of this change. Now is the time to refocus research community efforts on developing interconnected technologies with security "baked in by design" and creating an ecosystem that ensures adoption of promising research developments. To realize this vision, two additional elements of the ecosystem are necessary - proper incentive structures for adoption and an educated citizenry that is well versed in vulnerabilities and risks.

preprint2021arXiv

Fair clustering via equitable group representations

What does it mean for a clustering to be fair? One popular approach seeks to ensure that each cluster contains groups in (roughly) the same proportion in which they exist in the population. The normative principle at play is balance: any cluster might act as a representative of the data, and thus should reflect its diversity. But clustering also captures a different form of representativeness. A core principle in most clustering problems is that a cluster center should be representative of the cluster it represents, by being "close" to the points associated with it. This is so that we can effectively replace the points by their cluster centers without significant loss in fidelity, and indeed is a common "use case" for clustering. For such a clustering to be fair, the centers should "represent" different groups equally well. We call such a clustering a group-representative clustering. In this paper, we study the structure and computation of group-representative clusterings. We show that this notion naturally parallels the development of fairness notions in classification, with direct analogs of ideas like demographic parity and equal opportunity. We demonstrate how these notions are distinct from and cannot be captured by balance-based notions of fairness. We present approximation algorithms for group representative $k$-median clustering and couple this with an empirical evaluation on various real-world data sets.

preprint2020arXiv

Evolving Methods for Evaluating and Disseminating Computing Research

Social and technical trends have significantly changed methods for evaluating and disseminating computing research. Traditional venues for reviewing and publishing, such as conferences and journals, worked effectively in the past. Recently, trends have created new opportunities but also put new pressures on the process of review and dissemination. For example, many conferences have seen large increases in the number of submissions. Likewise, dissemination of research ideas has become dramatically through publication venues such as arXiv.org and social media networks. While these trends predate COVID-19, the pandemic could accelerate longer term changes. Based on interviews with leading academics in computing research, our findings include: (1) Trends impacting computing research are largely positive and have increased the participation, scope, accessibility, and speed of the research process. (2) Challenges remain in securing the integrity of the process, including addressing ways to scale the review process, avoiding attempts to misinform or confuse the dissemination of results, and ensuring fairness and broad participation in the process itself. Based on these findings, we recommend: (1) Regularly polling members of the computing research community, including program and general conference chairs, journal editors, authors, reviewers, etc., to identify specific challenges they face to better understand these issues. (2) An influential body, such as the Computing Research Association regularly issues a "State of the Computing Research Enterprise" report to update the community on trends, both positive and negative, impacting the computing research enterprise. (3) A deeper investigation, specifically to better understand the influence that social media and preprint archives have on computing research, is conducted.

preprint2020arXiv

Problems with Shapley-value-based explanations as feature importance measures

Game-theoretic formulations of feature importance have become popular as a way to "explain" machine learning models. These methods define a cooperative game between the features of a model and distribute influence among these input elements using some form of the game's unique Shapley values. Justification for these methods rests on two pillars: their desirable mathematical properties, and their applicability to specific motivations for explanations. We show that mathematical problems arise when Shapley values are used for feature importance and that the solutions to mitigate these necessarily induce further complexity, such as the need for causal reasoning. We also draw on additional literature to argue that Shapley values do not provide explanations which suit human-centric goals of explainability.

preprint2012arXiv

Efficient Protocols for Distributed Classification and Optimization

In distributed learning, the goal is to perform a learning task over data distributed across multiple nodes with minimal (expensive) communication. Prior work (Daume III et al., 2012) proposes a general model that bounds the communication required for learning classifiers while allowing for $\eps$ training error on linearly separable data adversarially distributed across nodes. In this work, we develop key improvements and extensions to this basic model. Our first result is a two-party multiplicative-weight-update based protocol that uses $O(d^2 \log{1/\eps})$ words of communication to classify distributed data in arbitrary dimension $d$, $\eps$-optimally. This readily extends to classification over $k$ nodes with $O(kd^2 \log{1/\eps})$ words of communication. Our proposed protocol is simple to implement and is considerably more efficient than baselines compared, as demonstrated by our empirical results. In addition, we illustrate general algorithm design paradigms for doing efficient learning over distributed data. We show how to solve fixed-dimensional and high dimensional linear programming efficiently in a distributed setting where constraints may be distributed across nodes. Since many learning problems can be viewed as convex optimization problems where constraints are generated by individual points, this models many typical distributed learning scenarios. Our techniques make use of a novel connection from multipass streaming, as well as adapting the multiplicative-weight-update framework more generally to a distributed setting. As a consequence, our methods extend to the wide range of problems solvable using these techniques.

preprint2012arXiv

Protocols for Learning Classifiers on Distributed Data

We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

preprint2011arXiv

A Gentle Introduction to the Kernel Distance

This document reviews the definition of the kernel distance, providing a gentle introduction tailored to a reader with background in theoretical computer science, but limited exposure to technology more common to machine learning, functional analysis and geometric measure theory. The key aspect of the kernel distance developed here is its interpretation as an L_2 distance between probability measures or various shapes (e.g. point sets, curves, surfaces) embedded in a vector space (specifically an RKHS). This structure enables several elegant and efficient solutions to data analysis problems. We conclude with a glimpse into the mathematical underpinnings of this measure, highlighting its recent independent evolution in two separate fields.

preprint2011arXiv

Approximation Analysis of Influence Spread in Social Networks

In the context of influence propagation in a social graph, we can identify three orthogonal dimensions - the number of seed nodes activated at the beginning (known as budget), the expected number of activated nodes at the end of the propagation (known as expected spread or coverage), and the time taken for the propagation. We can constrain one or two of these and try to optimize the third. In their seminal paper, Kempe et al. constrained the budget, left time unconstrained, and maximized the coverage: this problem is known as Influence Maximization. In this paper, we study alternative optimization problems which are naturally motivated by resource and time constraints on viral marketing campaigns. In the first problem, termed Minimum Target Set Selection (or MINTSS for short), a coverage threshold n is given and the task is to find the minimum size seed set such that by activating it, at least n nodes are eventually activated in the expected sense. In the second problem, termed MINTIME, a coverage threshold n and a budget threshold k are given, and the task is to find a seed set of size at most k such that by activating it, at least n nodes are activated, in the minimum possible time. Both these problems are NP-hard, which motivates our interest in their approximation. For MINTSS, we develop a simple greedy algorithm and show that it provides a bicriteria approximation. We also establish a generic hardness result suggesting that improving it is likely to be hard. For MINTIME, we show that even bicriteria and tricriteria approximations are hard under several conditions. However, if we allow the budget to be boosted by a logarithmic factor and allow the coverage to fall short, then the problem can be solved exactly in PTIME. Finally, we show the value of the approximation algorithms, by comparing them against various heuristics.

preprint2011arXiv

Comparing Distributions and Shapes using the Kernel Distance

Starting with a similarity function between objects, it is possible to define a distance metric on pairs of objects, and more generally on probability distributions over them. These distance metrics have a deep basis in functional analysis, measure theory and geometric measure theory, and have a rich structure that includes an isometric embedding into a (possibly infinite dimensional) Hilbert space. They have recently been applied to numerous problems in machine learning and shape analysis. In this paper, we provide the first algorithmic analysis of these distance metrics. Our main contributions are as follows: (i) We present fast approximation algorithms for computing the kernel distance between two point sets P and Q that runs in near-linear time in the size of (P cup Q) (note that an explicit calculation would take quadratic time). (ii) We present polynomial-time algorithms for approximately minimizing the kernel distance under rigid transformation; they run in time O(n + poly(1/epsilon, log n)). (iii) We provide several general techniques for reducing complex objects to convenient sparse representations (specifically to point sets or sets of points sets) which approximately preserve the kernel distance. In particular, this allows us to reduce problems of computing the kernel distance between various types of objects such as curves, surfaces, and distributions to computing the kernel distance between point sets. These take advantage of the reproducing kernel Hilbert space and a new relation linking binary range spaces to continuous range spaces with bounded fat-shattering dimension.

preprint2011arXiv

Generating a Diverse Set of High-Quality Clusterings

We provide a new framework for generating multiple good quality partitions (clusterings) of a single data set. Our approach decomposes this problem into two components, generating many high-quality partitions, and then grouping these partitions to obtain k representatives. The decomposition makes the approach extremely modular and allows us to optimize various criteria that control the choice of representative partitions.

preprint2010arXiv

A Unified Algorithmic Framework for Multi-Dimensional Scaling

In this paper, we propose a unified algorithmic framework for solving many known variants of \mds. Our algorithm is a simple iterative scheme with guaranteed convergence, and is \emph{modular}; by changing the internals of a single subroutine in the algorithm, we can switch cost functions and target spaces easily. In addition to the formal guarantees of convergence, our algorithms are accurate; in most cases, they converge to better quality solutions than existing methods, in comparable time. We expect that this framework will be useful for a number of \mds variants that have not yet been studied. Our framework extends to embedding high-dimensional points lying on a sphere to points on a lower dimensional sphere, preserving geodesic distances. As a compliment to this result, we also extend the Johnson-Lindenstrauss Lemma to this spherical setting, where projecting to a random $O((1/\eps^2) \log n)$-dimensional sphere causes $\eps$-distortion.