Researcher profile

Cynthia Rudin

Cynthia Rudin contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
19works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

19 published item(s)

preprint2022arXiv

A robust approach to quantifying uncertainty in matching problems of causal inference

Unquantified sources of uncertainty in observational causal analyses can break the integrity of the results. One would never want another analyst to repeat a calculation with the same dataset, using a seemingly identical procedure, only to find a different conclusion. However, as we show in this work, there is a typical source of uncertainty that is essentially never considered in observational causal studies: the choice of match assignment for matched groups, that is, which unit is matched to which other unit before a hypothesis test is conducted. The choice of match assignment is anything but innocuous, and can have a surprisingly large influence on the causal conclusions. Given that a vast number of causal inference studies test hypotheses on treatment effects after treatment cases are matched with similar control cases, we should find a way to quantify how much this extra source of uncertainty impacts results. What we would really like to be able to report is that \emph{no matter} which match assignment is made, as long as the match is sufficiently good, then the hypothesis test result still holds. In this paper, we provide methodology based on discrete optimization to create robust tests that explicitly account for this possibility. We formulate robust tests for binary and continuous data based on common test statistics as integer linear programs solvable with common methodologies. We study the finite-sample behavior of our test statistic in the discrete-data case. We apply our methods to simulated and real-world datasets and show that they can produce useful results in practical applied settings.

preprint2022arXiv

BacHMMachine: An Interpretable and Scalable Model for Algorithmic Harmonization for Four-part Baroque Chorales

Algorithmic harmonization - the automated harmonization of a musical piece given its melodic line - is a challenging problem that has garnered much interest from both music theorists and computer scientists. One genre of particular interest is the four-part Baroque chorales of J.S. Bach. Methods for algorithmic chorale harmonization typically adopt a black-box, "data-driven" approach: they do not explicitly integrate principles from music theory but rely on a complex learning model trained with a large amount of chorale data. We propose instead a new harmonization model, called BacHMMachine, which employs a "theory-driven" framework guided by music composition principles, along with a "data-driven" model for learning compositional features within this framework. As its name suggests, BacHMMachine uses a novel Hidden Markov Model based on key and chord transitions, providing a probabilistic framework for learning key modulations and chordal progressions from a given melodic line. This allows for the generation of creative, yet musically coherent chorale harmonizations; integrating compositional principles allows for a much simpler model that results in vast decreases in computational burden and greater interpretability compared to state-of-the-art algorithmic harmonization methods, at no penalty to quality of harmonization or musicality. We demonstrate this improvement via comprehensive experiments and Turing tests comparing BacHMMachine to existing methods.

preprint2022arXiv

Fast Sparse Decision Tree Optimization via Reference Ensembles

Sparse decision tree optimization has been one of the most fundamental problems in AI since its inception and is a challenge at the core of interpretable machine learning. Sparse decision tree optimization is computationally hard, and despite steady effort since the 1960's, breakthroughs have only been made on the problem within the past few years, primarily on the problem of finding optimal sparse decision trees. However, current state-of-the-art algorithms often require impractical amounts of computation time and memory to find optimal or near-optimal trees for some real-world datasets, particularly those having several continuous-valued features. Given that the search spaces of these decision tree optimization problems are massive, can we practically hope to find a sparse decision tree that competes in accuracy with a black box machine learning model? We address this problem via smart guessing strategies that can be applied to any optimal branch-and-bound-based decision tree algorithm. We show that by using these guesses, we can reduce the run time by multiple orders of magnitude, while providing bounds on how far the resulting trees can deviate from the black box's accuracy and expressive power. Our approach enables guesses about how to bin continuous features, the size of the tree, and lower bounds on the error for the optimal decision tree. Our experiments show that in many cases we can rapidly construct sparse decision trees that match the accuracy of black box models. To summarize: when you are having trouble optimizing, just guess.

preprint2022arXiv

In Pursuit of Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism Prediction

Objectives: We study interpretable recidivism prediction using machine learning (ML) models and analyze performance in terms of prediction ability, sparsity, and fairness. Unlike previous works, this study trains interpretable models that output probabilities rather than binary predictions, and uses quantitative fairness definitions to assess the models. This study also examines whether models can generalize across geographic locations. Methods: We generated black-box and interpretable ML models on two different criminal recidivism datasets from Florida and Kentucky. We compared predictive performance and fairness of these models against two methods that are currently used in the justice system to predict pretrial recidivism: the Arnold PSA and COMPAS. We evaluated predictive performance of all models on predicting six different types of crime over two time spans. Results: Several interpretable ML models can predict recidivism as well as black-box ML models and are more accurate than COMPAS or the Arnold PSA. These models are potentially useful in practice. Similar to the Arnold PSA, some of these interpretable models can be written down as a simple table. Others can be displayed using a set of visualizations. Our geographic analysis indicates that ML models should be trained separately for separate locations and updated over time. We also present a fairness analysis for the interpretable models. Conclusions: Interpretable machine learning models can perform just as well as non-interpretable methods and currently-used risk assessment scales, in terms of both prediction accuracy and fairness. Machine learning models might be more accurate when trained separately for distinct locations and kept up-to-date.

preprint2022arXiv

On the Existence of Simpler Machine Learning Models

It is almost always easier to find an accurate-but-complex model than an accurate-yet-simple model. Finding optimal, sparse, accurate models of various forms (linear models with integer coefficients, decision sets, rule lists, decision trees) is generally NP-hard. We often do not know whether the search for a simpler model will be worthwhile, and thus we do not go to the trouble of searching for one. In this work, we ask an important practical question: can accurate-yet-simple models be proven to exist, or shown likely to exist, before explicitly searching for them? We hypothesize that there is an important reason that simple-yet-accurate models often do exist. This hypothesis is that the size of the Rashomon set is often large, where the Rashomon set is the set of almost-equally-accurate models from a function class. If the Rashomon set is large, it contains numerous accurate models, and perhaps at least one of them is the simple model we desire. In this work, we formally present the Rashomon ratio as a new gauge of simplicity for a learning problem, depending on a function class and a data set. The Rashomon ratio is the ratio of the volume of the set of accurate models to the volume of the hypothesis space, and it is different from standard complexity measures from statistical learning theory. Insight from studying the Rashomon ratio provides an easy way to check whether a simpler model might exist for a problem before finding it, namely whether several different machine learning methods achieve similar performance on the data. In that sense, the Rashomon ratio is a powerful tool for understanding why and when an accurate-yet-simple model might exist. If, as we hypothesize in this work, many real-world data sets admit large Rashomon sets, the implications are vast: it means that simple or interpretable models may often be used for high-stakes decisions without losing accuracy.

preprint2022arXiv

SegDiscover: Visual Concept Discovery via Unsupervised Semantic Segmentation

Visual concept discovery has long been deemed important to improve interpretability of neural networks, because a bank of semantically meaningful concepts would provide us with a starting point for building machine learning models that exhibit intelligible reasoning process. Previous methods have disadvantages: either they rely on labelled support sets that incorporate human biases for objects that are "useful," or they fail to identify multiple concepts that occur within a single image. We reframe the concept discovery task as an unsupervised semantic segmentation problem, and present SegDiscover, a novel framework that discovers semantically meaningful visual concepts from imagery datasets with complex scenes without supervision. Our method contains three important pieces: generating concept primitives from raw images, discovering concepts by clustering in the latent space of a self-supervised pretrained encoder, and concept refinement via neural network smoothing. Experimental results provide evidence that our method can discover multiple concepts within a single image and outperforms state-of-the-art unsupervised methods on complex datasets such as Cityscapes and COCO-Stuff. Our method can be further used as a neural network explanation tool by comparing results obtained by different encoders.

preprint2022arXiv

There is no Accuracy-Interpretability Tradeoff in Reinforcement Learning for Mazes

Interpretability is an essential building block for trustworthiness in reinforcement learning systems. However, interpretability might come at the cost of deteriorated performance, leading many researchers to build complex models. Our goal is to analyze the cost of interpretability. We show that in certain cases, one can achieve policy interpretability while maintaining its optimality. We focus on a classical problem from reinforcement learning: mazes with $k$ obstacles in $\mathbb{R}^d$. We prove the existence of a small decision tree with a linear function at each inner node and depth $O(\log k + 2^d)$ that represents an optimal policy. Note that for the interesting case of a constant $d$, we have $O(\log k)$ depth. Thus, in this setting, there is no accuracy-interpretability tradeoff. To prove this result, we use a new "compressing" technique that might be useful in additional settings.

preprint2021arXiv

FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference

A classical problem in causal inference is that of matching, where treatment units need to be matched to control units based on covariate information. In this work, we propose a method that computes high quality almost-exact matches for high-dimensional categorical datasets. This method, called FLAME (Fast Large-scale Almost Matching Exactly), learns a distance metric for matching using a hold-out training data set. In order to perform matching efficiently for large datasets, FLAME leverages techniques that are natural for query processing in the area of database management, and two implementations of FLAME are provided: the first uses SQL queries and the second uses bit-vector techniques. The algorithm starts by constructing matches of the highest quality (exact matches on all covariates), and successively eliminates variables in order to match exactly on as many variables as possible, while still maintaining interpretable high-quality matches and balance between treatment and control groups. We leverage these high quality matches to estimate conditional average treatment effects (CATEs). Our experiments show that FLAME scales to huge datasets with millions of observations where existing state-of-the-art methods fail, and that it achieves significantly better performance than other matching methods.

preprint2021arXiv

Multitask Learning for Citation Purpose Classification

We present our entry into the 2021 3C Shared Task Citation Context Classification based on Purpose competition. The goal of the competition is to classify a citation in a scientific article based on its purpose. This task is important because it could potentially lead to more comprehensive ways of summarizing the purpose and uses of scientific articles, but it is also difficult, mainly due to the limited amount of available training data in which the purposes of each citation have been hand-labeled, along with the subjectivity of these labels. Our entry in the competition is a multi-task model that combines multiple modules designed to handle the problem from different perspectives, including hand-generated linguistic features, TF-IDF features, and an LSTM-with-attention model. We also provide an ablation study and feature analysis whose insights could lead to future work.

preprint2021arXiv

Regulating Greed Over Time in Multi-Armed Bandits

In retail, there are predictable yet dramatic time-dependent patterns in customer behavior, such as periodic changes in the number of visitors, or increases in customers just before major holidays. The current paradigm of multi-armed bandit analysis does not take these known patterns into account. This means that for applications in retail, where prices are fixed for periods of time, current bandit algorithms will not suffice. This work provides a remedy that takes the time-dependent patterns into account, and we show how this remedy is implemented for the UCB, $\varepsilon$-greedy, and UCB-L algorithms, and also through a new policy called the variable arm pool algorithm. In the corrected methods, exploitation (greed) is regulated over time, so that more exploitation occurs during higher reward periods, and more exploration occurs in periods of low reward. In order to understand why regret is reduced with the corrected methods, we present a set of bounds that provide insight into why we would want to exploit during periods of high reward, and discuss the impact on regret. Our proposed methods perform well in experiments, and were inspired by a high-scoring entry in the Exploration and Exploitation 3 contest using data from Yahoo$!$ Front Page. That entry heavily used time-series methods to regulate greed over time, which was substantially more effective than other contextual bandit methods.

preprint2021arXiv

There Once Was a Really Bad Poet, It Was Automated but You Didn't Know It

Limerick generation exemplifies some of the most difficult challenges faced in poetry generation, as the poems must tell a story in only five lines, with constraints on rhyme, stress, and meter. To address these challenges, we introduce LimGen, a novel and fully automated system for limerick generation that outperforms state-of-the-art neural network-based poetry models, as well as prior rule-based poetry models. LimGen consists of three important pieces: the Adaptive Multi-Templated Constraint algorithm that constrains our search to the space of realistic poems, the Multi-Templated Beam Search algorithm which searches efficiently through the space, and the probabilistic Storyline algorithm that provides coherent storylines related to a user-provided prompt word. The resulting limericks satisfy poetic constraints and have thematically coherent storylines, which are sometimes even funny (when we are lucky).

preprint2021arXiv

Towards Practical Lipschitz Bandits

Stochastic Lipschitz bandit algorithms balance exploration and exploitation, and have been used for a variety of important task domains. In this paper, we present a framework for Lipschitz bandit methods that adaptively learns partitions of context- and arm-space. Due to this flexibility, the algorithm is able to efficiently optimize rewards and minimize regret, by focusing on the portions of the space that are most relevant. In our analysis, we link tree-based methods to Gaussian processes. In light of our analysis, we design a novel hierarchical Bayesian model for Lipschitz bandit problems. Our experiments show that our algorithms can achieve state-of-the-art performance in challenging real-world tasks such as neural network hyperparameter tuning.

preprint2020arXiv

Adaptive Hyper-box Matching for Interpretable Individualized Treatment Effect Estimation

We propose a matching method for observational data that matches units with others in unit-specific, hyper-box-shaped regions of the covariate space. These regions are large enough that many matches are created for each unit and small enough that the treatment effect is roughly constant throughout. The regions are found as either the solution to a mixed integer program, or using a (fast) approximation algorithm. The result is an interpretable and tailored estimate of a causal effect for each unit.

preprint2020arXiv

Almost-Matching-Exactly for Treatment Effect Estimation under Network Interference

We propose a matching method that recovers direct treatment effects from randomized experiments where units are connected in an observed network, and units that share edges can potentially influence each others' outcomes. Traditional treatment effect estimators for randomized experiments are biased and error prone in this setting. Our method matches units almost exactly on counts of unique subgraphs within their neighborhood graphs. The matches that we construct are interpretable and high-quality. Our method can be extended easily to accommodate additional unit-level covariate information. We show empirically that our method performs better than other existing methodologies for this problem, while producing meaningful, interpretable results.

preprint2020arXiv

Bandits for BMO Functions

We study the bandit problem where the underlying expected reward is a Bounded Mean Oscillation (BMO) function. BMO functions are allowed to be discontinuous and unbounded, and are useful in modeling signals with infinities in the do-main. We develop a toolset for BMO bandits, and provide an algorithm that can achieve poly-log $δ$-regret -- a regret measured against an arm that is optimal after removing a $δ$-sized portion of the arm space.

preprint2020arXiv

PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models

The primary aim of single-image super-resolution is to construct high-resolution (HR) images from corresponding low-resolution (LR) inputs. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present an algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require supervised training on databases of LR-HR image pairs). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the "downscaling loss," which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee realistic outputs. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show proof of concept of our approach in the domain of face super-resolution (i.e., face hallucination). We also present a discussion of the limitations and biases of the method as currently implemented with an accompanying model card with relevant metrics. Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.

preprint2020arXiv

Variable Importance Clouds: A Way to Explore Variable Importance for the Set of Good Models

Variable importance is central to scientific studies, including the social sciences and causal inference, healthcare, and other domains. However, current notions of variable importance are often tied to a specific predictive model. This is problematic: what if there were multiple well-performing predictive models, and a specific variable is important to some of them and not to others? In that case, we may not be able to tell from a single well-performing model whether a variable is always important in predicting the outcome. Rather than depending on variable importance for a single predictive model, we would like to explore variable importance for all approximately-equally-accurate predictive models. This work introduces the concept of a variable importance cloud, which maps every variable to its importance for every good predictive model. We show properties of the variable importance cloud and draw connections to other areas of statistics. We introduce variable importance diagrams as a projection of the variable importance cloud into two dimensions for visualization purposes. Experiments with criminal justice, marketing data, and image classification tasks illustrate how variables can change dramatically in importance for approximately-equally-accurate predictive models

preprint2019arXiv

The age of secrecy and unfairness in recidivism prediction

In our current society, secret algorithms make important decisions about individuals. There has been substantial discussion about whether these algorithms are unfair to groups of individuals. While noble, this pursuit is complex and ultimately stagnating because there is no clear definition of fairness and competing definitions are largely incompatible. We argue that the focus on the question of fairness is misplaced, as these algorithms fail to meet a more important and yet readily obtainable goal: transparency. As a result, creators of secret algorithms can provide incomplete or misleading descriptions about how their models work, and various other kinds of errors can easily go unnoticed. By partially reverse engineering the COMPAS algorithm -- a recidivism-risk scoring algorithm used throughout the criminal justice system -- we show that it does not seem to depend linearly on the defendant's age, despite statements to the contrary by the algorithm's creator. Furthermore, by subtracting from COMPAS its (hypothesized) nonlinear age component, we show that COMPAS does not necessarily depend on race, contradicting ProPublica's analysis, which assumed linearity in age. In other words, faulty assumptions about a proprietary algorithm lead to faulty conclusions that go unchecked without careful reverse engineering. Were the algorithm transparent in the first place, this would likely not have occurred. The most important result in this work is that we find that there are many defendants with low risk score but long criminal histories, suggesting that data inconsistencies occur frequently in criminal justice databases. We argue that transparency satisfies a different notion of procedural fairness by providing both the defendants and the public with the opportunity to scrutinize the methodology and calculations behind risk scores for recidivism.

preprint2019arXiv

This Looks Like That: Deep Learning for Interpretable Image Recognition

When we are faced with challenging image classification tasks, we often explain our reasoning by dissecting the image, and pointing out prototypical aspects of one class or another. The mounting evidence for each of the classes helps us make our final decision. In this work, we introduce a deep network architecture -- prototypical part network (ProtoPNet), that reasons in a similar way: the network dissects the image by finding prototypical parts, and combines evidence from the prototypes to make a final classification. The model thus reasons in a way that is qualitatively similar to the way ornithologists, physicians, and others would explain to people on how to solve challenging image classification tasks. The network uses only image-level labels for training without any annotations for parts of images. We demonstrate our method on the CUB-200-2011 dataset and the Stanford Cars dataset. Our experiments show that ProtoPNet can achieve comparable accuracy with its analogous non-interpretable counterpart, and when several ProtoPNets are combined into a larger network, it can achieve an accuracy that is on par with some of the best-performing deep models. Moreover, ProtoPNet provides a level of interpretability that is absent in other interpretable deep models.