Source author record

Donald B. Rubin

Donald B. Rubin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Methodology, Applications, math.ST, Statistics Theory, Machine Learning, Social and Information Networks

Catalog footprint

What is connected

10 works

6 topics

4 close collaborators


Preprint, 2022, arXiv

Causal inference from treatment-control studies having an additional factor with unknown assignment mechanism

Consider a situation with two treatments, the first of which is randomized but the second is not, and the multifactor version of this. Interest is in treatment effects, defined using standard factorial notation. We define estimators for the treatment effects and explore their properties when there is information about the nonrandomized treatment assignment and when there is no information on the assignment of the nonrandomized treatment. We show when and how hidden treatments can bias estimators and inflate their sampling variances.

Preprint, 2021, arXiv

Automatic Detection of Influential Actors in Disinformation Networks

The weaponization of digital communications and social media to conduct disinformation campaigns at immense scale, speed, and reach presents new challenges to identify and counter hostile influence operations (IOs). This paper presents an end-to-end framework to automate detection of disinformation narratives, networks, and influential actors. The framework integrates natural language processing, machine learning, graph analytics, and a novel network causal inference approach to quantify the impact of individual actors in spreading IO narratives. We demonstrate its capability on real-world hostile IO campaigns with Twitter datasets collected during the 2017 French presidential elections, and known IO accounts disclosed by Twitter over a broad range of IO campaigns (May 2007 to February 2020), over 50,000 accounts, 17 countries, and different account types including both trolls and bots. Our system detects IO accounts with 96% precision, 79% recall, and 96% area-under-the-PR-curve, maps out salient network communities, and discovers high-impact accounts that escape the lens of traditional impact statistics based on activity counts and network centrality. Results are corroborated with independent sources of known IO accounts from U.S. Congressional reports, investigative journalism, and IO datasets provided by Twitter.

Preprint, 2021, arXiv

PCA Rerandomization

Mahalanobis distance between treatment group and control group covariate means is often adopted as a balance criterion when implementing a rerandomization strategy. However, this criterion may not work well for high-dimensional cases because it balances all orthogonalized covariates equally. Here, we propose leveraging principal component analysis (PCA) to identify proper subspaces in which Mahalanobis distance should be calculated. Not only can PCA effectively reduce the dimensionality for high-dimensional cases while capturing most of the information in the covariates, but it also provides computational simplicity by focusing on the top orthogonal components. We show that our PCA rerandomization scheme has desirable theoretical properties on balancing covariates and thereby on improving the estimation of average treatment effects. We also show that this conclusion is supported by numerical studies using both simulated and real examples.
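As a rough illustration of the idea in this abstract, the sketch below computes the Mahalanobis balance criterion on only the top `k` principal component scores. The function names, the SVD-based PCA, and the choice of `k` are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def top_pc_scores(X, k):
    """Project centered covariates onto their top-k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def pca_mahalanobis(X, w, k):
    """Mahalanobis distance between group means in the top-k PC subspace.

    X: (n, d) covariate matrix; w: (n,) array of 0/1 treatment indicators.
    """
    Z = top_pc_scores(X, k)
    n1, n0 = int(w.sum()), int((1 - w).sum())
    diff = Z[w == 1].mean(axis=0) - Z[w == 0].mean(axis=0)
    # PC scores are uncorrelated, so their covariance matrix is diagonal.
    var = Z.var(axis=0, ddof=1)
    return float(np.sum(diff ** 2 / (var * (1 / n1 + 1 / n0))))
```

With `k` equal to the number of covariates (and a full-rank covariance matrix), this reduces to the ordinary Mahalanobis criterion; a smaller `k` restricts the balance check to the leading components.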

Preprint, 2016, arXiv

Causal Inference in Rebuilding and Extending the Recondite Bridge between Finite Population Sampling and Experimental Design

This article considers causal inference for treatment contrasts from a randomized experiment using potential outcomes in a finite population setting. Adopting a Neymanian repeated sampling approach that integrates such causal inference with finite population survey sampling, an inferential framework is developed for general mechanisms of assigning experimental units to multiple treatments. This framework extends classical methods by allowing the possibility of randomization restrictions and unequal replications. Novel conditions that are "milder" than strict additivity of treatment effects, yet permit unbiased estimation of the finite population sampling variance of any treatment contrast estimator, are derived. The consequences of departures from such conditions are also studied under the criterion of minimax bias, and a new justification for using the Neymanian conservative sampling variance estimator in experiments is provided. The proposed approach can readily be extended to the case of treatments with a general factorial structure.
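For orientation, the familiar two-arm special case shows why the Neymanian sampling variance estimator is called conservative. This is the standard textbook result for a completely randomized experiment with $N$ units, $n_1$ treated and $n_0$ control. It is not the article's general multi-treatment framework, and the notation is chosen for this illustration.

```latex
% Difference-in-means estimator \hat{\tau} of the average treatment effect
\operatorname{Var}(\hat{\tau})
  = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_{\tau}^2}{N},
\qquad
\hat{V}_{\mathrm{Neyman}}
  = \frac{s_1^2}{n_1} + \frac{s_0^2}{n_0},
\qquad
\mathbb{E}\bigl[\hat{V}_{\mathrm{Neyman}}\bigr]
  = \operatorname{Var}(\hat{\tau}) + \frac{S_{\tau}^2}{N}
  \;\ge\; \operatorname{Var}(\hat{\tau}).
```

Here $S_1^2$ and $S_0^2$ are the finite population variances of the potential outcomes $Y_i(1)$ and $Y_i(0)$, $S_{\tau}^2$ is the finite population variance of the unit-level effects $Y_i(1) - Y_i(0)$, and $s_1^2$, $s_0^2$ are the observed within-group sample variances. The estimator is unbiased when $S_{\tau}^2 = 0$, that is, under strict additivity, which is the condition the abstract describes relaxing.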

Preprint, 2015, arXiv

Assessing the Potential Impact of a Nationwide Class-Based Affirmative Action System

We examine the possible consequences of a change in law school admissions in the United States from an affirmative action system based on race to one based on socioeconomic class. Using data from the 1991-1996 Law School Admission Council Bar Passage Study, students were reassigned attendance by simulation to law school tiers by transferring the affirmative action advantage for black students to students from low socioeconomic backgrounds. The hypothetical academic outcomes for the students were then multiply-imputed to quantify the uncertainty of the resulting estimates. The analysis predicts dramatic decreases in the numbers of black students in top law school tiers, suggesting that class-based affirmative action is insufficient to maintain racial diversity in prestigious law schools. Furthermore, there appear to be no statistically significant changes in the graduation and bar passage rates of students in any demographic group. The results thus provide evidence that, other than increasing their representation in upper tiers, current affirmative action policies relative to a socioeconomic-based system neither substantially help nor harm minority academic outcomes, contradicting the predictions of the "mismatch" hypothesis, which asserts otherwise.

Preprint, 2015, arXiv

Causal inference for ordinal outcomes

Many outcomes of interest in the social and health sciences, as well as in modern applications in computational social science and experimentation on social media platforms, are ordinal and do not have a meaningful scale. Causal analyses that leverage this type of data, termed ordinal non-numeric, require careful treatment, as much of the classical potential outcomes literature is concerned with estimation and hypothesis testing for outcomes whose relative magnitudes are well defined. Here, we propose a class of finite population causal estimands that depend on conditional distributions of the potential outcomes, and provide an interpretable summary of causal effects when no scale is available. We formulate a relaxation of the Fisherian sharp null hypothesis of constant effect that accommodates the scale-free nature of ordinal non-numeric data. We develop a Bayesian procedure to estimate the proposed causal estimands that leverages the rank likelihood. We illustrate these methods with an application to educational outcomes in the General Social Survey.

Preprint, 2015, arXiv

Improving Covariate Balance in 2^K Factorial Designs via Rerandomization

Factorial designs are widely used in agriculture, engineering, and the social sciences to study the causal effects of several factors simultaneously on a response. The objective of such a design is to estimate all factorial effects of interest, which typically include main effects and interactions among factors. To estimate factorial effects with high precision when a large number of pre-treatment covariates are present, balance among covariates across treatment groups should be ensured. We propose utilizing rerandomization to ensure covariate balance in factorial designs. Although both factorial designs and rerandomization have been discussed before, the combination has not. Here, theoretical properties of rerandomization for factorial designs are established, and empirical results are explored using an application from the New York Department of Education.

Preprint, 2014, arXiv

Comments on the Neyman-Fisher Controversy and Its Consequences

The Neyman-Fisher controversy considered here originated with the 1935 presentation of Jerzy Neyman's Statistical Problems in Agricultural Experimentation to the Royal Statistical Society. Neyman asserted that the standard ANOVA F-test for randomized complete block designs is valid, whereas the analogous test for Latin squares is invalid in the sense of detecting differentiation among the treatments, when none existed on average, more often than desired (i.e., having a higher Type I error than advertised). However, Neyman's expressions for the expected mean residual sum of squares, for both designs, are generally incorrect. Furthermore, Neyman's belief that the Type I error (when testing the null hypothesis of zero average treatment effects) is higher than desired, whenever the expected mean treatment sum of squares is greater than the expected mean residual sum of squares, is generally incorrect. Simple examples show that, without further assumptions on the potential outcomes, one cannot determine the Type I error of the F-test from expected sums of squares. Ultimately, we believe that the Neyman-Fisher controversy had a deleterious impact on the development of statistics, with a major consequence being that potential outcomes were ignored in favor of linear models and classical statistical procedures that are imprecise without applied contexts.

Preprint, 2012, arXiv

Causal inference from $2^k$ factorial designs using the potential outcomes model

A framework for causal inference from two-level factorial designs is proposed. The framework utilizes the concept of potential outcomes that lies at the center stage of causal inference and extends Neyman's repeated sampling approach for estimation of causal effects and randomization tests based on Fisher's sharp null hypothesis to the case of 2-level factorial experiments. The framework allows for statistical inference from a finite population, permits definition and estimation of estimands other than "average factorial effects" and leads to more flexible inference procedures than those based on ordinary least squares estimation from a linear model.
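To make "average factorial effects" concrete, here is a minimal sketch that estimates main effects and two-factor interactions from the mean observed outcome at each treatment combination, using the usual -1/+1 contrast coding. The function name and data layout are assumptions for this example; it assumes every treatment combination is observed at least once and does not reproduce the paper's inference procedures.

```python
import itertools
import numpy as np

def factorial_effects(levels, y):
    """Estimate main effects and two-factor interactions in a 2^k design.

    levels: (n, k) array with entries -1 or +1 (the level of each factor).
    y:      (n,) array of observed outcomes.
    Returns a dict mapping a tuple of factor indices to the estimated effect.
    """
    levels = np.asarray(levels)
    y = np.asarray(y, dtype=float)
    k = levels.shape[1]
    # All 2^k treatment combinations, in a fixed order.
    combos = np.array(list(itertools.product([-1, 1], repeat=k)))
    # Mean observed outcome for each treatment combination.
    cell_means = np.array([y[(levels == c).all(axis=1)].mean() for c in combos])
    effects = {}
    for order in (1, 2):
        for idx in itertools.combinations(range(k), order):
            # Contrast vector: elementwise product of the chosen factor columns.
            g = combos[:, list(idx)].prod(axis=1)
            effects[idx] = float(g @ cell_means) / 2 ** (k - 1)
    return effects
```

For a main effect, the contrast is the average outcome at the factor's high level minus the average at its low level, taken over the other factors' combinations.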

Preprint, 2012, arXiv

Rerandomization to improve covariate balance in experiments

Randomized experiments are the "gold standard" for estimating causal effects, yet often in practice, chance imbalances exist in covariate distributions between treatment groups. If covariate data are available before units are exposed to treatments, these chance imbalances can be mitigated by first checking covariate balance before the physical experiment takes place. Provided a precise definition of imbalance has been specified in advance, unbalanced randomizations can be discarded, followed by a rerandomization, and this process can continue until a randomization yielding balance according to the definition is achieved. By improving covariate balance, rerandomization provides more precise and trustworthy estimates of treatment effects.
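The procedure described in this abstract can be sketched as a simple accept/reject loop. In the sketch below, the balance criterion is a Mahalanobis distance between group covariate means, and the `threshold` and `max_draws` parameters are assumptions for the example; the abstract only requires that some precise definition of imbalance be fixed in advance.

```python
import numpy as np

def mahalanobis_balance(X, w):
    """Mahalanobis distance between treated and control covariate means."""
    n1, n0 = int(w.sum()), int((1 - w).sum())
    diff = X[w == 1].mean(axis=0) - X[w == 0].mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Covariance of the difference in means under complete randomization.
    cov_diff = cov * (1 / n1 + 1 / n0)
    return float(diff @ np.linalg.solve(cov_diff, diff))

def rerandomize(X, n_treated, threshold, rng, max_draws=10_000):
    """Draw complete randomizations until one meets the balance criterion.

    Returns the accepted 0/1 assignment, its balance value, and the number
    of draws that were needed.
    """
    n = X.shape[0]
    for draw in range(1, max_draws + 1):
        w = np.zeros(n, dtype=int)
        w[rng.choice(n, size=n_treated, replace=False)] = 1
        m = mahalanobis_balance(X, w)
        if m <= threshold:  # balanced enough: keep this randomization
            return w, m, draw
    raise RuntimeError("no acceptable randomization found within max_draws")

# Example usage with simulated covariates (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
w, m, draws = rerandomize(X, n_treated=20, threshold=1.0, rng=rng)
```

A stricter threshold gives better covariate balance but requires more draws on average, so the threshold is a design choice to settle before any assignment is drawn.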
