Source author record

Philip B. Stark

Philip B. Stark appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Applications; Cryptography and Security; cs.CY; Methodology; math.PR; Artificial Intelligence; Computation; Computer Science and Game Theory; math.ST; stat.OT; Statistics Theory

Catalog footprint

What is connected

16 works

11 topics

4 close collaborators


preprint, 2022, arXiv

ALPHA: Audit that Learns from Previously Hand-Audited Ballots

BRAVO, the most widely tried method for risk-limiting election audits, cannot accommodate sampling without replacement or stratified sampling, which can improve efficiency and may be required by law. It applies only to ballot-polling audits, which are less efficient than comparison audits. It applies to plurality, majority, super-majority, proportional representation, and ranked-choice voting contests, but not to many social choice functions for which there are RLA methods, such as approval voting, STAR-voting, Borda count, and general scoring rules. And while BRAVO has the smallest expected sample size among sequentially valid ballot-polling-with-replacement methods when reported vote shares are exactly right, it can require arbitrarily large samples when the reported winner(s) really won but reported vote shares are wrong. ALPHA is a simple generalization of BRAVO that (i) works for sampling with and without replacement and Bernoulli sampling; (ii) increases power for stratified audits by avoiding the need to use a $P$-value combining function or to maximize $P$-values over nuisance parameters within strata, and allowing adaptive sampling across strata; (iii) works not only for ballot-polling but also for ballot-level comparison, batch-polling, and batch-level comparison audits, sampling with or without replacement, uniformly or with weights proportional to size; (iv) works for all social choice functions covered by SHANGRLA; and (v) in situations where both ALPHA and BRAVO apply, requires smaller samples than BRAVO when the reported vote shares are wrong but the outcome is correct--five orders of magnitude in some examples. ALPHA includes the family of betting martingale tests in RiLACS, with a different betting strategy parametrized as an estimator of the population mean and explicit flexibility to accommodate sampling weights and population bounds that vary by draw.
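The betting-martingale idea in this abstract can be illustrated with a minimal sketch for the simplest case only: sampling with replacement from values in [0, u], testing the null that the population mean is at most mu. The function name `alpha_p_value`, the parameters `d` and `eps`, and the simple shrinkage rule for the running estimate are assumptions made for illustration; they are not taken from the abstract, and the paper's own estimators and its handling of sampling without replacement are more general.

```python
def alpha_p_value(sample, eta0, mu=0.5, u=1.0, d=10.0, eps=1e-3):
    """Sketch of an ALPHA-style test for sampling WITH replacement.

    sample : observed values, each in [0, u]
    eta0   : initial guess of the population mean (e.g. reported share)
    mu     : null mean; the null hypothesis is "population mean <= mu"
    d, eps : hypothetical tuning constants for the running estimate
    Returns a sequentially valid P-value, min(1, 1 / max_j T_j).
    """
    t = 1.0            # test supermartingale, T_0 = 1
    t_max = 0.0        # running maximum of T_j over j >= 1
    running_sum = 0.0  # sum of values seen BEFORE the current draw
    for j, x in enumerate(sample, start=1):
        # Estimate of the mean that uses only earlier draws (predictable).
        eta = (d * eta0 + running_sum) / (d + j - 1)
        # Keep the estimate strictly between the null mean and the bound.
        eta = min(max(eta, mu + eps), u - eps)
        # Multiplicative update; with fixed eta and x in {0, 1}
        # this is the familiar BRAVO likelihood-ratio term.
        t *= (x * eta / mu + (u - x) * (u - eta) / (u - mu)) / u
        t_max = max(t_max, t)
        running_sum += x
    return min(1.0, 1.0 / t_max) if t_max > 0 else 1.0


if __name__ == "__main__":
    # 60 ballots for the reported winner, 40 for the loser, interleaved.
    draws = ([1.0] * 3 + [0.0] * 2) * 20
    print(alpha_p_value(draws, eta0=0.6))
```

Because the estimate for draw j depends only on draws 1 to j-1, each factor has expected value at most 1 under the null, which is what lets the reciprocal of the running maximum serve as a P-value.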

preprint, 2022, arXiv

Assessing the accuracy of the Australian Senate count: Key steps for a rigorous and transparent audit

This paper explains the main principles and some of the technical details for auditing the scanning and digitisation of the Australian Senate ballot papers. We give a short summary of the motivation for auditing paper ballots, explain the necessary supporting steps for a rigorous and transparent audit, and suggest some statistical methods that would be appropriate for the Australian Senate. 22 June 2022 Update: The update includes analysis of Senate preference data from the 2022 Australian election.

preprint, 2022, arXiv

Comment on "The statistics wars and intellectual conflicts of interest" by D. Mayo

While P-values are widely abused, they are a useful tool for many purposes; banning them is analogous to banning scalpels because most people do not know how to perform surgery. Many reported P-values are not genuine P-values, for a variety of reasons. Perhaps the most widespread and pernicious problem is the Type III error of testing a statistical hypothesis that has little or no connection to the scientific hypothesis.

preprint, 2022, arXiv

Sweeter than SUITE: Supermartingale Stratified Union-Intersection Tests of Elections

Stratified sampling can be useful in risk-limiting audits (RLAs), for instance, to accommodate heterogeneous voting equipment or laws that mandate jurisdictions draw their audit samples independently. We combine the union-intersection tests in SUITE, the reduction of RLAs to testing whether the means of a collection of lists are all $\leq 1/2$ of SHANGRLA, and the nonnegative supermartingale (NNSM) tests in ALPHA to improve the efficiency and flexibility of stratified RLAs. A simple, non-adaptive strategy for combining stratumwise NNSMs decreases the measured risk in the 2018 pilot hybrid audit in Kalamazoo, Michigan, USA by more than an order of magnitude, from 0.037 for SUITE to 0.003 for our method. We give a simple, computationally inexpensive, adaptive rule for deciding which stratum to sample next that reduces audit workload by as much as 74% in examples. We also present NNSM-based tests that are computationally tractable even when there are many strata, illustrated with a simulated audit stratified across California's 58 counties.

preprint, 2022, arXiv

They may look and look, yet not see: BMDs cannot be tested adequately

Bugs, misconfiguration, and malware can cause ballot-marking devices (BMDs) to print incorrect votes. Several approaches to testing BMDs have been proposed. In logic and accuracy testing (LAT) and parallel or live testing, auditors input known test votes into the BMD and check the printout. Passive testing monitors the rate of "spoiled" BMD printout, on the theory that if BMDs malfunction, the rate will increase noticeably. We show that these approaches cannot reliably detect outcome-altering problems, because: (i) The number of possible interactions with BMDs is enormous, so testing interactions uniformly at random is hopeless. (ii) To probe the space of interactions intelligently requires an accurate model of voter behavior, but because the space of interactions is so large, building an accurate model requires observing a huge number of voters in every jurisdiction in every election--more voters than there are in most jurisdictions. (iii) Even with a perfect model of voter behavior, the number of tests needed exceeds the number of voters in most jurisdictions. (iv) An attacker can target interactions that are expensive to test, e.g., because they involve voting slowly; or interactions for which tampering is less likely to be noticed, e.g., because the voter uses the audio interface. (v) Whether BMDs misbehave or not, the distribution of spoiled ballots is unknown and varies by election and possibly by ballot style: historical data do not help much. Hence, there is no way to calibrate a threshold for passive testing, e.g., to guarantee at least a 95% chance of noticing that 5% of the votes were altered, with at most a 5% false alarm rate. (vi) Even if the distribution of spoiled ballots were known to be Poisson, the vast majority of jurisdictions do not have enough voters for passive testing to have a large chance of detecting problems but only a small chance of false alarms.

preprint, 2020, arXiv

Sets of Half-Average Nulls Generate Risk-Limiting Audits: SHANGRLA

Risk-limiting audits (RLAs) for many social choice functions can be reduced to testing sets of null hypotheses of the form "the average of this list is not greater than 1/2" for a collection of finite lists of nonnegative numbers. Such social choice functions include majority, super-majority, plurality, multi-winner plurality, Instant Runoff Voting (IRV), Borda count, approval voting, and STAR-Voting, among others. The audit stops without a full hand count iff all the null hypotheses are rejected. The nulls can be tested in many ways. Ballot-polling is particularly simple; two new ballot-polling risk-measuring functions for sampling without replacement are given. Ballot-level comparison audits transform each null into an equivalent assertion that the mean of re-scaled tabulation errors is not greater than 1/2. In turn, that null can then be tested using the same statistical methods used for ballot polling---but applied to different finite lists of nonnegative numbers. SHANGRLA comparison audits are more efficient than previous comparison audits for two reasons: (i) for most social choice functions, the conditions tested are both necessary and sufficient for the reported outcome to be correct, while previous methods tested conditions that were sufficient but not necessary, and (ii) the tests avoid a conservative approximation. The SHANGRLA abstraction simplifies stratified audits, including audits that combine ballot polling with ballot-level comparisons, producing sharper audits than the "SUITE" approach. SHANGRLA works with the "phantoms to evil zombies" strategy to treat missing ballot cards and missing or redacted cast vote records. That also facilitates sampling from "ballot-style manifests," which can dramatically improve efficiency when the audited contests do not appear on every ballot card. Open-source software implementing SHANGRLA ballot-level comparison audits is available.
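As a concrete instance of the "average of this list is not greater than 1/2" reduction, here is a minimal sketch for a two-candidate comparison within a plurality contest. The encoding used (1 for a vote for the reported winner, 0 for a vote for the reported loser, 1/2 otherwise) is one common choice and is not spelled out in the abstract; the names `plurality_assorter` and `assertion_holds` are hypothetical.

```python
from fractions import Fraction


def plurality_assorter(vote, winner, loser):
    """Map one ballot to a nonnegative number (hypothetical helper).

    1   if the ballot shows a vote for the reported winner,
    0   if it shows a vote for the reported loser,
    1/2 for anything else (other candidate, undervote, overvote).
    """
    if vote == winner:
        return Fraction(1)
    if vote == loser:
        return Fraction(0)
    return Fraction(1, 2)


def assertion_holds(votes, winner, loser):
    """True iff the mean of the assorter values exceeds 1/2,
    i.e. iff `winner` received strictly more votes than `loser`."""
    values = [plurality_assorter(v, winner, loser) for v in votes]
    return sum(values) / len(values) > Fraction(1, 2)


if __name__ == "__main__":
    votes = ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + [None]
    # The reported outcome "A wins" is correct iff both assertions hold.
    print(assertion_holds(votes, "A", "B"), assertion_holds(votes, "A", "C"))
```

An audit would not see the full list of votes; it would test the null "mean not greater than 1/2" for each such list from a random sample, and stop only when every null is rejected.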

preprint, 2020, arXiv

You can do RLAs for IRV

The City and County of San Francisco, CA, has used Instant Runoff Voting (IRV) for some elections since 2004. This report describes the first ever process pilot of Risk Limiting Audits for IRV, for the San Francisco District Attorney's race in November, 2019. We found that the vote-by-mail outcome could be efficiently audited to well under the 0.05 risk limit given a sample of only 200 ballots. All the software we developed for the pilot is open source.

preprint, 2016, arXiv

Auditing Australian Senate Ballots

We explain why the Australian Electoral Commission should perform an audit of the paper Senate ballots against the published preference data files. We suggest four different post-election audit methods appropriate for Australian Senate elections. We have developed prototype code for all of them and tested it on preference data from the 2016 election.

preprint, 2016, arXiv

Leading the field: Fortune favors the bold in Thurstonian choice models

Schools with the highest average student performance are often the smallest schools; localities with the highest rates of some cancers are frequently small; and the effects observed in clinical trials are likely to be largest for the smallest numbers of subjects. Informal explanations of this "small-schools phenomenon" point to the fact that the sample means of smaller samples have higher variances. But this cannot be a complete explanation: If we draw two samples from a diffuse distribution that is symmetric about some point, then the chance that the smaller sample has larger mean is 50%. A particular consequence of results proved below is that if one draws three or more samples of different sizes from the same normal distribution, then the sample mean of the smallest sample is most likely to be highest, the sample mean of the second smallest sample is second most likely to be highest, and so on; this is true even though for any pair of samples, each one of the pair is equally likely to have the larger sample mean. Our conclusions are relevant to certain stochastic choice models including the following generalization of Thurstone's Law of Comparative Judgment. There are $n$ items. Item $i$ is preferred to item $j$ if $Z_i < Z_j$, where $Z$ is a random $n$-vector of preference scores. Suppose $\mathbb{P}\{Z_i = Z_j\} = 0$ for $i \ne j$, so there are no ties. Item $k$ is the favorite if $Z_k < \min_{i\ne k} Z_i$. Let $p_i$ denote the chance that item $i$ is the favorite. We characterize a large class of distributions for $Z$ for which $p_1 > p_2 > \cdots > p_n$. Our results are most surprising when $\mathbb{P}\{Z_i < Z_j\} = \mathbb{P}\{Z_i > Z_j\} = \frac{1}{2}$ for $i \ne j$, so neither of any two items is likely to be preferred over the other in a pairwise comparison.
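For exactly three independent samples from the same normal distribution, the claim about the smallest sample can be checked numerically. The sketch below does not reproduce the paper's proof: it relies on the standard orthant probability for a centered bivariate normal pair with correlation rho, P(X > 0, Y > 0) = 1/4 + arcsin(rho)/(2 pi), applied to the two differences between one sample mean and the others. The function name `prob_highest_mean` is hypothetical.

```python
import math


def prob_highest_mean(sizes):
    """Chance that each of three independent samples, drawn from the
    same normal distribution, has the highest sample mean.

    sizes : three sample sizes (n1, n2, n3)
    The variance of a sample mean is proportional to 1/n; the common
    scale and location cancel, so unit variance is used here.
    """
    if len(sizes) != 3:
        raise ValueError("this sketch handles exactly three samples")
    variances = [1.0 / n for n in sizes]
    probs = []
    for k in range(3):
        i, j = [m for m in range(3) if m != k]
        vk, vi, vj = variances[k], variances[i], variances[j]
        # Correlation between (M_k - M_i) and (M_k - M_j).
        rho = vk / math.sqrt((vk + vi) * (vk + vj))
        # Orthant probability: both differences are positive.
        probs.append(0.25 + math.asin(rho) / (2 * math.pi))
    return probs


if __name__ == "__main__":
    # Smallest sample first: probabilities decrease with sample size.
    print(prob_highest_mean([5, 20, 80]))
```

Each pairwise comparison is still a fair coin flip, since the difference of two sample means is symmetric about zero; the ordering only appears when three or more samples compete.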

preprint, 2015, arXiv

Mini-Minimax Uncertainty Quantification for Emulators

Consider approximating a "black box" function $f$ by an emulator $\hat{f}$ based on $n$ noiseless observations of $f$. Let $w$ be a point in the domain of $f$. How big might the error $|\hat{f}(w) - f(w)|$ be? If $f$ could be arbitrarily rough, this error could be arbitrarily large: we need some constraint on $f$ besides the data. Suppose $f$ is Lipschitz with known constant. We find a lower bound on the number of observations required to ensure that for the best emulator $\hat{f}$ based on the $n$ data, $|\hat{f}(w) - f(w)| \le \varepsilon$. But in general, we will not know whether $f$ is Lipschitz, much less know its Lipschitz constant. Assume optimistically that $f$ is Lipschitz-continuous with the smallest constant consistent with the $n$ data. We find the maximum (over such regular $f$) of $|\hat{f}(w) - f(w)|$ for the best possible emulator $\hat{f}$; we call this the "mini-minimax uncertainty" at $w$. In reality, $f$ might not be Lipschitz or---if it is---it might not attain its Lipschitz constant on the data. Hence, the mini-minimax uncertainty at $w$ could be much smaller than $|\hat{f}(w) - f(w)|$. But if the mini-minimax uncertainty is large, then---even if $f$ satisfies the optimistic regularity assumption---$|\hat{f}(w) - f(w)|$ could be large, no matter how cleverly we choose $\hat{f}$. For the Community Atmosphere Model, the maximum (over $w$) of the mini-minimax uncertainty based on a set of 1154 observations of $f$ is no smaller than it would be for a single observation of $f$ at the centroid of the 21-dimensional parameter space. We also find lower confidence bounds for quantiles of the mini-minimax uncertainty and its mean over the domain of $f$. For the Community Atmosphere Model, these lower confidence bounds are an appreciable fraction of the maximum.
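A minimal sketch of the quantity described above, under assumptions not stated in the abstract: distances are Euclidean, the observation points are distinct, and the worst-case error of the best emulator at $w$ is taken to be half the width of the interval of values at $w$ that a Lipschitz function with constant K can attain while matching the data. The function names are hypothetical.

```python
import math
from itertools import combinations


def smallest_lipschitz_constant(xs, fs):
    """Smallest K with |f_i - f_j| <= K * dist(x_i, x_j) for all pairs.
    Assumes at least two observations at distinct points."""
    return max(
        abs(fi - fj) / math.dist(xi, xj)
        for (xi, fi), (xj, fj) in combinations(zip(xs, fs), 2)
    )


def mini_minimax_uncertainty(w, xs, fs):
    """Half-width of the feasible interval for f(w), given the data and
    the smallest Lipschitz constant consistent with the data."""
    k = smallest_lipschitz_constant(xs, fs)
    # Largest and smallest values f(w) can take without violating
    # the Lipschitz condition against any observation.
    upper = min(f + k * math.dist(w, x) for x, f in zip(xs, fs))
    lower = max(f - k * math.dist(w, x) for x, f in zip(xs, fs))
    # The midpoint of [lower, upper] minimizes the worst-case error,
    # and that worst-case error is the half-width.
    return (upper - lower) / 2.0


if __name__ == "__main__":
    xs = [(0.0,), (1.0,)]
    fs = [0.0, 1.0]
    print(mini_minimax_uncertainty((2.0,), xs, fs))
```

In this toy example the uncertainty is zero between the two observations, because a function that attains the constant on the data is forced to be linear there, and it grows as $w$ moves away from the data.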

preprint, 2015, arXiv

Some people have all the luck

We look at the Florida Lottery records of winners of prizes worth $600 or more. Some individuals claimed large numbers of prizes. Were they lucky, or up to something? We distinguish the "plausibly lucky" from the "implausibly lucky" by solving optimization problems that take into account the particular games each gambler won, where plausibility is determined by finding the minimum expenditure so that if every Florida resident spent that much, the chance that any of them would win as often as the gambler did would still be less than one in a million. Dealing with dependent bets relies on the BKR inequality; solving the optimization problem numerically relies on the log-concavity of the regularized Beta function. Subsequent investigation by law enforcement confirmed that the gamblers we identified as "implausibly lucky" were indeed behaving illegally.

preprint, 2014, arXiv

Only the Bad Die Young: Restaurant Mortality in the Western US

Do 9 out of 10 restaurants fail in their first year, as commonly claimed? No. Survival analysis of 1.9 million longitudinal microdata for 81,000 full-service restaurants in a 20-year U.S. Bureau of Labor Statistics non-public census of business establishments in the western US shows that only 17 percent of independently owned full-service restaurant startups failed in their first year, compared with 19 percent for all other service-providing startups. The median lifespan of restaurants is about 4.5 years, slightly longer than that of other service businesses (4.25 years). However, the median lifespan of a restaurant startup with 5 or fewer employees is 3.75 years, slightly shorter than that of other service businesses of the same startup size (4.0 years).

preprint, 2012, arXiv

Limiting Risk by Turning Manifest Phantoms into Evil Zombies

Drawing a random sample of ballots to conduct a risk-limiting audit generally requires knowing how the ballots cast in an election are organized into groups, for instance, how many containers of ballots there are in all and how many ballots are in each container. A list of the ballot group identifiers along with number of ballots in each group is called a ballot manifest. What if the ballot manifest is not accurate? Surprisingly, even if ballots are known to be missing from the manifest, it is not necessary to make worst-case assumptions about those ballots--for instance, to adjust the margin by the number of missing ballots--to ensure that the audit remains conservative. Rather, it suffices to make worst-case assumptions about the individual randomly selected ballots that the audit cannot find. This observation provides a simple modification to some risk-limiting audit procedures that makes them automatically become more conservative if the ballot manifest has errors. The modification--phantoms to evil zombies (~2EZ)--requires only an upper bound on the total number of ballots cast. ~2EZ makes the audit P-value stochastically larger than it would be had the manifest been accurate, automatically requiring more than enough ballots to be audited to offset the manifest errors. This ensures that the true risk limit remains smaller than the nominal risk limit. On the other hand, if the manifest is in fact accurate and the upper bound on the total number of ballots equals the total according to the manifest, ~2EZ has no effect at all on the number of ballots audited nor on the true risk limit.

preprint, 2012, arXiv

STAR-Vote: A Secure, Transparent, Auditable, and Reliable Voting System

In her 2011 EVT/WOTE keynote, Travis County, Texas County Clerk Dana DeBeauvoir described the qualities she wanted in her ideal election system to replace their existing DREs. In response, in April of 2012, the authors, working with DeBeauvoir and her staff, jointly architected STAR-Vote, a voting system with a DRE-style human interface and a "belt and suspenders" approach to verifiability. It provides both a paper trail and end-to-end cryptography using COTS hardware. It is designed to support both ballot-level risk-limiting audits, and auditing by individual voters and observers. The human interface and process flow is based on modern usability research. This paper describes the STAR-Vote architecture, which could well be the next-generation voting system for Travis County and perhaps elsewhere.

preprint, 2011, arXiv

SOBA: Secrecy-preserving Observable Ballot-level Audit

SOBA is an approach to election verification that provides observers with justifiably high confidence that the reported results of an election are consistent with an audit trail ("ballots"), which can be paper or electronic. SOBA combines three ideas: (1) publishing cast vote records (CVRs) separately for each contest, so that anyone can verify that each reported contest outcome is correct, if the CVRs reflect voters' intentions with sufficient accuracy; (2) shrouding a mapping between ballots and the CVRs for those ballots to prevent the loss of privacy that could occur otherwise; (3) assessing the accuracy with which the CVRs reflect voters' intentions for a collection of contests while simultaneously assessing the integrity of the shrouded mapping between ballots and CVRs by comparing randomly selected ballots to the CVRs that purport to represent them. Step (1) is related to work by the Humboldt County Election Transparency Project, but publishing CVRs separately for individual contests rather than images of entire ballots preserves privacy. Step (2) requires a cryptographic commitment from elections officials. Observers participate in step (3), which relies on the "super-simple simultaneous single-ballot risk-limiting audit." Step (3) is designed to reveal relatively few ballots if the shrouded mapping is proper and the CVRs accurately reflect voter intent. But if the reported outcomes of the contests differ from the outcomes that a full hand count would show, step (3) is guaranteed to have a large chance of requiring all the ballots to be counted by hand, thereby limiting the risk that an incorrect outcome will become official and final.

preprint, 2009, arXiv

Implementing Risk-Limiting Post-Election Audits in California

Risk-limiting post-election audits limit the chance of certifying an electoral outcome if the outcome is not what a full hand count would show. Building on previous work, we report on pilot risk-limiting audits in four elections during 2008 in three California counties: one during the February 2008 Primary Election in Marin County and three during the November 2008 General Elections in Marin, Santa Cruz and Yolo Counties. We explain what makes an audit risk-limiting and how existing and proposed laws fall short. We discuss the differences among our four pilot audits. We identify challenges to practical, efficient risk-limiting audits and conclude that current approaches are too complex to be used routinely on a large scale. One important logistical bottleneck is the difficulty of exporting data from commercial election management systems in a format amenable to audit calculations. Finally, we propose a bare-bones risk-limiting audit that is less efficient than these pilot audits, but avoids many practical problems.
