Source author record

Omkar Muralidharan

Omkar Muralidharan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Topics: Applications; Machine Learning; Methodology; Computer Science and Game Theory

Catalog footprint

What is connected

6 works

4 topics

4 close collaborators

Preprint · 2015 · arXiv

Second Order Calibration: A Simple Way to Get Approximate Posteriors

Many large-scale machine learning problems involve estimating an unknown parameter $\theta_i$ for each of many items. For example, a key problem in sponsored search is to estimate the click-through rate (CTR) of each of billions of query-ad pairs. Most common methods, though, only give a point estimate of each $\theta_i$. A posterior distribution for each $\theta_i$ is usually more useful but harder to get. We present a simple post-processing technique that takes point estimates or scores $t_i$ (from any method) and estimates an approximate posterior for each $\theta_i$. We build on the idea of calibration, a common post-processing technique that estimates $\mathrm{E}(\theta_i \mid t_i)$. Our method, second order calibration, uses empirical Bayes methods to estimate the distribution of $\theta_i \mid t_i$ and uses the estimated distribution as an approximation to the posterior distribution of $\theta_i$. We show that this can yield improved point estimates and useful accuracy estimates. The method scales to large problems: our motivating example is a CTR estimation problem involving tens of billions of query-ad pairs.
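As a rough illustration of the idea in this abstract, the sketch below estimates a distribution for the CTRs of items that share a similar score and then uses it as a prior for each item. It is not the paper's estimator: the Beta prior, the method-of-moments fit, the equal impression count per item, and all function and variable names are assumptions made for this example, and the binning of items by score is assumed to have happened upstream.

```python
import random
from statistics import mean, pvariance


def fit_beta_moments(clicks, n):
    """Method-of-moments Beta(a, b) fit for the CTRs in one score bin.

    Assumes every item in the bin received the same number of impressions n
    (a simplification made for this sketch) and that n > 1.
    """
    rates = [c / n for c in clicks]
    m = mean(rates)
    binom_var = m * (1 - m) / n          # variance if all items shared one CTR
    if binom_var == 0:
        raise ValueError("all rates are 0 or 1; cannot fit a Beta prior")
    # Overdispersion relative to a pure binomial gives rho = 1 / (a + b + 1).
    rho = (pvariance(rates) / binom_var - 1) / (n - 1)
    rho = min(max(rho, 1e-6), 1 - 1e-6)  # keep the fitted prior proper
    total = 1 / rho - 1                  # a + b
    return m * total, (1 - m) * total


def approximate_posterior(a, b, clicks, n):
    """Beta posterior parameters and posterior mean for one item."""
    a_post, b_post = a + clicks, b + n - clicks
    return a_post, b_post, a_post / (a_post + b_post)


# Simulated bin: true CTRs drawn from Beta(2, 38), 200 impressions per item.
rng = random.Random(0)
n_impressions = 200
true_ctrs = [rng.betavariate(2, 38) for _ in range(2000)]
bin_clicks = [
    sum(rng.random() < ctr for _ in range(n_impressions)) for ctr in true_ctrs
]

a_hat, b_hat = fit_beta_moments(bin_clicks, n_impressions)
_, _, shrunk = approximate_posterior(a_hat, b_hat, clicks=30, n=n_impressions)
print(f"prior mean {a_hat / (a_hat + b_hat):.3f}, "
      f"raw rate {30 / n_impressions:.3f}, shrunk estimate {shrunk:.3f}")
```

The posterior mean for an item sits between the bin-level mean and the item's raw click rate, and the Beta parameters give a rough accuracy estimate alongside the point estimate.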

Preprint · 2015 · arXiv

Teaching Statistics at Google Scale

Modern data and applications pose very different challenges from those of the 1950s or even the 1980s. Students contemplating a career in statistics or data science need to have the tools to tackle problems involving massive, heavy-tailed data, often interacting with live, complex systems. However, despite the deepening connections between engineering and modern data science, we argue that training in classical statistical concepts plays a central role in preparing students to solve Google-scale problems. To this end, we present three industrial applications where significant modern data challenges were overcome by statistical thinking.

Preprint · 2014 · arXiv

Feedback Detection for Live Predictors

A predictor that is deployed in a live production system may perturb the features it uses to make predictions. Such a feedback loop can occur, for example, when a model that predicts a certain type of behavior ends up causing the behavior it predicts, thus creating a self-fulfilling prophecy. In this paper we analyze predictor feedback detection as a causal inference problem, and introduce a local randomization scheme that can be used to detect non-linear feedback in real-world problems. We conduct a pilot study for our proposed methodology using a predictive system currently deployed as a part of a search engine.

Preprint · 2012 · arXiv

Detecting mutations in mixed sample sequencing data using empirical Bayes

We develop statistically based methods to detect single nucleotide DNA mutations in next generation sequencing data. Sequencing generates counts of the number of times each base was observed at hundreds of thousands to billions of genome positions in each sample. Using these counts to detect mutations is challenging because mutations may have very low prevalence and sequencing error rates vary dramatically by genome position. The discreteness of sequencing data also creates a difficult multiple testing problem: current false discovery rate methods are designed for continuous data, and work poorly, if at all, on discrete data. We show that a simple randomization technique lets us use continuous false discovery rate methods on discrete data. Our approach is a useful way to estimate false discovery rates for any collection of discrete test statistics, and is hence not limited to sequencing data. We then use an empirical Bayes model to capture different sources of variation in sequencing error rates. The resulting method outperforms existing detection approaches on example data sets.
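One standard way to make discrete p-values usable with continuous false discovery rate methods is the randomized p-value, which spreads the probability mass at the observed count uniformly over an interval. The sketch below shows that construction for an upper-tail binomial test; the abstract does not spell out its randomization scheme, so treating it as this construction is an assumption, and the read depth, error rate, and function names are hypothetical.

```python
import random
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def randomized_p_value(k, n, p0, u):
    """Upper-tail randomized p-value: P(X > k) + u * P(X = k).

    With u ~ Uniform(0, 1) drawn independently of the data, this value is
    exactly uniform on (0, 1) when X ~ Binomial(n, p0).
    """
    tail_above = sum(binom_pmf(j, n, p0) for j in range(k + 1, n + 1))
    return tail_above + u * binom_pmf(k, n, p0)


def draw_binomial(n, p, rng):
    """Draw one Binomial(n, p) count using n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


# Hypothetical setting: depth 20 at each position, null error rate 1%.
rng = random.Random(0)
depth, error_rate = 20, 0.01
null_p_values = [
    randomized_p_value(draw_binomial(depth, error_rate, rng), depth, error_rate, rng.random())
    for _ in range(5000)
]
print(f"mean of null p-values: {sum(null_p_values) / len(null_p_values):.3f}")
```

Without the random term, most null positions here would share the p-value 1.0 because the count is usually zero; with it, the null p-values behave like the continuous uniform values that standard procedures expect.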

Preprint · 2012 · arXiv

On Calibrated Predictions for Auction Selection Mechanisms

Calibration is a basic property for prediction systems, and algorithms for achieving it are well-studied in both statistics and machine learning. In many applications, however, the predictions are used to make decisions that select which observations are made. This makes calibration difficult, as adjusting predictions to achieve calibration changes future data. We focus on click-through-rate (CTR) prediction for search ad auctions. Here, CTR predictions are used by an auction that determines which ads are shown, and we want to maximize the value generated by the auction. We show that certain natural notions of calibration can be impossible to achieve, depending on the details of the auction. We also show that it can be impossible to maximize auction efficiency while using calibrated predictions. Finally, we give conditions under which calibration is achievable and simultaneously maximizes auction efficiency: roughly speaking, bids and queries must not contain information about CTRs that is not already captured by the predictions.
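For readers unfamiliar with the term, a set of predictions is calibrated when, among items given a predicted CTR near p, roughly a fraction p are clicked. A minimal sketch of a binned check on simulated data follows; the bin count, data, and function name are assumptions for illustration and are not taken from the paper.

```python
import random


def calibration_table(predictions, outcomes, n_bins=10):
    """Group predictions into equal-width bins on [0, 1].

    Returns one (count, mean prediction, observed rate) tuple per non-empty bin.
    """
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # count, sum of preds, clicks
    for p, y in zip(predictions, outcomes):
        i = min(int(p * n_bins), n_bins - 1)
        bins[i][0] += 1
        bins[i][1] += p
        bins[i][2] += y
    return [(c, sp / c, sy / c) for c, sp, sy in bins if c > 0]


# Simulated, perfectly calibrated predictor: clicks are drawn with the
# predicted probability itself.
rng = random.Random(0)
preds = [rng.uniform(0.0, 0.2) for _ in range(20000)]
clicks = [int(rng.random() < p) for p in preds]

table = calibration_table(preds, clicks)
for count, mean_pred, observed in table:
    print(f"n={count:6d}  mean prediction={mean_pred:.3f}  observed CTR={observed:.3f}")
```

This check treats the data as fixed. The abstract's point is that in an auction the predictions determine which ads are shown, so adjusting them to pass such a check changes the data the check is computed on.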

Preprint · 2010 · arXiv

An empirical Bayes mixture method for effect size and false discovery rate estimation

Many statistical problems involve data from thousands of parallel cases. Each case has some associated effect size, and most cases will have no effect. It is often important to estimate the effect size and the local or tail-area false discovery rate for each case. Most current methods do this separately, and most are designed for normal data. This paper uses an empirical Bayes mixture model approach to estimate both quantities together for exponential family data. The proposed method yields simple, interpretable models that can still be used nonparametrically. It can also estimate an empirical null and incorporate it fully into the model. The method outperforms existing effect size and false discovery rate estimation procedures in normal data simulations; it nearly achieves the Bayes error for effect size estimation. The method is implemented in an R package (mixfdr), freely available from CRAN.
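To make the two quantities concrete, the sketch below computes a local false discovery rate and an effect size estimate under a simple two-group normal mixture. It assumes the mixture parameters are already known, whereas the paper estimates them from the data; it does not use or reproduce the mixfdr package, and the parameter values and function names are hypothetical.

```python
from math import exp, pi, sqrt


def normal_pdf(z, mu, var):
    """Density of N(mu, var) at z."""
    return exp(-((z - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def local_fdr(z, pi0, mu, tau2):
    """P(case is null | z) under the assumed two-group model.

    Null cases: effect 0, so z ~ N(0, 1).
    Non-null cases: effect ~ N(mu, tau2), so marginally z ~ N(mu, 1 + tau2).
    """
    null_part = pi0 * normal_pdf(z, 0.0, 1.0)
    alt_part = (1 - pi0) * normal_pdf(z, mu, 1.0 + tau2)
    return null_part / (null_part + alt_part)


def effect_estimate(z, pi0, mu, tau2):
    """Posterior mean of the effect: zero if null, shrunk toward mu otherwise."""
    nonnull_mean = mu + tau2 / (1.0 + tau2) * (z - mu)
    return (1 - local_fdr(z, pi0, mu, tau2)) * nonnull_mean


# Hypothetical parameters: 90% nulls, non-null effects ~ N(3, 1).
for z in (0.0, 2.0, 5.0):
    fdr = local_fdr(z, pi0=0.9, mu=3.0, tau2=1.0)
    est = effect_estimate(z, pi0=0.9, mu=3.0, tau2=1.0)
    print(f"z={z:.1f}  local fdr={fdr:.4f}  effect estimate={est:.3f}")
```

Because both quantities come from the same mixture, a case with a high local false discovery rate automatically gets an effect estimate close to zero, which is the sense in which the two are estimated together.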