Source author record

Dean P. Foster

Dean P. Foster appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Machine Learning, math.ST, Methodology, Statistics Theory, math.OC, Applications, gr-qc, Systems and Control

Catalog footprint

16 works, 8 topics, 4 close collaborators

preprint · 2022 · arXiv

A Few Expert Queries Suffices for Sample-Efficient RL with Resets and Linear Value Approximation

The current paper studies sample-efficient Reinforcement Learning (RL) in settings where only the optimal value function is assumed to be linearly-realizable. It has recently been understood that, even under this seemingly strong assumption and access to a generative model, worst-case sample complexities can be prohibitively (i.e., exponentially) large. We investigate the setting where the learner additionally has access to interactive demonstrations from an expert policy, and we present a statistically and computationally efficient algorithm (Delphi) for blending exploration with expert queries. In particular, Delphi requires $\tilde{\mathcal{O}}(d)$ expert queries and a $\texttt{poly}(d,H,|\mathcal{A}|,1/\varepsilon)$ amount of exploratory samples to provably recover an $\varepsilon$-suboptimal policy. Compared to pure RL approaches, this corresponds to an exponential improvement in sample complexity with surprisingly-little expert input. Compared to prior imitation learning (IL) approaches, our required number of expert demonstrations is independent of $H$ and logarithmic in $1/\varepsilon$, whereas all prior work required at least linear factors of both in addition to the same dependence on $d$. Towards establishing the minimal amount of expert queries needed, we show that, in the same setting, any learner whose exploration budget is polynomially-bounded (in terms of $d,H,$ and $|\mathcal{A}|$) will require at least $\tilde{\Omega}(\sqrt{d})$ oracle calls to recover a policy competing with the expert's value function. Under the weaker assumption that the expert's policy is linear, we show that the lower bound increases to $\tilde{\Omega}(d)$.

preprint · 2022 · arXiv

Impartial Predictive Modeling and the Use of Proxy Variables

Fairness-aware data mining (FADM) aims to prevent algorithms from discriminating against protected groups. The literature has come to an impasse as to what constitutes explainable variability as opposed to discrimination. This distinction hinges on a rigorous understanding of the role of proxy variables; i.e., those variables which are associated with both the protected feature and the outcome of interest. We demonstrate that fairness is achieved by ensuring impartiality with respect to sensitive characteristics and provide a framework for impartiality by accounting for different perspectives on the data generating process. In particular, fairness can only be precisely defined in a full-data scenario in which all covariates are observed. We then analyze how these models may be conservatively estimated via regression in partial-data settings. Decomposing the regression estimates provides insights into previously unexplored distinctions between explainable variability and discrimination that illuminate the use of proxy variables in fairness-aware data mining.

preprint · 2022 · arXiv

The Benefits of Implicit Regularization from SGD in Least Squares Problems

Stochastic gradient descent (SGD) exhibits strong algorithmic regularization effects in practice, which has been hypothesized to play an important role in the generalization of modern machine learning approaches. In this work, we seek to understand these issues in the simpler setting of linear regression (including both underparameterized and overparameterized regimes), where our goal is to make sharp instance-based comparisons of the implicit regularization afforded by (unregularized) average SGD with the explicit regularization of ridge regression. For a broad class of least squares problem instances (that are natural in high-dimensional settings), we show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance. Taken together, our results show that, up to the logarithmic factors, the generalization performance of SGD is always no worse than that of ridge regression in a wide range of overparameterized problems, and, in fact, could be much better for some problem instances. More generally, our results show how algorithmic regularization has important consequences even in simpler (overparameterized) convex settings.
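As a rough illustration of the two estimators this abstract compares, the sketch below runs one pass of constant-stepsize SGD with iterate averaging and sets it beside the closed-form ridge solution. The toy data, the stepsize `0.01`, the ridge parameter `1.0`, and all function names are assumptions made for the example. The problem here is small and well-conditioned, so it does not reproduce the high-dimensional regime or the sample-complexity results the paper analyzes.

```python
import numpy as np

def averaged_sgd(X, y, stepsize):
    """One pass of constant-stepsize SGD on squared loss; returns the average iterate."""
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for i in range(n):
        grad = (X[i] @ w - y[i]) * X[i]  # gradient of 0.5 * (x.w - y)^2 at sample i
        w = w - stepsize * grad
        w_sum += w
    return w_sum / n

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def excess_risk(w, w_star):
    """Excess prediction risk under isotropic unit-variance features: ||w - w*||^2."""
    return float(np.sum((w - w_star) ** 2))

# Hypothetical toy problem (assumed for illustration only).
rng = np.random.default_rng(0)
n, d = 2000, 5
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)

w_sgd = averaged_sgd(X, y, stepsize=0.01)
w_ridge = ridge(X, y, lam=1.0)
print("averaged SGD excess risk:", excess_risk(w_sgd, w_star))
print("ridge excess risk:       ", excess_risk(w_ridge, w_star))
```

Averaging the iterates is what makes a constant stepsize usable here: the last iterate keeps fluctuating, while the average smooths that noise out.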

preprint · 2020 · arXiv

Fitting High-Dimensional Interaction Models with Error Control

There is a renewed interest in polynomial regression in the form of identifying influential interactions between features. In many settings, this takes place in a high-dimensional model, making the number of interactions unwieldy or computationally infeasible. Furthermore, it is difficult to analyze such spaces directly as they are often highly correlated. Standard feature selection issues remain such as how to determine a final model which generalizes well. This paper solves these problems with a sequential algorithm called Revisiting Alpha-Investing (RAI). RAI is motivated by the principle of marginality and searches the feature-space of higher-order interactions by greedily building upon lower-order terms. RAI controls a notion of false rejections and comes with a performance guarantee relative to the best-subset model. This ensures that signal is identified while providing a valid stopping criterion to prevent over-selection. We apply RAI in a novel setting over a family of regressions in order to select gene-specific interaction models for differential expression profiling.

preprint · 2016 · arXiv

Kernel ridge vs. principal component regression: minimax bounds and adaptability of regularization operators

Regularization is an essential element of virtually all kernel methods for nonparametric regression problems. A critical factor in the effectiveness of a given kernel method is the type of regularization that is employed. This article compares and contrasts members from a general class of regularization techniques, which notably includes ridge regression and principal component regression. We derive an explicit finite-sample risk bound for regularization-based estimators that simultaneously accounts for (i) the structure of the ambient function space, (ii) the regularity of the true regression function, and (iii) the adaptability (or qualification) of the regularization. A simple consequence of this upper bound is that the risk of the regularization-based estimators matches the minimax rate in a variety of settings. The general bound also illustrates how some regularization techniques are more adaptable than others to favorable regularity properties that the true regression function may possess. This, in particular, demonstrates a striking difference between kernel ridge regression and kernel principal component regression. Our theoretical results are supported by numerical experiments.

preprint · 2016 · arXiv

Orbiting Radiation Stars

We study a numerical solution to Einstein's equation for a compact object composed of null particles. The solution avoids quantum scale regimes and hence neither relies upon nor ignores the interaction of quantum mechanics and gravitation. The solution exhibits a deep gravitational well yet remains singularity free. In fact, the solution is geometrically flat in the vicinity of the origin with the flat region being of any desirable scale. The solution is also observationally distinct from a black hole because a photon from infinity aimed at an object centered on the origin passes through the origin and escapes to infinity with a time delay.

preprint · 2016 · arXiv

Submodularity in Statistics: Comparing the Success of Model Selection Methods

We demonstrate the usefulness of submodularity in statistics as a characterization of the difficulty of the \emph{search} problem of feature selection. The search problem is the ability of a procedure to identify an informative set of features as opposed to the performance of the optimal set of features. Submodularity arises naturally in this setting due to its connection to combinatorial optimization. In statistics, submodularity isolates cases in which collinearity makes the choice of model features difficult from those in which this task is routine. Researchers often report the signal-to-noise ratio to measure the difficulty of simulated data examples. A measure of submodularity should also be provided as it characterizes an independent component of difficulty. Furthermore, it is closely related to other statistical assumptions used in the development of the Lasso, Dantzig selector, and sure independence screening.

preprint · 2015 · arXiv

A Risk Ratio Comparison of $l_0$ and $l_1$ Penalized Regression

There has been an explosion of interest in using $l_1$-regularization in place of $l_0$-regularization for feature selection. We present theoretical results showing that while $l_1$-penalized linear regression never outperforms $l_0$-regularization by more than a constant factor, in some cases using an $l_1$ penalty is infinitely worse than using an $l_0$ penalty. We also compare algorithms for solving these two problems and show that although solutions can be found efficiently for the $l_1$ problem, the "optimal" $l_1$ solutions are often inferior to $l_0$ solutions found using classic greedy stepwise regression. Furthermore, we show that solutions obtained by solving the convex $l_1$ problem can be improved by selecting the best of the $l_1$ models (for different regularization penalties) by using an $l_0$ criterion. In other words, an approximate solution to the right problem can be better than the exact solution to the wrong problem.
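The greedy stepwise regression mentioned above can be sketched as plain forward selection: at each step, add the single feature that most reduces the residual sum of squares. This is a generic textbook version, not necessarily the exact procedure evaluated in the paper. The function name `forward_stepwise`, the fixed model size `k`, and the synthetic data are assumptions for illustration.

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily select k columns of X, each time adding the one that most reduces RSS."""
    n, p = X.shape
    selected = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(p):
            if j in selected:
                continue
            cols = selected + [j]
            # Refit least squares on the candidate feature set.
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = float(np.sum((y - X[:, cols] @ coef) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

# Hypothetical sparse problem: 3 active features out of 20 (assumed for illustration).
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[2, 7, 11]] = 3.0
y = X @ beta + 0.1 * rng.normal(size=n)

chosen = forward_stepwise(X, y, k=3)
print("selected features:", sorted(chosen))
```

In practice the stopping point would come from an $l_0$-style criterion rather than a fixed `k`; the fixed size just keeps the sketch short.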

preprint · 2014 · arXiv

Fast Ridge Regression with Randomized Principal Component Analysis and Gradient Descent

We propose a new two-stage algorithm, LING, for large scale regression problems. LING has the same risk as the well-known Ridge Regression under the fixed design setting and can be computed much faster. Our experiments have shown that LING performs well in terms of both prediction accuracy and computational efficiency compared with other large scale regression algorithms like Gradient Descent, Stochastic Gradient Descent and Principal Component Regression on both simulated and real datasets.

preprint · 2014 · arXiv

Large scale canonical correlation analysis with iterative least squares

Canonical Correlation Analysis (CCA) is a widely used statistical tool with both well established theory and favorable performance for a wide range of machine learning problems. However, computing CCA for huge datasets can be very slow since it involves implementing QR decomposition or singular value decomposition of huge matrices. In this paper we introduce L-CCA, an iterative algorithm which can compute CCA fast on huge sparse datasets. Theory on both the asymptotic convergence and finite time accuracy of L-CCA is established. The experiments also show that L-CCA outperforms other fast CCA approximation schemes on two real datasets.
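For context, the classical dense computation that the abstract describes as slow can be sketched with a thin QR decomposition of each centered data matrix followed by one SVD. This is the standard baseline, not L-CCA itself, whose details are not given here. The function name `cca_qr_svd` and the toy data are assumptions, and the sketch assumes both matrices have full column rank.

```python
import numpy as np

def cca_qr_svd(X, Y):
    """Canonical correlations of X and Y via thin QR plus one SVD (classical baseline)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)  # orthonormal basis for the column space of centered X
    Qy, _ = np.linalg.qr(Yc)  # orthonormal basis for the column space of centered Y
    # Singular values of Qx^T Qy are the cosines of the principal angles
    # between the two column spaces, i.e. the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Hypothetical data: Y_dep is an exact linear function of X, Y_ind is independent noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
A = rng.normal(size=(4, 3))
Y_dep = X @ A
Y_ind = rng.normal(size=(500, 3))

corr_dep = cca_qr_svd(X, Y_dep)
corr_ind = cca_qr_svd(X, Y_ind)
print("dependent views:  ", corr_dep)
print("independent views:", corr_ind)
```

Both the QR and the SVD operate on dense factors of the full data, which is the cost an iterative least-squares scheme aims to avoid on huge sparse inputs.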

preprint · 2013 · arXiv

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

We compare the risk of ridge regression to a simple variant of ordinary least squares, in which one simply projects the data onto a finite dimensional subspace (as specified by a Principal Component Analysis) and then performs an ordinary (un-regularized) least squares regression in this subspace. This note shows that the risk of this ordinary least squares method is within a constant factor (namely 4) of the risk of ridge regression.
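The two estimators compared in this note can be sketched as follows: ordinary least squares after projecting onto the top principal directions, and closed-form ridge regression. How the subspace dimension should be chosen to obtain the factor-of-4 guarantee is not spelled out in the abstract, so `k` below is a hypothetical user-chosen parameter, and the single run printed at the end is not a check of that guarantee. The function names and synthetic data are likewise assumptions.

```python
import numpy as np

def pca_ols(X, y, k):
    """Unregularized least squares after projecting X onto its top-k principal directions."""
    # For centered X, the right singular vectors are the PCA directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T                                  # d x k basis of the subspace
    Z = X @ V_k                                     # n x k projected data
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)   # OLS in the subspace
    return V_k @ gamma                              # coefficients in original coordinates

def ridge(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def fixed_design_risk(X, w, w_star):
    """In-sample prediction error (1/n) * ||X (w - w*)||^2."""
    return float(np.mean((X @ (w - w_star)) ** 2))

# Hypothetical data with a decaying spectrum (assumed for illustration).
rng = np.random.default_rng(0)
n, d, k = 300, 10, 4
X = rng.normal(size=(n, d)) / np.arange(1, d + 1)
X = X - X.mean(axis=0)
w_star = rng.normal(size=d)
y = X @ w_star + 0.5 * rng.normal(size=n)

w_pcr = pca_ols(X, y, k)
w_ridge = ridge(X, y, lam=1.0)
print("PCA + OLS risk:", fixed_design_risk(X, w_pcr, w_star))
print("ridge risk:    ", fixed_design_risk(X, w_ridge, w_star))
```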

preprint · 2013 · arXiv

A Spectral Algorithm for Latent Dirichlet Allocation

The problem of topic modeling can be seen as a generalization of the clustering problem, in that it posits that observations are generated due to multiple latent factors (e.g., the words in each document are generated as a mixture of several active topics, as opposed to just one). This increased representational power comes at the cost of a more challenging unsupervised learning problem of estimating the topic probability vectors (the distributions over words for each topic), when only the words are observed and the corresponding topics are hidden. We provide a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of mixture models, including the popular latent Dirichlet allocation (LDA) model. For LDA, the procedure correctly recovers both the topic probability vectors and the prior over the topics, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, termed Excess Correlation Analysis (ECA), is based on a spectral decomposition of low order moments (third and fourth order) via two singular value decompositions (SVDs). Moreover, the algorithm is scalable since the SVD operations are carried out on $k\times k$ matrices, where $k$ is the number of latent factors (e.g. the number of topics), rather than in the $d$-dimensional observed space (typically $d \gg k$).

preprint · 2012 · arXiv

Optimal Weighting of Multi-View Data with Low Dimensional Hidden States

In Natural Language Processing (NLP) tasks, data often have the following two properties. First, data can be split into multiple views, an approach that has been used successfully for dimension reduction. For example, in topic classification, every paper can be split into the title, the main text and the references. However, for supervised learning problems it is common that some views are less noisy than others. Second, unlabeled data are easy to obtain while labeled data are relatively rare. For example, articles that appeared in the New York Times over the last 10 years are easy to collect, but classifying them as 'Politics', 'Finance' or 'Sports' requires human labor. Hence less noisy features are preferred before running supervised learning methods. In this paper we propose an unsupervised algorithm which optimally weights features from different views when these views are generated from a low dimensional hidden state, which occurs in widely used models like the Gaussian Mixture Model, Hidden Markov Model (HMM) and Latent Dirichlet Allocation (LDA).

preprint · 2012 · arXiv

Spectral dimensionality reduction for HMMs

Hidden Markov Models (HMMs) can be accurately approximated using co-occurrence frequencies of pairs and triples of observations by using a fast spectral method in contrast to the usual slow methods like EM or Gibbs sampling. We provide a new spectral method which significantly reduces the number of model parameters that need to be estimated, and generates a sample complexity that does not depend on the size of the observation vocabulary. We present an elementary proof giving bounds on the relative accuracy of probability estimates from our model. (Corollaries show our bounds can be weakened to provide either L1 bounds or KL bounds which provide easier direct comparisons to previous work.) Our theorem uses conditions that are checkable from the data, instead of putting conditions on the unobservable Markov transition matrix.

preprint · 2011 · arXiv

Stochastic convex optimization with bandit feedback

This paper addresses the problem of minimizing a convex, Lipschitz function $f$ over a convex, compact set $\mathcal{X}$ under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value $f(x)$ at any query point $x \in \mathcal{X}$. The quantity of interest is the regret of the algorithm, which is the sum of the function values at the algorithm's query points minus the optimal function value. We demonstrate a generalization of the ellipsoid algorithm that incurs $\tilde{\mathcal{O}}(\texttt{poly}(d)\sqrt{T})$ regret. Since any algorithm has regret at least $\Omega(\sqrt{T})$ on this problem, our algorithm is optimal in terms of the scaling with $T$.
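Under the usual reading of the regret described above, and writing $\mathcal{X}$ for the feasible set, $x_1, \dots, x_T$ for the query points and $x^\star$ for a minimizer of $f$ (notation assumed here, not taken from the abstract), the cumulative regret after $T$ queries is

$$
R_T \;=\; \sum_{t=1}^{T} \bigl( f(x_t) - f(x^\star) \bigr), \qquad x^\star \in \arg\min_{x \in \mathcal{X}} f(x).
$$

The optimal value is subtracted once per round, so a bound of order $\sqrt{T}$ on $R_T$ means the average per-round suboptimality $R_T / T$ shrinks like $1/\sqrt{T}$.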

preprint · 2011 · arXiv

The effect of winning an Oscar Award on survival: Correcting for healthy performer survivor bias with a rank preserving structural accelerated failure time model

We study the causal effect of winning an Oscar Award on an actor or actress's survival. Does the increase in social rank from a performer winning an Oscar increase the performer's life expectancy? Previous studies of this issue have suffered from healthy performer survivor bias, that is, candidates who are healthier will be able to act in more films and have more chance to win Oscar Awards. To correct this bias, we adapt Robins' rank preserving structural accelerated failure time model and $g$-estimation method. We show in simulation studies that this approach corrects the bias contained in previous studies. We estimate that the effect of winning an Oscar Award on survival is 4.2 years, with a 95% confidence interval of $[-0.4,8.4]$ years. There is not strong evidence that winning an Oscar increases life expectancy.
