Source author record

Giles Hooker

Giles Hooker appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Methodology, Machine Learning, Applications, Computation, math.ST, Statistics Theory, Artificial Intelligence, stat.OT

Catalog footprint

What is connected

27 works

8 topics

4 close collaborators

preprint 2026 arXiv

Estimating Implicit Regularization in Deep Learning

Deep learning systems are known to exhibit implicit regularization (alt. implicit bias), favoring simple solutions instead of merely minimizing the loss function. In some cases, we can analytically derive the implicit regularization -- connecting it to an equivalent penalty that augments the learning objective. However, modern deep learning systems are complex, carrying modifications to the training procedure and architecture (e.g. early stopping, minibatching, dropout) whose effects are not always directly interpretable. Although estimating the resulting implicit regularization could aid theorists in algorithm design and practitioners in interpreting their hyperparameter choices, this problem has received little direct attention. It is also tractable: regularization makes weight updates deviate from loss gradients, promising a signal for identifying implicit bias. Here we provide gradient matching methods that can be used to empirically estimate the implicit regularization. Our method works on networks with known regularization, recovering popular explicit penalties like $\ell_1$ and $\ell_2$. It also replicates known implicit effects, like the quadratic weight penalty induced by early stopping in gradient descent, demonstrating that it can be used to test theories of implicit regularization. Crucially, because our method is empirical, it can handle implicit regularization in arbitrary networks. We demonstrate this use by characterizing the effects of dropout in deep networks, showing implicit $\ell_2$ effects in this popular method. Our work shows that practitioners can use gradient matching to understand regularization in networks with implicit biases that are too complicated to derive analytically.
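The signal the abstract describes (weight updates that deviate from loss gradients) can be illustrated with a toy case. The following is a minimal sketch, not the paper's method: it assumes plain gradient descent on a quadratic loss with an explicit $\ell_2$ penalty, and recovers the penalty strength by least squares. The names `estimate_l2_strength` and `simulate` are hypothetical.

```python
import random

def estimate_l2_strength(weights, updates, loss_grads, lr):
    """Least-squares fit of lam in: -update/lr - loss_grad = lam * w."""
    num = 0.0
    den = 0.0
    for w, dw, g in zip(weights, updates, loss_grads):
        for wi, dwi, gi in zip(w, dw, g):
            # Part of the step that the loss gradient alone does not explain.
            residual = -dwi / lr - gi
            num += residual * wi
            den += wi * wi
    return num / den

def simulate(lam=0.1, lr=0.05, steps=50, seed=0):
    """Gradient descent on 0.5*||w - target||^2 + 0.5*lam*||w||^2 (toy assumption)."""
    rng = random.Random(seed)
    target = [1.0, -2.0, 0.5]
    w = [rng.uniform(-1, 1) for _ in target]
    weights, updates, grads = [], [], []
    for _ in range(steps):
        g = [wi - ti for wi, ti in zip(w, target)]            # loss gradient only
        dw = [-lr * (gi + lam * wi) for gi, wi in zip(g, w)]  # step includes the penalty
        weights.append(list(w))
        updates.append(dw)
        grads.append(g)
        w = [wi + dwi for wi, dwi in zip(w, dw)]
    return weights, updates, grads

weights, updates, grads = simulate()
lam_hat = estimate_l2_strength(weights, updates, grads, lr=0.05)
```

In this noiseless setting `lam_hat` matches the simulated penalty strength up to floating-point error; with minibatching or dropout the residual would be noisy and a richer penalty family would be needed.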

preprint 2022 arXiv

The Infinitesimal Jackknife and Combinations of Models

The Infinitesimal Jackknife is a general method for estimating variances of parametric models, and more recently also for some ensemble methods. In this paper we extend the Infinitesimal Jackknife to estimate the covariance between any two models. This can be used to quantify uncertainty for combinations of models, or to construct test statistics for comparing different models or ensembles of models fitted using the same training dataset. Specific examples in this paper use boosted combinations of models like random forests and M-estimators. We also investigate its application on neural networks and ensembles of XGBoost models. We illustrate the efficacy of variance estimates through extensive simulations and its application to the Beijing Housing data, and demonstrate the theoretical consistency of the Infinitesimal Jackknife covariance estimate.
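As a rough illustration of the covariance idea, the sketch below applies the classical Infinitesimal Jackknife to two simple weighted estimators fitted on the same data. It is an assumption-laden toy, not the paper's ensemble extension: directional derivatives are taken by finite differences, and the estimators (`weighted_mean`, `weighted_line_at`) are hypothetical stand-ins for the models being compared.

```python
def influence_values(estimator, data, eps=1e-6):
    """Finite-difference directional derivatives of a weighted estimator.

    Each value approximates the change in the estimate when a small amount
    of weight is shifted from the empirical distribution onto observation i.
    """
    n = len(data)
    base_w = [1.0 / n] * n
    base = estimator(data, base_w)
    vals = []
    for i in range(n):
        w = [(1 - eps) * wj for wj in base_w]
        w[i] += eps
        vals.append((estimator(data, w) - base) / eps)
    return vals

def ij_covariance(est_a, est_b, data):
    """IJ covariance estimate: sum of products of influence values over n^2."""
    n = len(data)
    ua = influence_values(est_a, data)
    ub = influence_values(est_b, data)
    return sum(a * b for a, b in zip(ua, ub)) / n ** 2

def weighted_mean(data, w):
    return sum(wi * y for wi, (x, y) in zip(w, data))

def weighted_line_at(x0):
    """Weighted least-squares line, evaluated at x0."""
    def est(data, w):
        mx = sum(wi * x for wi, (x, y) in zip(w, data))
        my = sum(wi * y for wi, (x, y) in zip(w, data))
        sxy = sum(wi * (x - mx) * (y - my) for wi, (x, y) in zip(w, data))
        sxx = sum(wi * (x - mx) ** 2 for wi, (x, y) in zip(w, data))
        return my + (sxy / sxx) * (x0 - mx)
    return est

data = [(0.0, 1.1), (1.0, 1.9), (2.0, 3.2), (3.0, 3.8), (4.0, 5.1)]
var_mean = ij_covariance(weighted_mean, weighted_mean, data)
cov_mean_line = ij_covariance(weighted_mean, weighted_line_at(2.0), data)
```

Passing the same estimator twice gives the usual IJ variance; passing two different estimators gives the covariance needed for the variance of their sum or difference.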

preprint 2021 arXiv

Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning

In 2001, Leo Breiman wrote of a divide between "data modeling" and "algorithmic modeling" cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the "data modelers" incorporating algorithmic methods into their toolbox, particularly driven by recent developments in the statistical understanding of Breiman's own Random Forest methods. While this can be simplistically described as "Breiman won", these same developments also expose the limitations of the prediction-first philosophy that he espoused, making careful statistical analysis all the more important. This paper outlines these exciting recent developments in the random forest literature which, in our view, occurred as a result of a necessary blending of the two ways of thinking Breiman originally described. We also ask what areas statistics and statisticians might currently overlook.

preprint 2021 arXiv

Generalised Boosted Forests

This paper extends recent work on boosting random forests to model non-Gaussian responses. Given an exponential family $\mathbb{E}[Y|X] = g^{-1}(f(X))$ our goal is to obtain an estimate for $f$. We start with an MLE-type estimate in the link space and then define generalised residuals from it. We use these residuals and some corresponding weights to fit a base random forest and then repeat the same to obtain a boost random forest. We call the sum of these three estimators a generalised boosted forest. We show with simulated and real data that both random forest steps reduce test-set log-likelihood, which we treat as our primary metric. We also provide a variance estimator, which we can obtain with the same computational cost as the original estimate itself. Empirical experiments on real-world data and simulations demonstrate that the methods can effectively reduce bias, and that confidence interval coverage is conservative in the bulk of the covariate distribution.

preprint 2020 arXiv

$V$-statistics and Variance Estimation

This paper develops a general framework for analyzing asymptotics of $V$-statistics. Previous literature on limiting distribution mainly focuses on the cases when $n \to \infty$ with fixed kernel size $k$. Under some regularity conditions, we demonstrate asymptotic normality when $k$ grows with $n$ by utilizing existing results for $U$-statistics. The key in our approach lies in a mathematical reduction to $U$-statistics by designing an equivalent kernel for $V$-statistics. We also provide a unified treatment on variance estimation for both $U$- and $V$-statistics by observing connections to existing methods and proposing an empirically more accurate estimator. Ensemble methods such as random forests, where multiple base learners are trained and aggregated for prediction purposes, serve as a running example throughout the paper because they are a natural and flexible application of $V$-statistics.

preprint 2020 arXiv

Boosting Random Forests to Reduce Bias; One-Step Boosted Forest and its Variance Estimate

In this paper we propose using the principle of boosting to reduce the bias of a random forest prediction in the regression setting. From the original random forest fit we extract the residuals and then fit another random forest to these residuals. We call the sum of these two random forests a one-step boosted forest. We show with simulated and real data that the one-step boosted forest has a reduced bias compared to the original random forest. The paper also provides a variance estimate of the one-step boosted forest by an extension of the infinitesimal Jackknife estimator. Using this variance estimate we can construct prediction intervals for the boosted forest and we show that they have good coverage probabilities. Combining the bias reduction and the variance estimate we show that the one-step boosted forest has a significant reduction in predictive mean squared error and thus an improvement in predictive performance. When applied to datasets from the UCI database, the one-step boosted forest performs better than random forest and gradient boosting machine algorithms. Theoretically we can also extend such a boosting process to more than one step and the same principles outlined in this paper can be used to find variance estimates for such predictors. Such boosting will reduce bias even further but it risks over-fitting and also increases the computational burden.
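The fit, residual, refit, and sum steps can be sketched with any base learner. The example below is an illustration only: it swaps the random forest for a single least-squares regression stump (`fit_stump`, a hypothetical helper that assumes distinct one-dimensional inputs) and checks training error rather than out-of-sample bias.

```python
def fit_stump(xs, ys):
    """Least-squares regression stump on distinct 1-D inputs; returns a predictor."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for cut in range(1, len(order)):
        left = [ys[i] for i in order[:cut]]
        right = [ys[i] for i in order[cut:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (xs[order[cut - 1]] + xs[order[cut]]) / 2
            best = (sse, threshold, ml, mr)
    _, threshold, ml, mr = best
    return lambda x: ml if x <= threshold else mr

def one_step_boost(xs, ys, fit=fit_stump):
    """Fit a base model, fit a second model to its residuals, return both predictors."""
    base = fit(xs, ys)
    residuals = [y - base(x) for x, y in zip(xs, ys)]
    correction = fit(xs, residuals)
    return base, (lambda x: base(x) + correction(x))

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x * x for x in xs]          # a curve a single stump underfits
base, boosted = one_step_boost(xs, ys)
```

On this toy curve the summed predictor has lower training error than the base stump. The variance estimate and prediction intervals described in the abstract are not covered by this sketch.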

preprint 2020 arXiv

Purifying Interaction Effects with the Functional ANOVA: An Efficient Algorithm for Recovering Identifiable Additive Models

Models which estimate main effects of individual variables alongside interaction effects have an identifiability challenge: effects can be freely moved between main effects and interaction effects without changing the model prediction. This is a critical problem for interpretability because it permits "contradictory" models to represent the same function. To solve this problem, we propose pure interaction effects: variance in the outcome which cannot be represented by any smaller subset of features. This definition has an equivalence with the Functional ANOVA decomposition. To compute this decomposition, we present a fast, exact algorithm that transforms any piecewise-constant function (such as a tree-based model) into a purified, canonical representation. We apply this algorithm to Generalized Additive Models with interactions trained on several datasets and show large disparity, including contradictions, between the effects before and after purification. These results underscore the need to specify data distributions and ensure identifiability before interpreting model parameters.
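To make the identifiability problem concrete, here is a minimal sketch of purification for a two-feature piecewise-constant model stored as an intercept, two main-effect vectors, and an interaction table. It assumes a uniform distribution over grid cells, in which case one pass of row and column centering suffices; non-uniform weights would need weighted means and may need repeated passes. The function names are hypothetical and this is not the paper's implementation.

```python
def purify(intercept, main_a, main_b, inter):
    """Move mass out of the interaction table so every row and column mean is zero."""
    na, nb = len(main_a), len(main_b)
    inter = [row[:] for row in inter]
    main_a, main_b = main_a[:], main_b[:]
    # Move each row mean of the interaction into the first main effect.
    for i in range(na):
        m = sum(inter[i]) / nb
        main_a[i] += m
        inter[i] = [v - m for v in inter[i]]
    # Move each column mean of the interaction into the second main effect.
    for j in range(nb):
        m = sum(inter[i][j] for i in range(na)) / na
        main_b[j] += m
        for i in range(na):
            inter[i][j] -= m
    # Center the main effects, moving their means into the intercept.
    for main in (main_a, main_b):
        m = sum(main) / len(main)
        intercept += m
        main[:] = [v - m for v in main]
    return intercept, main_a, main_b, inter

def predict(model, i, j):
    intercept, main_a, main_b, inter = model
    return intercept + main_a[i] + main_b[j] + inter[i][j]

# Two different parameterizations of the same function f = [[0, 0], [1, 3]].
model_1 = (0.0, [0.0, 1.0], [0.0, 0.0], [[0.0, 0.0], [0.0, 2.0]])
model_2 = (0.0, [0.0, 0.0], [0.0, 0.0], [[0.0, 0.0], [1.0, 3.0]])
pure_1 = purify(*model_1)
pure_2 = purify(*model_2)
```

Both inputs produce the same predictions but attribute them differently; after purification they collapse to one canonical representation.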

preprint 2020 arXiv

Selecting the Derivative of a Functional Covariate in Scalar-on-Function Regression

This paper presents tests to formally choose between regression models using different derivatives of a functional covariate in scalar-on-function regression. We demonstrate that for linear regression, models using different derivatives can be nested within a model that includes point-impact effects at the end-points of the observed functions. Contrasts can then be employed to test the specification of different derivatives. When nonlinear regression models are defined, we apply a $J$ test to determine the statistical significance of the nonlinear structure between a functional covariate and a scalar response. The finite-sample performance of these methods is verified in simulation, and their practical application is demonstrated using a chemometric data set.

preprint 2020 arXiv

Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable

Ensembles of decision trees perform well on many problems, but are not interpretable. In contrast to existing approaches in interpretability that focus on explaining relationships between features and predictions, we propose an alternative approach to interpret tree ensemble classifiers by surfacing representative points for each class -- prototypes. We introduce a new distance for Gradient Boosted Tree models, and propose new, adaptive prototype selection methods with theoretical guarantees, with the flexibility to choose a different number of prototypes in each class. We demonstrate our methods on random forests and gradient boosted trees, showing that the prototypes can perform as well as or even better than the original tree ensemble when used as a nearest-prototype classifier. In a user study, humans were better at predicting the output of a tree ensemble classifier when using prototypes than when using Shapley values, a popular feature attribution method. Hence, prototypes present a viable alternative to feature-based explanations for tree ensembles.

preprint 2020 arXiv

Unbiased Measurement of Feature Importance in Tree-Based Methods

We propose a modification that corrects for split-improvement variable importance measures in Random Forests and other tree-based methods. These methods have been shown to be biased towards increasing the importance of features with more potential splits. We show that by appropriately incorporating split-improvement as measured on out-of-sample data, this bias can be corrected, yielding better summaries and screening tools.

preprint 2016 arXiv

Adapted Variational Bayes for Functional Data Registration, Smoothing, and Prediction

We propose a model for functional data registration that compares favorably to the best methods of functional data registration currently available. It also extends current inferential capabilities for unregistered data by providing a flexible probabilistic framework that 1) allows for functional prediction in the context of registration and 2) can be adapted to include smoothing and registration in one model. The proposed inferential framework is a Bayesian hierarchical model where the registered functions are modeled as Gaussian processes. To address the computational demands of inference in high-dimensional Bayesian models, we propose an adapted form of the variational Bayes algorithm for approximate inference that performs similarly to MCMC sampling methods for well-defined problems. The efficiency of the adapted variational Bayes (AVB) algorithm allows variability in a predicted registered, warping, and unregistered function to be depicted separately via bootstrapping. Temperature data related to the El Niño phenomenon is used to demonstrate the unique inferential capabilities for prediction provided by this model.

preprint 2016 arXiv

Consistency, efficiency and robustness of conditional disparity methods

This paper considers extensions of minimum-disparity estimators to the problem of estimating parameters in a regression model that is conditionally specified; that is, where a parametric model describes the distribution of a response $y$ conditional on covariates $x$ but does not specify the distribution of $x$. We define these estimators by constructing a non-parametric conditional density estimate and minimizing a disparity between this estimate and the parametric model averaged over values of $x$. The consistency and asymptotic normality of such estimators are demonstrated for a broad class of models in which response and covariate vectors can take both discrete and continuous values, and which incorporates a wide set of choices for kernel-based conditional density estimation. We also establish the robustness of these estimators for a broad class of disparities. As has been observed in Tamura and Boos (J. Amer. Statist. Assoc. 81 (1986) 223--229), minimum disparity estimators incorporating kernel density estimates of more than one dimension can result in an asymptotic bias that is larger than $n^{-1/2}$; we characterize a similar bias in our results and show that in specialized cases it can be eliminated by appropriately centering the kernel density estimate. We also demonstrate empirically that bootstrap methods can be employed to reduce this bias and to provide robust confidence intervals. In order to demonstrate these results, we establish a set of $L_1$-consistency results for kernel-based estimates of centered conditional densities.

preprint 2016 arXiv

Formal Hypothesis Tests for Additive Structure in Random Forests

While statistical learning methods have proved powerful tools for predictive modeling, the black-box nature of the models they produce can severely limit their interpretability and the ability to conduct formal inference. However, the natural structure of ensemble learners like bagged trees and random forests has been shown to admit desirable asymptotic properties when base learners are built with proper subsamples. In this work, we demonstrate that by defining an appropriate grid structure on the covariate space, we may carry out formal hypothesis tests for both variable importance and underlying additive model structure. To our knowledge, these tests represent the first statistical tools for investigating the underlying regression structure in a context such as random forests. We develop notions of total and partial additivity and further demonstrate that testing can be carried out at no additional computational cost by estimating the variance within the process of constructing the ensemble. Furthermore, we propose a novel extension of these testing procedures utilizing random projections in order to allow for computationally efficient testing procedures that retain high power even when the grid size is much larger than that of the training set.

preprint 2016 arXiv

Interpreting Models via Single Tree Approximation

We propose a procedure to build a decision tree which approximates the performance of complex machine learning models. This single approximation tree can be used to interpret and simplify the predicting pattern of random forests (RFs) and other models. The use of a tree structure is particularly relevant in medical questionnaires where it enables an adaptive shortening of the questionnaire, reducing response burden. We study the asymptotic behavior of splits and introduce an improved splitting method designed to stabilize tree structure. Empirical studies on both simulation and real data sets illustrate that our method can simultaneously achieve high approximation power and stability.

preprint 2015 arXiv

Bootstrap Bias Corrections for Ensemble Methods

This paper examines the use of a residual bootstrap for bias correction in machine learning regression methods. Accounting for bias is an important obstacle in recent efforts to develop statistical inference for machine learning methods. We demonstrate empirically that the proposed bootstrap bias correction can lead to substantial improvements in both bias and predictive accuracy. In the context of ensembles of trees, we show that this correction can be approximated at only double the cost of training the original ensemble without introducing additional variance. Our method is shown to improve test-set accuracy over random forests by up to 70% on example problems from the UCI repository.

preprint 2015 arXiv

Combining Functional Data Registration and Factor Analysis

We extend the definition of functional data registration to encompass a larger class of registered functions. In contrast to traditional registration models, we allow for registered functions that have more than one primary direction of variation. The proposed Bayesian hierarchical model simultaneously registers the observed functions and estimates the two primary factors that characterize variation in the registered functions. Each registered function is assumed to be predominantly composed of a linear combination of these two primary factors, and the function-specific weights for each observation are estimated within the registration model. We show how these estimated weights can easily be used to classify functions after registration using both simulated data and a juggling data set.

preprint 2015 arXiv

Goodness of fit in nonlinear dynamics: Misspecified rates or misspecified states?

This paper introduces diagnostic tests for the nature of lack of fit in ordinary differential equation models (ODEs) proposed for data. We present a hierarchy of three possible sources of lack of fit: unaccounted-for stochastic variation, misspecification of functional forms in rate equations, and omission of dynamic variables in the description of the system. We represent lack of fit by allowing a parameter vector to vary over time, and propose generic testing procedures that do not rely on specific alternative models. Instead, different sources for lack of fit are characterized in terms of nonparametric relationships among latent variables. The tests are carried out through a combination of residual bootstrap and permutation methods. We demonstrate the effectiveness of these tests on simulated data and on real data from laboratory ecological experiments and electrocardiogram recordings.

preprint 2015 arXiv

Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests

This work develops formal statistical inference procedures for machine learning ensemble methods. Ensemble methods based on bootstrapping, such as bagging and random forests, have improved the predictive accuracy of individual trees, but fail to provide a framework in which distributional results can be easily determined. Instead of aggregating full bootstrap samples, we consider predicting by averaging over trees built on subsamples of the training set and demonstrate that the resulting estimator takes the form of a U-statistic. As such, predictions for individual feature vectors are asymptotically normal, allowing for confidence intervals to accompany predictions. In practice, a subset of subsamples is used for computational speed; here our estimators take the form of incomplete U-statistics and equivalent results are derived. We further demonstrate that this setup provides a framework for testing the significance of features. Moreover, the internal estimation method we develop allows us to estimate the variance parameters and perform these inference procedures at no additional computational cost. Simulations and illustrations on a real dataset are provided.
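The U-statistic structure can be seen in a small sketch: averaging a base learner over every size-$k$ subsample gives a complete U-statistic, while averaging over a random subset of subsamples gives the incomplete version used in practice. The example is illustrative only and assumes a 1-nearest-neighbour predictor (`one_nn`, hypothetical) as a stand-in for a tree; the variance estimation described in the abstract is not included.

```python
import itertools
import random

def one_nn(subsample, x0):
    """Kernel: predict at x0 using the response of the nearest subsample point."""
    return min(subsample, key=lambda p: abs(p[0] - x0))[1]

def complete_u_stat(data, k, x0, kernel=one_nn):
    """Average the kernel over all size-k subsamples (feasible only for small n)."""
    subsets = list(itertools.combinations(data, k))
    return sum(kernel(s, x0) for s in subsets) / len(subsets)

def incomplete_u_stat(data, k, x0, m, seed=0, kernel=one_nn):
    """Average the kernel over m random size-k subsamples drawn without replacement."""
    rng = random.Random(seed)
    return sum(kernel(rng.sample(data, k), x0) for _ in range(m)) / m

data = [(0.0, 1.0), (1.0, 2.0), (2.0, 4.0), (3.0, 3.0), (4.0, 5.0)]
x0 = 2.2
full = complete_u_stat(data, k=3, x0=x0)
approx = incomplete_u_stat(data, k=3, x0=x0, m=200)
```

Two sanity checks follow from the definition: with `k=1` the complete statistic is the mean response, and with `k` equal to the sample size it is the base learner fitted on all the data.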

preprint 2014 arXiv

Control Theory and Experimental Design in Diffusion Processes

This paper considers the problem of designing time-dependent, real-time control policies for controllable nonlinear diffusion processes, with the goal of obtaining maximally-informative observations about parameters of interest. More precisely, we maximize the expected Fisher information for the parameter obtained over the duration of the experiment, conditional on observations made up to that time. We propose to accomplish this with a two-step strategy: when the full state vector of the diffusion process is observable continuously, we formulate this as an optimal control problem and apply numerical techniques from stochastic optimal control to solve it. When observations are incomplete, infrequent, or noisy, we propose using standard filtering techniques to first estimate the state of the system, then apply the optimal control policy using the posterior expectation of the state. We assess the effectiveness of these methods in 3 situations: a paradigmatic bistable model from statistical physics, a model of action potential generation in neurons, and a model of a simple ecological system.

preprint 2014 arXiv

Functional Principal Components Analysis of Spatially Correlated Data

This paper focuses on the analysis of spatially correlated functional data. The between-curve correlation is modeled by correlating functional principal component scores of the functional data. We propose a Spatial Principal Analysis by Conditional Expectation framework to explicitly estimate spatial correlations and reconstruct individual curves. This approach works even when the observed data per curve are sparse. Assuming spatial stationarity, empirical spatial correlations are calculated as the ratio of eigenvalues of the smoothed covariance surface $Cov(X_i(s),X_i(t))$ and cross-covariance surface $Cov(X_i(s), X_j(t))$ at locations indexed by $i$ and $j$. Then an anisotropic Matérn spatial correlation model is fit to the empirical correlations. Finally, principal component scores are estimated to reconstruct the sparsely observed curves. This framework can naturally accommodate arbitrary covariance structures, but there is an enormous reduction in computation if one can assume the separability of temporal and spatial components. We propose hypothesis tests to examine the separability as well as the isotropy effect of spatial correlation. Simulation studies and applications to empirical data show improvements in the curve reconstruction using our framework over the method where curves are assumed to be independent. In addition, we show that the asymptotic properties of estimates in the uncorrelated case still hold in our case if 'mild' spatial correlation is assumed.

preprint 2014 arXiv

Maximal Autocorrelation Functions in Functional Data Analysis

This paper proposes a new factor rotation for the context of functional principal components analysis. This rotation seeks to re-represent a functional subspace in terms of directions of decreasing smoothness as represented by a generalized smoothing metric. The rotation can be implemented simply and we show on two examples that this rotation can improve the interpretability of the leading components.

preprint 2014 arXiv

Truncated Linear Models for Functional Data

A conventional linear model for functional data involves expressing a response variable $Y$ in terms of the explanatory function $X(t)$, via the model: $Y=a+\int_I b(t)X(t)\,dt+\text{error}$, where $a$ is a scalar, $b$ is an unknown function and $I=[0,\alpha]$ is a compact interval. However, in some problems the support of $b$ or $X$, $I_1$ say, is a proper and unknown subset of $I$, and is a quantity of particular practical interest. In this paper, motivated by a real-data example involving particulate emissions, we develop methods for estimating $I_1$. We give particular emphasis to the case $I_1=[0,\theta]$, where $\theta\in(0,\alpha]$, and suggest two methods for estimating $a$, $b$ and $\theta$ jointly; we introduce techniques for selecting tuning parameters; and we explore properties of our methodology using both simulation and the real-data example mentioned above. Additionally, we derive theoretical properties of the methodology, and discuss implications of the theory. Our theoretical arguments give particular emphasis to the problem of identifiability.

preprint 2013 arXiv

Hellinger Distance and Bayesian Non-Parametrics: Hierarchical Models for Robust and Efficient Bayesian Inference

This paper introduces a hierarchical framework to incorporate Hellinger distance methods into Bayesian analysis. We propose to modify a prior over non-parametric densities with the exponential of twice the Hellinger distance between a candidate and a parametric density. By incorporating a prior over the parameters of the second density, we arrive at a hierarchical model in which a non-parametric model is placed between parameters and the data. The parameters of the family can then be estimated as hyperparameters in the model. In frequentist estimation, minimizing the Hellinger distance between a kernel density estimate and a parametric family has been shown to produce estimators that are both robust to outliers and statistically efficient when the parametric model is correct. In this paper, we demonstrate that the same results are applicable when a non-parametric Bayes density estimate replaces the kernel density estimate. We then demonstrate that robustness and efficiency also hold for the proposed hierarchical model. The finite-sample behavior of the resulting estimates is investigated by simulation and on real world data.

preprint 2013 arXiv

On the Identifiability of the Functional Convolution Model

This report details conditions under which the Functional Convolution Model described in \citet{AHG13} can be identified from Ordinary Least Squares estimates without either dimension reduction or smoothing penalties. We demonstrate that if the covariate functions are not spanned by the space of solutions to linear differential equations, the functional coefficients in the model are uniquely determined in the Sobolev space of functions with absolutely continuous second derivatives.

preprint 2013 arXiv

Restricted Likelihood Ratio Tests for Linearity in Scalar-on-Function Regression

We propose a procedure for testing the linearity of a scalar-on-function regression relationship. To do so, we use the functional generalized additive model (FGAM), a recently developed extension of the functional linear model. For a functional covariate $X(t)$, the FGAM models the mean response as the integral with respect to $t$ of $F\{X(t),t\}$, where $F$ is an unknown bivariate function. The FGAM can be viewed as the natural functional extension of generalized additive models. We show how the functional linear model can be represented as a simple mixed model nested within the FGAM. Using this representation, we then consider restricted likelihood ratio tests for zero variance components in mixed models to test the null hypothesis that the functional linear model holds. The methods are general and can also be applied to testing for interactions in a multivariate additive model or for testing for no effect in the functional linear model. The performance of the proposed tests is assessed on simulated data and in an application to measuring diesel truck emissions, where strong evidence of nonlinearities in the relationship between the functional predictor and the response is found.

preprint 2012 arXiv

Bayesian Model Robustness via Disparities

This paper develops a methodology for robust Bayesian inference through the use of disparities. Metrics such as Hellinger distance and negative exponential disparity have a long history in robust estimation in frequentist inference. We demonstrate that an equivalent robustification may be made in Bayesian inference by substituting an appropriately scaled disparity for the log likelihood, to which standard Markov chain Monte Carlo methods may be applied. A particularly appealing property of minimum-disparity methods is that while they yield robustness with a breakdown point of 1/2, the resulting parameter estimates are also efficient when the posited probabilistic model is correct. We demonstrate that a similar property holds for disparity-based Bayesian inference. We further show that in the Bayesian setting, it is also possible to extend these methods to robustify regression models, random effects distributions and other hierarchical models. The methods are demonstrated on real world data.

preprint 2012 arXiv

Functional factor analysis for periodic remote sensing data

We present a new approach to factor rotation for functional data. This is achieved by rotating the functional principal components toward a predefined space of periodic functions designed to decompose the total variation into components that are nearly-periodic and nearly-aperiodic with a predefined period. We show that the factor rotation can be obtained by calculation of canonical correlations between appropriate spaces, which makes the methodology computationally efficient. Moreover, we demonstrate that our proposed rotations provide stable and interpretable results in the presence of highly complex covariance. This work is motivated by the goal of finding interpretable sources of variability in gridded time series of vegetation index measurements obtained from remote sensing, and we demonstrate our methodology through an application of factor rotation to this data.
