Source author record

Jeffrey S. Simonoff

Jeffrey S. Simonoff appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Applications, Methodology, Machine Learning

Catalog footprint

5 works

3 topics

4 close collaborators


Preprint, 2022, arXiv

Ensemble methods for survival function estimation with time-varying covariates

Survival data with time-varying covariates are common in practice. When relevant, such covariates can improve estimation of the survival function. However, the traditional survival forests (conditional inference forest, relative risk forest and random survival forest) have accommodated only time-invariant covariates. We generalize the conditional inference and relative risk forests to allow time-varying covariates. We also propose a general framework for estimation of a survival function in the presence of time-varying covariates. We compare their performance with that of the Cox model and transformation forest, adapted here to accommodate time-varying covariates, through a comprehensive simulation study in which the Kaplan-Meier estimate serves as a benchmark, and performance is compared using the integrated L2 difference between the true and estimated survival functions. In general, the performance of the two proposed forests substantially improves over the Kaplan-Meier estimate. Taking into account all other factors, under the proportional hazards (PH) setting, the best method is always one of the two proposed forests, while under the non-PH setting, it is the adapted transformation forest. K-fold cross-validation is used as an effective tool to choose between the methods in practice.
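To make the evaluation metric in this abstract concrete, the following is a minimal sketch of a Kaplan-Meier estimate and an integrated L2 difference between a true and an estimated survival function. The function names, the toy data, the exponential "true" survival curve, and the normalization by the length of the time window are all assumptions made for illustration; the paper's exact definition may differ.

```python
import numpy as np

def kaplan_meier(times, events):
    """Return the distinct event times and the KM survival estimate at each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)             # subjects still under observation
        deaths = np.sum((times == t) & events)   # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def step_eval(event_times, surv, grid):
    """Evaluate the right-continuous KM step function on a time grid."""
    idx = np.searchsorted(event_times, grid, side="right")
    return np.concatenate(([1.0], surv))[idx]

def integrated_l2(true_surv, est_surv, grid):
    """Trapezoidal approximation of the mean squared difference over the grid."""
    sq = (np.asarray(true_surv) - np.asarray(est_surv)) ** 2
    area = np.sum((sq[1:] + sq[:-1]) * np.diff(grid)) / 2.0
    return area / (grid[-1] - grid[0])

# Hypothetical toy data: observed times and event indicators (1 = event, 0 = censored).
times = [1, 2, 2, 3, 4]
events = [1, 1, 0, 1, 0]
event_times, surv = kaplan_meier(times, events)

grid = np.linspace(0.0, 4.0, 401)
km_on_grid = step_eval(event_times, surv, grid)
true_on_grid = np.exp(-0.3 * grid)  # assumed true survival curve, for illustration only
print(integrated_l2(true_on_grid, km_on_grid, grid))
```

In a simulation study the true survival function is known by construction, so the same metric can be computed for any estimator that returns a survival curve on the grid.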

Preprint, 2021, arXiv

Dynamic estimation with random forests for discrete-time survival data

Time-varying covariates are often available in survival studies and estimation of the hazard function needs to be updated as new information becomes available. In this paper, we investigate several different easy-to-implement ways that random forests can be used for dynamic estimation of the survival or hazard function from discrete-time survival data. The results from a simulation study indicate that all methods can perform well, and that none dominates the others. In general, situations that are more difficult from an estimation point of view (such as weaker signals and less data) favor a global fit, pooling over all time points, while situations that are easier from an estimation point of view (such as stronger signals and more data) favor local fits.
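As a rough illustration of the discrete-time setup described here, the sketch below expands subject-level data into person-period rows (one row per subject per period at risk) and converts period-specific hazards into a survival curve via S(t) = prod over j <= t of (1 - h_j). A global fit would train one classifier, such as a random forest, on all pooled rows with the period as a feature, while a local fit would train a separate model per period. Here a plain empirical hazard stands in for the forest, and all function names and data are hypothetical.

```python
import numpy as np

def person_period(times, events):
    """Expand (time, event) pairs into rows (subject, period, y).

    y is 1 only in the period where the subject's event occurs.
    """
    rows = []
    for i, (t, d) in enumerate(zip(times, events)):
        for j in range(1, int(t) + 1):
            rows.append((i, j, int(bool(d) and j == t)))
    return np.array(rows)

def empirical_hazard(pp, n_periods):
    """Per-period hazard: events in the period divided by subjects at risk."""
    h = np.zeros(n_periods)
    for j in range(1, n_periods + 1):
        in_period = pp[:, 1] == j
        if in_period.any():
            h[j - 1] = pp[in_period, 2].mean()
    return h

def survival_from_hazard(h):
    """Discrete-time survival: cumulative product of (1 - hazard)."""
    return np.cumprod(1.0 - np.asarray(h))

# Hypothetical toy data: last observed period and event indicator per subject.
times = [1, 2, 2, 3, 3]
events = [1, 1, 0, 1, 0]
pp = person_period(times, events)
hazard = empirical_hazard(pp, n_periods=3)
print(hazard)
print(survival_from_hazard(hazard))
```

With time-varying covariates, each person-period row would also carry the covariate values observed in that period, which is what lets the estimate be updated as new information arrives.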

Preprint, 2021, arXiv

On the Use of Information Criteria for Subset Selection in Least Squares Regression

Least squares (LS)-based subset selection methods are popular in linear regression modeling. Best subset selection (BS) is known to be NP-hard and has a computational cost that grows exponentially with the number of predictors. Recently, Bertsimas et al. (2016) formulated BS as a mixed integer optimization (MIO) problem and greatly reduced the computational overhead by using a well-developed optimization solver, but the current methodology is not scalable to very large datasets. In this paper, we propose a novel LS-based method, the best orthogonalized subset selection (BOSS) method, which performs BS upon an orthogonalized basis of ordered predictors and scales easily to large problem sizes. Another challenge in applying LS-based methods in practice is the selection rule to choose the optimal subset size k. Cross-validation (CV) requires fitting a procedure multiple times, and results in a selected k that is random across repeated applications to the same dataset. Compared to CV, information criteria only require fitting a procedure once, but they require knowledge of the effective degrees of freedom for the fitting procedure, which is generally not available analytically for complex methods. Since BOSS uses orthogonalized predictors, we first explore a connection for orthogonal non-random predictors between BS and its Lagrangian formulation (i.e., minimization of the residual sum of squares plus the product of a regularization parameter and k), and based on this connection propose a heuristic degrees of freedom (hdf) for BOSS that can be estimated via an analytically-based expression. We show in both simulations and real data analysis that BOSS using a proposed Kullback-Leibler based information criterion AICc-hdf has the strongest performance of all of the LS-based methods considered and is competitive with regularization methods, with the computational effort of a single ordinary LS fit.
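The reason orthogonalization makes best subset selection cheap can be sketched briefly: on an orthonormal basis, the best subset of size k simply keeps the k basis vectors with the largest squared coefficients, so the whole path comes from one QR decomposition. The sketch below is not the BOSS procedure itself. It takes the predictors in their given column order (an assumption; the abstract refers to ordered predictors without fixing the ordering here), and it plugs a naive parameter count of k + 1 into the least-squares AICc form n log(RSS/n) + n(n + df)/(n - df - 2) rather than the heuristic degrees of freedom (hdf) proposed in the paper. Function names and the simulated data are hypothetical.

```python
import numpy as np

def orthogonal_subset_path(X, y):
    """RSS of the best size-k subset on an orthonormal basis, for k = 0..p."""
    Xc = X - X.mean(axis=0)          # centering absorbs the intercept
    yc = y - y.mean()
    Q, _ = np.linalg.qr(Xc)          # reduced QR: Q has orthonormal columns
    z = Q.T @ yc                     # coefficients on the orthonormal basis
    order = np.argsort(-z ** 2)      # largest squared coefficients first
    tss = float(yc @ yc)
    rss = tss - np.concatenate(([0.0], np.cumsum(z[order] ** 2)))
    return order, rss

def aicc(rss, n, df):
    """Least-squares AICc with df regression parameters."""
    return n * np.log(rss / n) + n * (n + df) / (n - df - 2)

# Hypothetical simulated data with two active predictors out of six.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)

order, rss = orthogonal_subset_path(X, y)
# Naive df = k slopes + 1 intercept; this is NOT the paper's hdf.
scores = np.array([aicc(rss[k], n, k + 1) for k in range(p + 1)])
k_hat = int(np.argmin(scores))
print(k_hat, order[:k_hat])
```

Because the subset is chosen after looking at the response, the naive count understates the effective degrees of freedom, which is the gap the paper's hdf is designed to address.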

Preprint, 2020, arXiv

Hot Racquet or Not? An Exploration of Momentum in Grand Slam Tennis Matches

The presence of momentum in sports, where the outcome of a previous event affects a following event, is a belief often held by fans, and has been the subject of statistical study for many sports. This paper investigates the presence and manifestation of momentum in Grand Slam tennis matches from 2014 through 2019 for men and women, to see if there is evidence of any carryover effect from the outcome of previous point(s)/game(s)/set(s) to the current one that cannot be accounted for by player quality, fitness or fatigue. Generalized linear mixed effect models (GLMMs) are used to explore the effects of the outcomes of previous sets, games, or points on the odds of winning a set, game, or point, while incorporating control variables that account for differences in player quality and the current status of the match. We find strong evidence of carryover effects at the set, game, and point level. Holding one's serve in prior service games is strongly related to winning a current game, but losing a past game is associated with higher estimated odds of winning the next game. Winning the previous two or three points in a row is associated with the highest estimated odds of winning the next point.

Preprint, 2020, arXiv

Selection of Regression Models under Linear Restrictions for Fixed and Random Designs

Many important modeling tasks in linear regression, including variable selection (in which slopes of some predictors are set equal to zero) and simplified models based on sums or differences of predictors (in which slopes of those predictors are set equal to each other, or the negative of each other, respectively), can be viewed as being based on imposing linear restrictions on regression parameters. In this paper, we discuss how such models can be compared using information criteria designed to estimate predictive measures like squared error and Kullback-Leibler (KL) discrepancy, in the presence of either deterministic predictors (fixed-X) or random predictors (random-X). We extend the justifications for existing fixed-X criteria Cp, FPE and AICc, and random-X criteria Sp and RCp, to general linear restrictions. We further propose and justify a KL-based criterion, RAICc, under random-X for variable selection and general linear restrictions. We show in simulations that the use of the KL-based criteria AICc and RAICc results in better predictive performance and sparser solutions than the use of squared error-based criteria, including cross-validation.
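The idea of treating model simplifications as linear restrictions R beta = r can be illustrated with the standard restricted least squares estimator, beta_R = beta_hat - (X'X)^{-1} R' [R (X'X)^{-1} R']^{-1} (R beta_hat - r). The sketch below fits an equal-slopes restriction and compares it with the unrestricted fit using an AICc of the usual least-squares form, with the parameter count reduced by the number of restrictions m. That parameter count, the function names, and the simulated data are assumptions for illustration; the paper's criteria (including RAICc for random-X) are not reproduced here.

```python
import numpy as np

def restricted_ls(X, y, R, r):
    """Least squares estimate of beta subject to R @ beta = r."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                 # unrestricted estimate
    A = R @ XtX_inv @ R.T
    correction = XtX_inv @ R.T @ np.linalg.solve(A, R @ beta - r)
    return beta - correction

def aicc_restricted(rss, n, p, m):
    """AICc with p coefficients and m linear restrictions (assumed df = p - m)."""
    df = p - m
    return n * np.log(rss / n) + n * (n + df) / (n - df - 2)

# Hypothetical data in which the two slopes are truly equal.
rng = np.random.default_rng(0)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

R = np.array([[0.0, 1.0, -1.0]])   # restriction: slope of x1 equals slope of x2
r = np.array([0.0])

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_res = restricted_ls(X, y, R, r)
rss_full = float(np.sum((y - X @ beta_full) ** 2))
rss_res = float(np.sum((y - X @ beta_res) ** 2))

print(aicc_restricted(rss_full, n, p=3, m=0))
print(aicc_restricted(rss_res, n, p=3, m=1))
```

Variable selection fits the same template: setting a slope to zero corresponds to a row of R with a single 1 in that slope's position and r = 0.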
