Source author record

Benjamin S. Baumer

Benjamin S. Baumer appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Computation; stat.OT; Applications; cs.CY; Computational Engineering, Finance, and Science; Machine Learning; Methodology; Other Computer Science

Catalog footprint

What is connected

8 works

8 topics

4 close collaborators

preprint, 2026, arXiv

tidychangepoint: A Unified Framework for Analyzing Changepoint Detection in Univariate Time Series

We present tidychangepoint, a new R package for changepoint detection analysis. Most R packages for segmenting univariate time series focus on providing one or two algorithms for changepoint detection that work with a small set of models and penalized objective functions, and all of them return a custom, nonstandard object type. This makes comparing results across various algorithms, models, and penalized objective functions unnecessarily difficult. tidychangepoint solves this problem by wrapping functions from a variety of existing packages and storing the results in a common S3 class called tidycpt. The package then provides functionality for easily extracting comparable numeric or graphical information from a tidycpt object, all in a tidyverse-compliant framework. tidychangepoint is versatile: it supports both deterministic algorithms like PELT (from changepoint), and also flexible, randomized, genetic algorithms (via GA) that -- via new functionality built into tidychangepoint -- can be used with any compliant model-fitting function and any penalized objective function. By bringing all of these disparate tools together in a cohesive fashion, tidychangepoint facilitates comparative analysis of changepoint detection algorithms and models.
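To make the phrase "penalized objective function" concrete, here is a minimal Python sketch of exact changepoint search by optimal partitioning, the dynamic program that PELT speeds up by pruning. It is not tidychangepoint's interface or the authors' code: the mean-shift cost (within-segment sum of squared deviations), the function names, and the fixed per-segment penalty are all assumptions chosen for illustration.

```python
from itertools import accumulate


def segment_cost(prefix, prefix_sq, start, end):
    """Sum of squared deviations from the segment mean on x[start:end]."""
    n = end - start
    s = prefix[end] - prefix[start]
    s2 = prefix_sq[end] - prefix_sq[start]
    return s2 - s * s / n


def optimal_partitioning(x, penalty):
    """Return changepoint indices minimizing total cost + penalty per changepoint.

    Hypothetical helper: a changepoint index t means a new segment starts at x[t].
    """
    n = len(x)
    # Prefix sums let each segment cost be computed in constant time.
    prefix = [0.0] + list(accumulate(x))
    prefix_sq = [0.0] + list(accumulate(v * v for v in x))

    # best[end] = minimal penalized cost of segmenting x[:end].
    # Starting at -penalty means the first segment is not charged as a changepoint.
    best = [float("inf")] * (n + 1)
    best[0] = -penalty
    last = [0] * (n + 1)  # start index of the final segment in the best solution

    for end in range(1, n + 1):
        for start in range(end):
            cand = best[start] + segment_cost(prefix, prefix_sq, start, end) + penalty
            if cand < best[end]:
                best[end] = cand
                last[end] = start

    # Walk back through the stored segment starts to recover the changepoints.
    changepoints = []
    end = n
    while end > 0:
        start = last[end]
        if start > 0:
            changepoints.append(int(start))
        end = start
    return sorted(changepoints)


if __name__ == "__main__":
    series = [0.0] * 20 + [5.0] * 20
    print(optimal_partitioning(series, penalty=10.0))
```

Raising the penalty makes extra segments more expensive, so fewer changepoints are reported. Swapping in a different cost function or penalty changes the objective without changing the search, which is the kind of comparison the abstract describes.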

preprint, 2023, arXiv

Big Ideas in Sports Analytics and Statistical Tools for their Investigation

Sports analytics -- broadly defined as the pursuit of improvement in athletic performance through the analysis of data -- has expanded its footprint both in the professional sports industry and in academia over the past 30 years. In this paper, we connect four big ideas that are common across multiple sports: the expected value of a game state, win probability, measures of team strength, and the use of sports betting market data. For each, we explore both the shared similarities and individual idiosyncrasies of analytical approaches in each sport. While our focus is on the concepts underlying each type of analysis, any implementation necessarily involves statistical methodologies, computational tools, and data sources. Where appropriate, we outline how data, models, tools, and knowledge of the sport combine to generate actionable insights. We also describe opportunities to share analytical work, but omit an in-depth discussion of individual player evaluation as beyond our scope. This paper should serve as a useful overview for anyone becoming interested in the study of sports analytics.

preprint, 2022, arXiv

Integrating data science ethics into an undergraduate major: A case study

We present a programmatic approach to incorporating ethics into an undergraduate major in statistical and data sciences. We discuss departmental-level initiatives designed to meet the National Academy of Sciences recommendation for integrating ethics into the curriculum from top-to-bottom as our majors progress from our introductory courses to our senior capstone course, as well as from side-to-side through co-curricular programming. We also provide six examples of data science ethics modules used in five different courses at our liberal arts college, each focusing on a different ethical consideration. The modules are designed to be portable such that they can be flexibly incorporated into existing courses at different levels of instruction with minimal disruption to syllabi. We connect our efforts to a growing body of literature on the teaching of data science ethics, present assessments of our effectiveness, and conclude with next steps and final thoughts.

preprint, 2021, arXiv

Facilitating team-based data science: lessons learned from the DSC-WAV project

While coursework provides undergraduate data science students with some relevant analytic skills, many are not given the rich experiences with data and computing they need to be successful in the workplace. Additionally, students often have limited exposure to team-based data science and the principles and tools of collaboration that are encountered outside of school. In this paper, we describe the DSC-WAV program, an NSF-funded data science workforce development project in which teams of undergraduate sophomores and juniors work with a local non-profit organization on a data-focused problem. To help students develop a sense of agency and improve confidence in their technical and non-technical data science skills, the project promoted a team-based approach to data science, adopting several processes and tools intended to facilitate this collaboration. Evidence from the project evaluation, including participant survey and interview data, is presented to document the degree to which the project was successful in engaging students in team-based data science, and how the project changed the students' perceptions of their technical and non-technical skills. We also examine opportunities for improvement and offer insight to other data science educators who may want to implement a similar team-based approach to data science projects at their own institutions.

preprint, 2020, arXiv

Creating optimal conditions for reproducible data analysis in R with 'fertile'

The advancement of scientific knowledge increasingly depends on ensuring that data-driven research is reproducible: that two people with the same data obtain the same results. However, while the necessity of reproducibility is clear, there are significant behavioral and technical challenges that impede its widespread implementation, and no clear consensus on standards of what constitutes reproducibility in published research. We present fertile, an R package that focuses on a series of common mistakes programmers make while conducting data science projects in R, primarily through the RStudio integrated development environment. fertile operates in two modes: proactively (to prevent reproducibility mistakes from happening in the first place), and retroactively (analyzing code that is already written for potential problems). Furthermore, fertile is designed to educate users on why their mistakes are problematic and how to fix them.

preprint, 2017, arXiv

Greater data science at baccalaureate institutions

Donoho's JCGS (in press) paper is a spirited call to action for statisticians, who he points out are losing ground in the field of data science by refusing to accept that data science is its own domain. (Or, at least, a domain that is becoming distinctly defined.) He calls on writings by John Tukey, Bill Cleveland, and Leo Breiman, among others, to remind us that statisticians have been dealing with data science for years, and encourages acceptance of the direction of the field while also ensuring that statistics is tightly integrated. As faculty at baccalaureate institutions (where the growth of undergraduate statistics programs has been dramatic), we are keen to ensure statistics has a place in data science and data science education. In his paper, Donoho is primarily focused on graduate education. At our undergraduate institutions, we are considering many of the same questions.

preprint, 2015, arXiv

openWAR: An Open Source System for Evaluating Overall Player Performance in Major League Baseball

Within baseball analytics, there is substantial interest in comprehensive statistics intended to capture overall player performance. One such measure is Wins Above Replacement (WAR), which aggregates the contributions of a player in each facet of the game: hitting, pitching, baserunning, and fielding. However, current versions of WAR depend upon proprietary data, ad hoc methodology, and opaque calculations. We propose a competitive aggregate measure, openWAR, that is based upon public data and methodology with greater rigor and transparency. We discuss a principled standard for the nebulous concept of a "replacement" player. Finally, we use simulation-based techniques to provide interval estimates for our openWAR measure.
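As a rough illustration of how simulation can turn a point estimate into an interval, the Python sketch below resamples a player's per-play contributions with replacement and reports a percentile interval for the season total. The abstract does not specify the openWAR procedure, so this should not be read as the authors' method: the bootstrap scheme, the function name, and the per-play values are all hypothetical.

```python
import random


def bootstrap_total_interval(play_values, n_sims=2000, level=0.95, seed=0):
    """Percentile interval for the sum of per-play values via resampling.

    Hypothetical helper: play_values holds one (assumed) run value per play.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(play_values)
    # Each simulated season redraws n plays with replacement and sums them.
    totals = sorted(sum(rng.choices(play_values, k=n)) for _ in range(n_sims))
    alpha = (1.0 - level) / 2.0
    lower = totals[int(alpha * (n_sims - 1))]
    upper = totals[int((1.0 - alpha) * (n_sims - 1))]
    return lower, upper


if __name__ == "__main__":
    # Made-up per-play values, for illustration only.
    plays = [0.3, -0.1, 0.0, 0.5, -0.2, 0.1, 0.4, -0.3, 0.2, 0.0]
    print(bootstrap_total_interval(plays))
```

The width of the interval reflects play-to-play variability: a player whose contributions are identical on every play would get an interval of zero width.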

preprint, 2015, arXiv

Setting the stage for data science: integration of data management skills in introductory and second courses in statistics

Many have argued that statistics students need additional facility to express statistical computations. By introducing students to commonplace tools for data management, visualization, and reproducible analysis in data science and applying these to real-world scenarios, we prepare them to think statistically. In an era of increasingly big data, it is imperative that students develop data-related capacities, beginning with the introductory course. We believe that the integration of these precursors to data science into our curricula, early and often, will help statisticians be part of the dialogue regarding "Big Data" and "Big Questions".
