Source author record

Dianne Cook

Dianne Cook appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Applications, Computation, hep-ex, Methodology, astro-ph.IM, hep-ph, Neural and Evolutionary Computing, physics.data-an, stat.OT

Catalog footprint

What is connected

10 works

9 topics

4 close collaborators


preprint, 2026, arXiv

Squintability and Other Metrics for Assessing Projection Pursuit Indexes, and Guiding Optimization Choices

The projection pursuit (PP) guided tour optimizes a criterion function, known as the PP index, to gradually reveal projections of interest from high-dimensional data through animation. Optimization of some PP indexes can be non-trivial, if they are non-smooth functions, or when the optimum has a small "squint angle", detectable only from close proximity. Here, measures for calculating the smoothness and squintability properties of the PP index are defined. These are used to investigate the performance of a recently introduced swarm-based algorithm, Jellyfish Search Optimizer (JSO), for optimizing PP indexes. The performance of JSO in detecting the target pattern (pipe shape) is compared with existing optimizers in PP. Additionally, JSO's performance on detecting the sine-wave shape is evaluated using different PP indexes (hence different smoothness and squintability) across various data dimensions (d = 4, 6, 8, 10, 12) and JSO hyper-parameters. We observe empirically that higher squintability improves the success rate of the PP index optimization, while smoothness has no significant effect. The JSO algorithm has been implemented in the R package, `tourr`, and functions to calculate smoothness and squintability measures are implemented in the `ferrn` package.

preprint, 2022, arXiv

A Journey from Wild to Textbook Data to Reproducibly Refresh the Wages Data from the National Longitudinal Survey of Youth Database

Textbook data is essential for teaching statistics and data science methods because it is clean, allowing the instructor to focus on methodology. Ideally, textbook data sets are refreshed regularly, especially when they are subsets taken from an on-going data collection. It is also important to use contemporary data for teaching, to imbue the sense that the methodology is relevant today. This paper describes the trials and tribulations of refreshing a textbook data set on wages, extracted from the National Longitudinal Survey of Youth (NLSY79) in the early 1990s. The data is useful for teaching modeling and exploratory analysis of longitudinal data. Subsets of NLSY79, including the wages data, can be found in supplementary files from numerous textbooks and research articles. The NLSY79 database has been continuously updated through to 2018, so new records are available. Here we describe our journey to refresh the wages data, and document the process so that the data can be regularly updated into the future. Our journey was difficult because the steps and decisions taken to get from the raw data to the wages textbook subset have not been clearly articulated. We have been diligent in providing a reproducible workflow for others to follow, which we hope also inspires more attempts at refreshing data for teaching. Three new data sets and the code to produce them are provided in the open source R package called `yowie`.

preprint, 2022, arXiv

A Study on a User-Controlled Radial Tour for Variable Importance in High-Dimensional Data

Principal component analysis is a long-standing go-to method for exploring multivariate data. The principal components are linear combinations of the original variables, ordered by descending variance. The first few components typically provide a good visual summary of the data. Tours also make linear projections of the original variables but offer many different views, like examining the data from different directions. The grand tour shows a smooth sequence of projections as an animation following interpolations between random target bases. The manual radial tour rotates the selected variable's contribution into and out of a projection. This allows the importance of the variable to structure in the projection to be assessed. This work describes a mixed-design user study evaluating the radial tour's efficacy compared with principal component analysis and the grand tour. A supervised classification task is assigned to participants who evaluate variable attribution of the separation between two classes. Their accuracy in assigning the variable importance is measured across various factors. Data were collected from 108 crowdsourced participants, who performed two trials with each visual for 648 trials in total. Mixed model regression finds strong evidence that the radial tour results in a large increase in accuracy over the alternatives. Participants also reported a preference for the radial tour in comparison to the other two methods.

preprint, 2022, arXiv

Hole or grain? A Section Pursuit Index for Finding Hidden Structure in Multiple Dimensions

Multivariate data is often visualized using linear projections, produced by techniques such as principal component analysis, linear discriminant analysis, and projection pursuit. A problem with projections is that they obscure low and high density regions near the center of the distribution. Sections, or slices, can help to reveal them. This paper develops a section pursuit method, building on the extensive work in projection pursuit, to search for interesting slices of the data. Linear projections are used to define sections of the parameter space, and to calculate interestingness by comparing the distribution of observations inside and outside a section. By optimizing this index, it is possible to reveal features such as holes (low density) or grains (high density). The optimization is incorporated into a guided tour so that the search for structure can be dynamic. The approach can be useful for problems where data distributions depart from uniform or normal, as in visually exploring nonlinear manifolds and functions in multivariate space. Two applications of section pursuit are shown: exploring decision boundaries from classification models, and exploring subspaces induced by complex inequality conditions from a multiple-parameter model. The new methods are available in R, in the tourr package.

preprint, 2020, arXiv

Using tours to visually investigate properties of new projection pursuit indexes with application to problems in physics

Projection pursuit is used to find interesting low-dimensional projections of high-dimensional data by optimizing an index over all possible projections. Most indexes have been developed to detect departure from known distributions, such as normality, or to find separations between known groups. Here, we are interested in finding projections revealing potentially complex bivariate patterns, using new indexes constructed from scagnostics and a maximum information coefficient, with a purpose to detect unusual relationships between model parameters describing physics phenomena. The performance of these indexes is examined with respect to ideal behaviour, using simulated data, and then applied to problems from gravitational wave astronomy. The implementation builds upon the projection pursuit tools available in the R package, tourr, with indexes constructed from code in the R packages, scagnostics, minerva and mbgraphic.
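The idea of optimizing an index over projections can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the index is a toy measure of departure from normality (absolute excess kurtosis of a 1D projection), the optimizer is plain random search, and the function names are hypothetical. It is not one of the scagnostics or maximum information coefficient indexes described in the abstract, nor the optimization code in tourr.

```python
import random
from math import sqrt


def unit(v):
    """Scale a vector to unit length so it defines a 1D projection direction."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(data, direction):
    """Project each row of the data onto the given direction."""
    return [sum(x * w for x, w in zip(row, direction)) for row in data]


def toy_index(z):
    """Toy index (an assumption): absolute excess kurtosis of the projected values.

    It is close to 0 for normally distributed projections and grows as the
    projected distribution departs from normality.
    """
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    if var == 0:
        return 0.0
    m4 = sum((v - mean) ** 4 for v in z) / n
    return abs(m4 / var ** 2 - 3.0)


def random_search(data, n_iter=500, seed=0):
    """Keep the random unit direction with the largest index value."""
    rng = random.Random(seed)
    p = len(data[0])
    best = unit([rng.gauss(0, 1) for _ in range(p)])
    best_val = toy_index(project(data, best))
    for _ in range(n_iter):
        cand = unit([rng.gauss(0, 1) for _ in range(p)])
        val = toy_index(project(data, cand))
        if val > best_val:
            best, best_val = cand, val
    return best, best_val
```

Random search is used here only because it is short to write; it makes no use of the index surface, which is one reason practical projection pursuit relies on more careful optimizers.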

preprint, 2014, arXiv

An algorithm for deciding the number of clusters and validation using simulated data with application to exploring crop population structure

A first step in exploring population structure in crop plants and other organisms is to define the number of subpopulations that exist for a given data set. The genetic marker data sets being generated have become increasingly large over time and commonly fall into the high-dimension, low-sample-size (HDLSS) setting. An algorithm for deciding the number of clusters is proposed, and is validated on simulated data sets varying in both the level of structure and the number of clusters, covering the range of variation observed empirically. The algorithm was then tested on six empirical data sets across three small grain species. The algorithm uses bootstrapping and three methods of clustering, and defines the optimum number of clusters based on a common criterion, Hubert's gamma statistic. Validation on simulated sets coupled with testing on empirical sets suggests that the algorithm can be used for a wide variety of genetic data sets.
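As a rough illustration of the criterion named above, the following sketch computes one common normalized form of Hubert's gamma: the Pearson correlation between pairwise distances and a 0/1 indicator of whether two observations are assigned to different clusters. The abstract does not state which variant the paper uses, so treat the exact form, the Euclidean distance, and the function names as assumptions. The bootstrapping and the three clustering methods are not reproduced here.

```python
from itertools import combinations
from math import dist, sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return sxy / sqrt(sxx * syy)


def hubert_gamma(points, labels):
    """Correlate pairwise distances with a 'different cluster' indicator.

    Values near 1 mean that distant pairs tend to fall in different clusters
    and close pairs in the same cluster.
    """
    pairs = list(combinations(range(len(points)), 2))
    distances = [dist(points[i], points[j]) for i, j in pairs]
    different = [1.0 if labels[i] != labels[j] else 0.0 for i, j in pairs]
    return pearson(distances, different)
```

To choose a number of clusters under these assumptions, one would compute the statistic for each candidate clustering and prefer the one with the largest value.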

preprint, 2014, arXiv

Enabling Interactivity on Displays of Multivariate Time Series and Longitudinal Data

Temporal data is information measured in the context of time. This contextual structure provides components that need to be explored to understand the data and that can form the basis of interactions applied to the plots. In multivariate time series we expect to see temporal dependence, long term and seasonal trends and cross-correlations. In longitudinal data we also expect within and between subject dependence. Time series and longitudinal data, although analyzed differently, are often plotted using similar displays. We provide a taxonomy of interactions on plots that can enable exploring temporal components of these data types, and describe how to build these interactions using data transformations. Because temporal data is often accompanied by other types of data, we also describe how to link the temporal plots with other displays of data. The ideas are conceptualized into a data pipeline for temporal data, and implemented in the R package cranvas. This package provides many different types of interactive graphics that can be used together to explore data or diagnose a model fit.

preprint, 2014, arXiv

Four Papers on Contemporary Software Design Strategies for Statistical Methodologists

Software design impacts much of statistical analysis and, as technology changes, dramatically so in recent years, it is exciting to learn how statistical software is adapting and changing. This leads to the collection of papers published here, written by John Chambers, Duncan Temple Lang, Michael Lawrence, Martin Morgan, Yihui Xie, Heike Hofmann and Xiaoyue Cheng.

preprint, 2014, arXiv

Human Factors Influencing Visual Statistical Inference

Visual statistical inference is a way to determine significance of patterns found while exploring data. It depends on human observers evaluating a lineup, in which a data plot is placed among a sample of null plots. Each individual is different in their cognitive psychology and judiciousness, which can affect the visual inference. The usual way to estimate the effectiveness of a statistical test is its power. The estimate of power of a lineup can be controlled by combining evaluations from multiple observers. Factors that may also affect the power of visual inference are the observers' demographics, visual skills, and experience, the sample of null plots taken from the null distribution, the position of the data plot in the lineup, and the signal strength in the data. This paper examines these factors. Results from multiple visual inference studies using Amazon's Mechanical Turk are examined to provide an assessment of these. The experiments suggest that individual skills vary substantially, but demographics do not have a huge effect on performance. There is evidence of a learning effect, but only in that observers get faster with repeated evaluations, not more often correct. The placement of the data plot in the lineup does not affect the inference.

preprint, 2014, arXiv

Utilizing Distance Metrics on Lineups to Examine What People Read From Data Plots

Graphics play a crucial role in statistical analysis and data mining. This paper describes metrics developed to assist the use of lineups for making inferential statements. Lineups embed the plot of the data among a set of null plots, and engage a human observer to select the plot that is most different from the rest. If the data plot is selected, it corresponds to the rejection of a null hypothesis. Metrics are calculated in association with lineups, to measure the quality of the lineup, and help to understand what people see in the data plots. The null plots represent a finite sample from a null distribution, and the selected sample potentially affects the ease or difficulty of a lineup. Distance metrics are designed to describe how close the true data plot is to the null plots, and how close the null plots are to each other. The distribution of the distance metrics is studied to learn how well it matches what people detect in the plots, and the effect of the null-generating mechanism and plot choices for particular tasks. The analysis was conducted on data already collected from Amazon Turk studies that used lineups to study an array of data analysis tasks.
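The link between selecting the data plot and rejecting a null hypothesis can be made concrete with a small sketch. It assumes a simple binomial model: under the null, each observer independently picks the data plot with probability 1/m in a lineup of m plots. The independence assumption, the default lineup size of 20, and the function name are illustrative assumptions, not details taken from the abstract.

```python
from math import comb


def lineup_p_value(n_observers: int, n_picks: int, n_plots: int = 20) -> float:
    """P(X >= n_picks) for X ~ Binomial(n_observers, 1 / n_plots).

    n_picks is the number of observers who selected the data plot.
    """
    if n_plots < 2:
        raise ValueError("a lineup needs at least two plots")
    if not 0 <= n_picks <= n_observers:
        raise ValueError("n_picks must be between 0 and n_observers")
    p = 1.0 / n_plots
    return sum(
        comb(n_observers, k) * p ** k * (1 - p) ** (n_observers - k)
        for k in range(n_picks, n_observers + 1)
    )
```

With a single observer and a lineup of 20 plots, picking the data plot gives a p-value of 1/20 under this model, which is why combining several observers is needed to obtain smaller p-values.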
