Researcher profile

Christopher M. Danforth

Christopher M. Danforth contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
12topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2025arXiv

Statistical laws and linguistics inform meaning in naturalistic and fictional conversation

Conversation is a cornerstone of social connection and is linked to well-being outcomes. Conversations vary widely in type with some portion generating complex, dynamic stories. One approach to studying how conversations unfold in time is through statistical patterns such as Heaps' law, which holds that vocabulary size scales with document length. Little work on Heaps' law has looked at conversation and considered how language features impact scaling. We measure Heaps' law for conversations recorded in two distinct mediums: 1. Strangers brought together on video chat and 2. Fictional characters in movies. We find that scaling of vocabulary size differs by parts of speech. We discuss these findings through behavioral and linguistic frameworks.

preprint2022arXiv

Sirius: Visualization of Mixed Features as a Mutual Information Network Graph

Data scientists across disciplines are increasingly in need of exploratory analysis tools for data sets with a high volume of features of mixed data type (quantitative continuous and discrete categorical). We introduce Sirius, a novel visualization package for researchers to explore feature relationships among mixed data types using mutual information. The visualization of feature relationships aids data scientists in finding meaningful dependence among features prior to the development of predictive modeling pipelines, which can inform downstream analysis such as feature selection, feature extraction, and early detection of potential proxy variables. Using an information theoretic approach, Sirius supports network visualization of heterogeneous data sets (consisting of continuous and discrete data types), and provides a user interface for exploring feature pairs with locally significant mutual information scores. Mutual information algorithm and bivariate chart types are assigned on a data type pairing basis (continuous-continuous, discrete-discrete, and discrete-continuous). We show how this tool can be used for tasks such as hypothesis confirmation, identification of predictive features, suggestions for feature extraction, or early warning of data abnormalities. The accompanying website for this paper can be accessed at https://sirius.universalities.com/. All code and supplemental materials can be accessed at https://osf.io/pdm9r/.

preprint2022arXiv

Spatial changes in park visitation at the onset of the pandemic

The COVID-19 pandemic disrupted the mobility patterns of a majority of Americans beginning in March 2020. Despite the beneficial, socially distanced activity offered by outdoor recreation, confusing and contradictory public health messaging complicated access to natural spaces. Working with a dataset comprising the locations of roughly 50 million distinct mobile devices in 2019 and 2020, we analyze weekly visitation patterns for 8,135 parks across the United States. Using Bayesian inference, we identify regions that experienced a substantial change in visitation in the first few weeks of the pandemic. We find that regions that did not exhibit a change were likely to have smaller populations, and to have voted more republican than democrat in the 2020 elections. Our study contributes to a growing body of literature using passive observations to explore who benefits from access to nature.

preprint2021arXiv

Augmenting semantic lexicons using word embeddings and transfer learning

Sentiment-aware intelligent systems are essential to a wide array of applications. These systems are driven by language models which broadly fall into two paradigms: Lexicon-based and contextual. Although recent contextual models are increasingly dominant, we still see demand for lexicon-based models because of their interpretability and ease of use. For example, lexicon-based models allow researchers to readily determine which words and phrases contribute most to a change in measured sentiment. A challenge for any lexicon-based approach is that the lexicon needs to be routinely expanded with new words and expressions. Here, we propose two models for automatic lexicon expansion. Our first model establishes a baseline employing a simple and shallow neural network initialized with pre-trained word embeddings using a non-contextual approach. Our second model improves upon our baseline, featuring a deep Transformer-based network that brings to bear word definitions to estimate their lexical polarity. Our evaluation shows that both models are able to score new words with a similar accuracy to reviewers from Amazon Mechanical Turk, but at a fraction of the cost.

preprint2021arXiv

The sociospatial factors of death: Analyzing effects of geospatially-distributed variables in a Bayesian mortality model for Hong Kong

Human mortality is in part a function of multiple socioeconomic factors that differ both spatially and temporally. Adjusting for other covariates, the human lifespan is positively associated with household wealth. However, the extent to which mortality in a geographical region is a function of socioeconomic factors in both that region and its neighbors is unclear. There is also little information on the temporal components of this relationship. Using the districts of Hong Kong over multiple census years as a case study, we demonstrate that there are differences in how wealth indicator variables are associated with longevity in (a) areas that are affluent but neighbored by socially deprived districts versus (b) wealthy areas surrounded by similarly wealthy districts. We also show that the inclusion of spatially-distributed variables reduces uncertainty in mortality rate predictions in each census year when compared with a baseline model. Our results suggest that geographic mortality models should incorporate nonlocal information (e.g., spatial neighbors) to lower the variance of their mortality estimates, and point to a more in-depth analysis of sociospatial spillover effects on mortality rates.

preprint2020arXiv

Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution

It is tempting to treat frequency trends from the Google Books data sets as indicators of the "true" popularity of various words and phrases. Doing so allows us to draw quantitatively strong conclusions about the evolution of cultural perception of a given topic, such as time or gender. However, the Google Books corpus suffers from a number of limitations which make it an obscure mask of cultural popularity. A primary issue is that the corpus is in effect a library, containing one of each book. A single, prolific author is thereby able to noticeably insert new phrases into the Google Books lexicon, whether the author is widely read or not. With this understood, the Google Books corpus remains an important data set to be considered more lexicon-like than text-like. Here, we show that a distinct problematic feature arises from the inclusion of scientific texts, which have become an increasingly substantive portion of the corpus throughout the 1900s. The result is a surge of phrases typical to academic articles but less common in general, such as references to time in the form of citations. We highlight these dynamics by examining and comparing major contributions to the statistical divergence of English data sets between decades in the period 1800--2000. We find that only the English Fiction data set from the second version of the corpus is not heavily affected by professional texts, in clear contrast to the first version of the fiction data set and both unfiltered English data sets. Our findings emphasize the need to fully characterize the dynamics of the Google Books corpus before using these data sets to draw broad conclusions about cultural and linguistic evolution.

preprint2020arXiv

Divergent modes of online collective attention to the COVID-19 pandemic are associated with future caseload variance

Using a random 10% sample of tweets authored from 2019-09-01 through 2020-04-30, we analyze the dynamic behavior of words (1-grams) used on Twitter to describe the ongoing COVID-19 pandemic. Across 24 languages, we find two distinct dynamic regimes: One characterizing the rise and subsequent collapse in collective attention to the initial Coronavirus outbreak in late January, and a second that represents March COVID-19-related discourse. Aggregating countries by dominant language use, we find that volatility in the first dynamic regime is associated with future volatility in new cases of COVID-19 roughly three weeks (average 22.49 $\pm$ 3.26 days) later. Our results suggest that surveillance of change in usage of epidemiology-related words on social media may be useful in forecasting later change in disease case numbers, but we emphasize that our current findings are not causal or necessarily predictive.

preprint2020arXiv

Generalized Word Shift Graphs: A Method for Visualizing and Explaining Pairwise Comparisons Between Texts

A common task in computational text analyses is to quantify how two corpora differ according to a measurement like word frequency, sentiment, or information content. However, collapsing the texts' rich stories into a single number is often conceptually perilous, and it is difficult to confidently interpret interesting or unexpected textual patterns without looming concerns about data artifacts or measurement validity. To better capture fine-grained differences between texts, we introduce generalized word shift graphs, visualizations which yield a meaningful and interpretable summary of how individual words contribute to the variation between two texts for any measure that can be formulated as a weighted average. We show that this framework naturally encompasses many of the most commonly used approaches for comparing texts, including relative frequencies, dictionary scores, and entropy-based measures like the Kullback-Leibler and Jensen-Shannon divergences. Through several case studies, we demonstrate how generalized word shift graphs can be flexibly applied across domains for diagnostic investigation, hypothesis generation, and substantive interpretation. By providing a detailed lens into textual shifts between corpora, generalized word shift graphs help computational social scientists, digital humanists, and other text analysis practitioners fashion more robust scientific narratives.

preprint2020arXiv

Local information sources received the most attention from Puerto Ricans during the aftermath of Hurricane María

In September 2017, Hurricane María made landfall across the Caribbean region as a category 4 storm. In the aftermath, many residents of Puerto Rico were without power or clean running water for nearly a year. Using both English and Spanish tweets from September 16 to October 15 2017, we investigate discussion of María both on and off the island, constructing a proxy for the temporal network of communication between victims of the hurricane and others. We use information theoretic tools to compare the lexical divergence of different subgroups within the network. Lastly, we quantify temporal changes in user prominence throughout the event. We find at the global level that Spanish tweets more often contained messages of hope and a focus on those helping. At the local level, we find that information propagating among Puerto Ricans most often originated from sources local to the island, such as journalists and politicians. Critically, content from these accounts overshadows content from celebrities, global news networks, and the like for the large majority of the time period studied. Our findings reveal insight into ways social media campaigns could be deployed to disseminate relief information during similar events in the future.

preprint2020arXiv

Noncooperative dynamics in election interference

Foreign power interference in domestic elections is an existential threat to societies. Manifested through myriad methods from war to words, such interference is a timely example of strategic interaction between economic and political agents. We model this interaction between rational game players as a continuous-time differential game, constructing an analytical model of this competition with a variety of payoff structures. All-or-nothing attitudes by only one player regarding the outcome of the game lead to an arms race in which both countries spend increasing amounts on interference and counter-interference operations. We then confront our model with data pertaining to the Russian interference in the 2016 United States presidential election contest. We introduce and estimate a Bayesian structural time series model of election polls and social media posts by Russian Twitter troll accounts. Our analytical model, while purposefully abstract and simple, adequately captures many temporal characteristics of the election and social media activity. We close with a discussion of our model's shortcomings and suggestions for future research.

preprint2020arXiv

The sleep loss insult of Spring Daylight Savings in the US is absorbed by Twitter users within 48 hours

Sleep loss has been linked to heart disease, diabetes, cancer, and an increase in accidents, all of which are among the leading causes of death in the United States. Population-scale sleep studies have the potential to advance public health by helping to identify at-risk populations, changes in collective sleep patterns, and to inform policy change. Prior research suggests other kinds of health indicators such as depression and obesity can be estimated using social media activity. However, the inability to effectively measure collective sleep with publicly available data has limited large-scale academic studies. Here, we investigate the passive estimation of sleep loss through a proxy analysis of Twitter activity profiles. We use "Spring Forward" events, which occur at the beginning of Daylight Savings Time in the United States, as a natural experimental condition to estimate spatial differences in sleep loss across the United States. On average, peak Twitter activity occurs roughly 45 minutes later on the Sunday following Spring Forward. By Monday morning however, activity curves are realigned with the week before, suggesting that at least on Twitter, the lost hour of early Sunday morning has been quickly absorbed.

preprint2019arXiv

Fragmentation and inefficiencies in US equity markets: Evidence from the Dow 30

Using the most comprehensive source of commercially available data on the US National Market System, we analyze all quotes and trades associated with Dow 30 stocks in 2016 from the vantage point of a single and fixed frame of reference. We find that inefficiencies created in part by the fragmentation of the equity marketplace are relatively common and persist for longer than what physical constraints may suggest. Information feeds reported different prices for the same equity more than 120 million times, with almost 64 million dislocation segments featuring meaningfully longer duration and higher magnitude. During this period, roughly 22% of all trades occurred while the SIP and aggregated direct feeds were dislocated. The current market configuration resulted in a realized opportunity cost totaling over $160 million when compared with a single feed, single exchange alternative---a conservative estimate that does not take into account intra-day offsetting events.

preprint2019arXiv

Hahahahaha, Duuuuude, Yeeessss!: A two-parameter characterization of stretchable words and the dynamics of mistypings and misspellings

Stretched words like `heellllp' or `heyyyyy' are a regular feature of spoken language, often used to emphasize or exaggerate the underlying meaning of the root word. While stretched words are rarely found in formal written language and dictionaries, they are prevalent within social media. In this paper, we examine the frequency distributions of `stretchable words' found in roughly 100 billion tweets authored over an 8 year period. We introduce two central parameters, `balance' and `stretch', that capture their main characteristics, and explore their dynamics by creating visual tools we call `balance plots' and `spelling trees'. We discuss how the tools and methods we develop here could be used to study the statistical patterns of mistypings and misspellings, along with the potential applications in augmenting dictionaries, improving language processing, and in any area where sequence construction matters, such as genetics.

preprint2017arXiv

Divergent discourse between protests and counter-protests: #BlackLivesMatter and #AllLivesMatter

Since the shooting of Black teenager Michael Brown by White police officer Darren Wilson in Ferguson, Missouri, the protest hashtag #BlackLivesMatter has amplified critiques of extrajudicial killings of Black Americans. In response to #BlackLivesMatter, other Twitter users have adopted #AllLivesMatter, a counter-protest hashtag whose content argues that equal attention should be given to all lives regardless of race. Through a multi-level analysis of over 860,000 tweets, we study how these protests and counter-protests diverge by quantifying aspects of their discourse. We find that #AllLivesMatter facilitates opposition between #BlackLivesMatter and hashtags such as #PoliceLivesMatter and #BlueLivesMatter in such a way that historically echoes the tension between Black protesters and law enforcement. In addition, we show that a significant portion of #AllLivesMatter use stems from hijacking by #BlackLivesMatter advocates. Beyond simply injecting #AllLivesMatter with #BlackLivesMatter content, these hijackers use the hashtag to directly confront the counter-protest notion of "All lives matter." Our findings suggest that Black Lives Matter movement was able to grow, exhibit diverse conversations, and avoid derailment on social media by making discussion of counter-protest opinions a central topic of #AllLivesMatter, rather than the movement itself.