Source author record

Alexander Spangher

Alexander Spangher appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computation and Language, Digital Libraries, Machine Learning, Artificial Intelligence, Human-Computer Interaction, physics.plasm-ph

Catalog footprint

What is connected

8 works

6 topics

4 close collaborators

preprint, 2023, arXiv

Autoregressive Transformers for Disruption Prediction in Nuclear Fusion Plasmas

The physical sciences require models tailored to the specific nuances of different dynamics. In this work, we study outcome prediction in nuclear fusion tokamaks, where a major challenge is disruptions, or the loss of plasma stability with damaging implications for the tokamak. Although disruptions are difficult to model using physical simulations, machine learning (ML) models have shown promise in predicting these phenomena. Here, we first study several variations on masked autoregressive transformers, achieving an average 5% increase in the Area Under the Receiver Operating Characteristic metric over existing methods. We then compare transformer models to limited-context neural networks in order to shed light on the "memory" of plasma affected by tokamak controls. With these model comparisons, we argue for the persistence of a memory throughout the plasma, in the context of tokamaks, that our model exploits.

preprint, 2023, arXiv

Sequentially Controlled Text Generation

While GPT-2 generates sentences that are remarkably human-like, longer documents can ramble and do not follow human-like writing structure. We study the problem of imposing structure on long-range text. We propose a novel controlled text generation task, sequentially controlled text generation, and identify a dataset, NewsDiscourse, as a starting point for this task. We develop a sequentially controlled text generation pipeline with generation and editing. We test different degrees of structural awareness and show that, in general, more structural awareness results in higher control accuracy, grammaticality, coherency and topicality, approaching human-level writing performance.

preprint, 2022, arXiv

If it Bleeds, it Leads: A Computational Approach to Covering Crime in Los Angeles

Developing and improving computational approaches to covering news can increase journalistic output and improve the way stories are covered. In this work, we approach the problem of covering crime stories in Los Angeles. We present a machine-in-the-loop system that covers individual crimes by (1) learning prototypical coverage archetypes, and their structure, from classical news articles on crime and (2) using output from the Los Angeles Police Department to generate "lede paragraphs", the first structural unit of crime articles. We introduce a probabilistic graphical model for learning article structure and a rule-based system for generating ledes. We hope our work can lead to systems that use these components together to form the skeletons of news articles covering crime. This work was done for a class project in Jonathan May's Advanced Natural Language Processing course, Fall 2019.

preprint, 2022, arXiv

NewsEdits: A Dataset of Revision Histories for News Articles (Technical Report: Data Processing)

News article revision histories have the potential to give us novel insights across varied fields of linguistics and social sciences. In this work, we present, to our knowledge, the first publicly available dataset of news article revision histories, or NewsEdits. Our dataset is multilingual; it contains 1,278,804 articles with 4,609,430 versions from over 22 English- and French-language newspaper sources based in three countries. Across version pairs, we count 10.9 million added sentences, 8.9 million changed sentences and 6.8 million removed sentences. Within the changed sentences, we derive 72 million atomic edits. NewsEdits is, to our knowledge, the largest corpus of revision histories in any domain.

preprint, 2022, arXiv

NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge

News article revision histories provide clues to narrative and factual evolution in news articles. To facilitate analysis of this evolution, we present the first publicly available dataset of news revision histories, NewsEdits. Our dataset is large-scale and multilingual; it contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources based in three countries, spanning 15 years of coverage (2006-2021). We define article-level edit actions: Addition, Deletion, Edit and Refactor, and develop a high-accuracy extraction algorithm to identify these actions. To underscore the factual nature of many edit actions, we conduct analyses showing that added and deleted sentences are more likely to contain updating events, main content and quotes than unchanged sentences. Finally, to explore whether edit actions are predictable, we introduce three novel tasks aimed at predicting actions performed during version updates. We show that these tasks are possible for expert humans but are challenging for large NLP models. We hope this can spur research in narrative framing and help provide predictive tools for journalists chasing breaking news.

preprint, 2022, arXiv

StateCensusLaws.org: A Web Application for Consuming and Annotating Legal Discourse Learning

In this work, we create a web application to highlight the output of NLP models trained to parse and label discourse segments in law text. Our system is built primarily with journalists and legal interpreters in mind, and we focus on state-level law that uses U.S. Census population numbers to allocate resources and organize government. Our system exposes a corpus we collect of 6,000 state-level laws that pertain to the U.S. census, using 25 scrapers we built to crawl state law websites, which we release. We also build a novel, flexible annotation framework that can handle span-tagging and relation tagging on an arbitrary input text document and be embedded simply into any webpage. This framework allows journalists and researchers to add to our annotation database by correcting and tagging new data.

preprint, 2021, arXiv

Multitask Learning for Class-Imbalanced Discourse Classification

Small class-imbalanced datasets, common in many high-level semantic tasks like discourse analysis, present a particular challenge to current deep-learning architectures. In this work, we perform an extensive analysis of sentence-level classification approaches for the News Discourse dataset, one of the largest high-level semantic discourse datasets recently published. We show that a multitask approach can improve Micro F1-score by 7% over current state-of-the-art benchmarks, due in part to label corrections across tasks, which improve performance for underrepresented classes. We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP, and show that none of these approaches can improve classification accuracy in such a setting.

preprint, 2015, arXiv

Bayesian Nonparametrics in Topic Modeling: A Brief Tutorial

Nonparametric methods have been increasingly explored in Bayesian hierarchical modeling as a way to increase model flexibility. Although the field shows a lot of promise, inference in many models, including Hierarchical Dirichlet Processes (HDP), remains prohibitively slow. One promising path forward is to exploit the submodularity inherent in the Indian Buffet Process (IBP) to derive near-optimal solutions in polynomial time. In this work, I will present a brief tutorial on Bayesian nonparametric methods, especially as they are applied to topic modeling. I will show a comparison between different nonparametric models and the current state-of-the-art parametric model, Latent Dirichlet Allocation (LDA).
