Source author record

Elliott Ash

Elliott Ash appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computation and Language Artificial Intelligence econ.GN q-fin.EC

Catalog footprint

What is connected

6works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?

Variation in human annotation (i.e., disagreements) is common in NLP, often reflecting important information like task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive Chain-of-Thought (CoT) reasoning improves the performance of RLHF LLMs (RL from human feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreements are important.

preprint2026arXiv

pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs

PDFs are the second-most used document type on the internet (after HTML). Yet, existing QA datasets commonly start from text sources or only address specific domains. In this paper, we present pdfQA, a multi-domain 2K human-annotated (real-pdfQA) and 2K synthetic dataset (syn-pdfQA) differentiating QA pairs in ten complexity dimensions (e.g., file type, source modality, source position, answer type). We apply and evaluate quality and difficulty filters on both datasets, obtaining valid and challenging QA pairs. We answer the questions with open-source LLMs, revealing existing challenges that correlate with our complexity dimensions. pdfQA presents a basis for end-to-end QA pipeline evaluation, testing diverse skill sets and local optimizations (e.g., in information retrieval or parsing).

preprint2022arXiv

Heroes, Villains, and Victims, and GPT-3: Automated Extraction of Character Roles Without Training Data

This paper shows how to use large-scale pre-trained language models to extract character roles from narrative texts without training data. Queried with a zero-shot question-answering prompt, GPT-3 can identify the hero, villain, and victim in diverse domains: newspaper articles, movie plot summaries, and political speeches.

preprint2022arXiv

MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes

We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at each step with information on the current extraction history. When MemSum iteratively selects sentences into the summary, it considers a broad information set that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum obtains state-of-the-art test-set performance (ROUGE) in summarizing long documents taken from PubMed, arXiv, and GovReport. Ablation studies demonstrate the importance of local, global, and history information. A human evaluation confirms the high quality and low redundancy of the generated summaries, stemming from MemSum's awareness of extraction history.

preprint2022arXiv

RELATIO: Text Semantics Capture Political and Economic Narratives

Social scientists have become increasingly interested in how narratives -- the stories in fiction, politics, and life -- shape beliefs, behavior, and government policies. This paper provides an unsupervised method to quantify latent narrative structures in text documents. Our new software package RELATIO identifies coherent entity groups and maps explicit relations between them in the text. We provide an application to the United States Congressional Record to analyze political and economic narratives in recent decades. Our analysis highlights the dynamics, sentiment, polarization, and interconnectedness of narratives in political discourse.

preprint2022arXiv

The Choice of Knowledge Base in Automated Claim Checking

Automated claim checking is the task of determining the veracity of a claim given evidence found in a knowledge base of trustworthy facts. While previous work has taken the knowledge base as given and optimized the claim-checking pipeline, we take the opposite approach - taking the pipeline as given, we explore the choice of knowledge base. Our first insight is that a claim-checking pipeline can be transferred to a new domain of claims with access to a knowledge base from the new domain. Second, we do not find a "universally best" knowledge base - higher domain overlap of a task dataset and a knowledge base tends to produce better label accuracy. Third, combining multiple knowledge bases does not tend to improve performance beyond using the closest-domain knowledge base. Finally, we show that the claim-checking pipeline's confidence score for selecting evidence can be used to assess whether a knowledge base will perform well for a new set of claims, even in the absence of ground-truth labels.

Elliott Ash

What is connected

Connect this record

See the researcher in context

Building this map preview

6 published item(s)

Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?

pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs

Heroes, Villains, and Victims, and GPT-3: Automated Extraction of Character Roles Without Training Data

MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes

RELATIO: Text Semantics Capture Political and Economic Narratives

The Choice of Knowledge Base in Automated Claim Checking