Source author record

Matthew Lease

Matthew Lease appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Human-Computer Interaction, Computation and Language, Information Retrieval, Artificial Intelligence, cs.CY, Machine Learning, Computational Geometry, Computer Vision, math.OC, Neural and Evolutionary Computing

Catalog footprint: 24 works, 10 topics, 4 close collaborators

preprint, 2026, arXiv

PIE: Performance Interval Estimation for Free-Form Generation Tasks

Confidence estimation infers a probability for whether each model output is correct or not. While predicting such binary correctness is sensible for tasks with exact answers, free-form generation tasks are often more nuanced, with output quality being both fine-grained and multi-faceted. We thus propose Performance Interval Estimation (PIE) to predict both: 1) point estimates for any arbitrary set of continuous-valued evaluation metrics; and 2) calibrated uncertainty intervals around these point estimates. We then compare two approaches: LLM-as-judge vs. classic regression with confidence estimation features. Evaluation over 11 datasets spans summarization, translation, code generation, function-calling, and question answering. Regression is seen to achieve both: i) lower error point estimates of metric scores; and ii) well-calibrated uncertainty intervals. To support reproduction and follow-on work, we share our data and code.
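One common way to turn a regressor's point estimates into calibrated intervals is split conformal prediction, sketched below. This is an illustrative assumption, not necessarily the calibration method used in the paper: the function names, the held-out calibration split, and the [0, 1] metric range are all hypothetical.

```python
import math

def conformal_halfwidth(cal_preds, cal_truths, alpha=0.1):
    """Half-width q so that [pred - q, pred + q] targets (1 - alpha) coverage.

    cal_preds / cal_truths are predicted and true metric scores on a
    held-out calibration set (an assumed setup, not taken from the paper).
    """
    residuals = sorted(abs(p - t) for p, t in zip(cal_preds, cal_truths))
    n = len(residuals)
    # Finite-sample corrected rank used by split conformal prediction.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return residuals[k - 1]

def interval(pred, q, lo=0.0, hi=1.0):
    """Interval around a point estimate, clipped to the metric's assumed range."""
    return max(lo, pred - q), min(hi, pred + q)
```

A wider interval here simply reflects larger residuals on the calibration set, so the same regressor can report tight intervals on one metric and loose ones on another.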

preprint, 2026, arXiv

Who Owns Creativity and Who Does the Work? Trade-offs in LLM-Supported Research Ideation

LLM-based agents offer new potential to accelerate science and reshape research work. However, the quality of researcher contributions can vary significantly depending on human ability to steer agent behaviors. How can we best use these tools to augment scientific creativity without undermining aspects of contribution and ownership that drive research? To investigate this, we developed an agentic research ideation system integrating three roles -- Ideator, Writer, and Evaluator -- across three control levels -- Low, Medium, and Intensive. Our mixed-methods study with 54 researchers suggests three key findings in how LLM-based agents reshape scientific creativity: 1) perceived creativity support does not simply increase linearly with greater control; 2) human effort shifts from ideating to verifying ideas; and 3) ownership becomes a negotiated outcome between human and AI. Our findings suggest that LLM agent design should emphasize researcher empowerment, fostering a sense of ownership over strong ideas rather than reducing researchers to operating an automated AI-driven process.

preprint, 2023, arXiv

The State of Human-centered NLP Technology for Fact-checking

Misinformation threatens modern society by promoting distrust in science, changing narratives in public health, heightening social polarization, and disrupting democratic elections and financial markets, among a myriad of other societal harms. To address this, a growing cadre of professional fact-checkers and journalists provide high-quality investigations into purported facts. However, these largely manual efforts have struggled to match the enormous scale of the problem. In response, a growing body of Natural Language Processing (NLP) technologies have been proposed for more scalable fact-checking. Despite tremendous growth in such research, however, practical adoption of NLP technologies for fact-checking still remains in its infancy today. In this work, we review the capabilities and limitations of the current NLP technologies for fact-checking. Our particular focus is to further chart the design space for how these technologies can be harnessed and refined in order to better meet the needs of human fact-checkers. To do so, we review key aspects of NLP-based fact-checking: task formulation, dataset construction, modeling, and human-centered strategies, such as explainable models and human-in-the-loop approaches. Next, we review the efficacy of applying NLP-based fact-checking tools to assist human fact-checkers. We recommend that future research include collaboration with fact-checker stakeholders early on in NLP research, as well as incorporation of human-centered design practices in model development, in order to further guide technology development for human use and practical adoption. Finally, we advocate for more research on benchmark development supporting extrinsic evaluation of human-centered fact-checking technologies.

preprint, 2023, arXiv

Voices of Workers: Why a Worker-Centered Approach to Crowd Work Is Challenging

How can we better understand the broad, diverse, shifting, and invisible crowd workforce, so that we can better support it? We present findings from online observations and analysis of publicly available postings from a community forum of crowd workers. In particular, we observed recurring tensions between crowd workers and journalists regarding media depictions of crowd work. We found that crowd diversity makes any one-dimensional representation inadequate in addressing the wide-ranging experiences of crowd work. We argue that the scale, diversity, invisibility, and the crowds' resistance to publicity make a worker-centered approach to crowd work particularly challenging, necessitating better understanding the diversity of workers and their lived experiences.

preprint, 2022, arXiv

Designing Closed Human-in-the-loop Deferral Pipelines

In hybrid human-machine deferral frameworks, a classifier can defer uncertain cases to human decision-makers (who are often themselves fallible). Prior work on simultaneous training of such classifier and deferral models has typically assumed access to an oracle during training to obtain true class labels for training samples, but in practice there often is no such oracle. In contrast, we consider a "closed" decision-making pipeline in which the same fallible human decision-makers used in deferral also provide training labels. How can imperfect and biased human expert labels be used to train a fair and accurate deferral framework? Our key insight is that by exploiting weak prior information, we can match experts to input examples to ensure fairness and accuracy of the resulting deferral framework, even when imperfect and biased experts are used in place of ground truth labels. The efficacy of our approach is shown both by theoretical analysis and by evaluation on two tasks.

preprint, 2022, arXiv

longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks

Developing methods to adversarially challenge NLP systems is a promising avenue for improving both model performance and interpretability. Here, we describe the approach of the team "longhorns" on Task 1 of the First Workshop on Dynamic Adversarial Data Collection (DADC), which asked teams to manually fool a model on an Extractive Question Answering task. Our team finished first, with a model error rate of 62%. We advocate for a systematic, linguistically informed approach to formulating adversarial questions, and we describe the results of our pilot experiments, as well as our official submission.

preprint, 2022, arXiv

ProtoTEx: Explaining Model Decisions with Prototype Tensors

We present ProtoTEx, a novel white-box NLP classification architecture based on prototype networks. ProtoTEx faithfully explains model decisions based on prototype tensors that encode latent clusters of training examples. At inference time, classification decisions are based on the distances between the input text and the prototype tensors, explained via the training examples most similar to the most influential prototypes. We also describe a novel interleaved training algorithm that effectively handles classes characterized by the absence of indicative features. On a propaganda detection task, ProtoTEx accuracy matches BART-large and exceeds BERT-large with the added benefit of providing faithful explanations. A user study also shows that prototype-based explanations help non-experts to better recognize propaganda in online news.

preprint, 2022, arXiv

Scalable Unidirectional Pareto Optimality for Multi-Task Learning with Constraints

Multi-objective optimization (MOO) problems require balancing competing objectives, often under constraints. The Pareto optimal solution set defines all possible optimal trade-offs over such objectives. In this work, we present a novel method for Pareto-front learning: inducing the full Pareto manifold at train-time so users can pick any desired optimal trade-off point at run-time. Our key insight is to exploit Fritz-John Conditions for a novel guided double gradient descent strategy. Evaluation on synthetic benchmark problems allows us to vary MOO problem difficulty in controlled fashion and measure accuracy vs. known analytic solutions. We further test scalability and generalization in learning optimal neural model parameterizations for Multi-Task Learning (MTL) on image classification. Results show consistent improvement in accuracy and efficiency over prior MTL methods as well as techniques from operations research.

preprint, 2022, arXiv

The Case for Claim Difficulty Assessment in Automatic Fact Checking

Fact-checking is the process of evaluating the veracity of claims (i.e., purported facts). In this opinion piece, we raise an issue that has received little attention in prior work -- that some claims are far more difficult to fact-check than others. We discuss the implications this has for both practical fact-checking and research on automated fact-checking, including task formulation and dataset design. We report a manual analysis undertaken to explore factors underlying varying claim difficulty and identify several distinct types of difficulty. We motivate this new claim difficulty prediction task as beneficial to both automated fact-checking and practical fact-checking organizations.

preprint, 2022, arXiv

The Effects of Interactive AI Design on User Behavior: An Eye-tracking Study of Fact-checking COVID-19 Claims

We conducted a lab-based eye-tracking study to investigate how the interactivity of an AI-powered fact-checking system affects user interactions, such as dwell time, attention, and mental resources involved in using the system. A within-subject experiment was conducted, where participants used an interactive and a non-interactive version of a mock AI fact-checking system and rated their perceived correctness of COVID-19 related claims. We collected web-page interactions, eye-tracking data, and mental workload using NASA-TLX. We found that the presence of the affordance of interactively manipulating the AI system's prediction parameters affected users' dwell times, and eye-fixations on AOIs, but not mental workload. In the interactive system, participants spent the most time evaluating claims' correctness, followed by reading news. This promising result shows a positive role of interactivity in a mixed-initiative AI-powered system.

preprint, 2022, arXiv

Understanding and Predicting Characteristics of Test Collections in Information Retrieval

Research community evaluations in information retrieval, such as NIST's Text REtrieval Conference (TREC), build reusable test collections by pooling document rankings submitted by many teams. Naturally, the quality of the resulting test collection thus greatly depends on the number of participating teams and the quality of their submitted runs. In this work, we investigate: i) how the number of participants, coupled with other factors, affects the quality of a test collection; and ii) whether the quality of a test collection can be inferred prior to collecting relevance judgments from human assessors. Experiments conducted on six TREC collections illustrate how the number of teams interacts with various other factors to influence the resulting quality of test collections. We also show that the reusability of a test collection can be predicted with high accuracy when the same document collection is used for successive years in an evaluation campaign, as is common in TREC.

preprint, 2021, arXiv

A Hybrid 2-stage Neural Optimization for Pareto Front Extraction

Classification, recommendation, and ranking problems often involve competing goals with additional constraints (e.g., to satisfy fairness or diversity criteria). Such optimization problems are quite challenging, often involving non-convex functions along with considerations of user preferences in balancing trade-offs. Pareto solutions represent optimal frontiers for jointly optimizing multiple competing objectives. A major obstacle for frequently used linear-scalarization strategies is that the resulting optimization problem might not always converge to a global optimum. Furthermore, such methods only return one solution point per run. A Pareto solution set is a subset of all such global optima over multiple runs for different trade-off choices. Therefore, a Pareto front can only be guaranteed with multiple runs of the linear-scalarization problem, where all runs converge to their respective global optima. Consequently, extracting a Pareto front for practical problems is computationally intractable with substantial computational overheads, limited scalability, and reduced accuracy. We propose a robust, low cost, two-stage, hybrid neural Pareto optimization approach that is accurate and scales (compute space and time) with data dimensions, as well as number of functions and constraints. The first stage (neural network) efficiently extracts a weak Pareto front, using Fritz-John conditions as the discriminator, with no assumptions of convexity on the objectives or constraints. The second stage (efficient Pareto filter) extracts the strong Pareto optimal subset given the weak front from stage 1. Fritz-John conditions provide us with theoretical bounds on approximation error between the true and network extracted weak Pareto front. Numerical experiments demonstrate the accuracy and efficiency on a canonical set of benchmark problems and a fairness optimization task from prior works.
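The role of a second-stage filter can be illustrated with a plain non-dominated filter over candidate points. The sketch below is a naive quadratic-time version written for this page, assuming all objectives are minimized; it is not the paper's efficient filter, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one.

    Assumes every objective is to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_filter(points):
    """Keep only points that no other candidate dominates (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (1, 4) ties (1, 3) on the first objective but is worse on the second,
# so it is only weakly Pareto optimal and the filter drops it.
candidates = [(1, 3), (2, 2), (3, 1), (2, 3), (1, 4)]
front = pareto_filter(candidates)
```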

preprint, 2021, arXiv

Extracting Optimal Solution Manifolds using Constrained Neural Optimization

Constrained Optimization solution algorithms are restricted to point based solutions. In practice, single or multiple objectives must be satisfied, wherein both the objective function and constraints can be non-convex resulting in multiple optimal solutions. Real world scenarios include intersecting surfaces as Implicit Functions, Hyperspectral Unmixing and Pareto Optimal fronts. Local or global convexification is a common workaround when faced with non-convex forms. However, such an approach is often restricted to a strict class of functions, deviation from which results in a sub-optimal solution to the original problem. We present neural solutions for extracting optimal sets as approximate manifolds, where unmodified, non-convex objectives and constraints are defined as a modeler-guided, domain-informed $L_2$ loss function. This promotes interpretability since modelers can confirm the results against known analytical forms in their specific domains. We present synthetic and realistic cases to validate our approach and compare against known solvers for benchmarking in terms of accuracy and computational efficiency.

preprint, 2020, arXiv

But Who Protects the Moderators? The Case of Crowdsourced Image Moderation

Though detection systems have been developed to identify obscene content such as pornography and violence, artificial intelligence is simply not good enough to fully automate this task yet. Due to the need for manual verification, social media companies may hire internal reviewers, contract specialized workers from third parties, or outsource to online labor markets for the purpose of commercial content moderation. These content moderators are often fully exposed to extreme content and may suffer lasting psychological and emotional damage. In this work, we aim to alleviate this problem by investigating the following question: How can we reveal the minimum amount of information to a human reviewer such that an objectionable image can still be correctly identified? We design and conduct experiments in which blurred graphic and non-graphic images are filtered by human moderators on Amazon Mechanical Turk (AMT). We observe how obfuscation affects the moderation experience with respect to image classification accuracy, interface usability, and worker emotional well-being.

preprint, 2020, arXiv

Efficient Test Collection Construction via Active Learning

To create a new IR test collection at low cost, it is valuable to carefully select which documents merit human relevance judgments. Shared task campaigns such as NIST TREC pool document rankings from many participating systems (and often interactive runs as well) in order to identify the most likely relevant documents for human judging. However, if one's primary goal is merely to build a test collection, it would be useful to be able to do so without needing to run an entire shared task. Toward this end, we investigate multiple active learning strategies which, without reliance on system rankings: 1) select which documents human assessors should judge; and 2) automatically classify the relevance of additional unjudged documents. To assess our approach, we report experiments on five TREC collections with varying scarcity of relevant documents. We report labeling accuracy achieved, as well as rank correlation when evaluating participant systems based upon these labels vs. full pool judgments. Results show the effectiveness of our approach, and we further analyze how varying relevance scarcity across collections impacts our findings. To support reproducibility and follow-on work, we have shared our code online: https://github.com/mdmustafizurrahman/ICTIR_AL_TestCollection_2020/.
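Rank correlation between two evaluations of the same systems can be computed with Kendall's tau, sketched below in its simplest form (tau-a, no tie correction). The abstract does not say which correlation coefficient was used, so the choice of Kendall's tau and the example scores are assumptions made for illustration.

```python
from itertools import combinations

def kendall_tau_a(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical system scores under full pool judgments vs. inferred labels.
full_pool = [0.3, 0.5, 0.7]
inferred = [0.2, 0.6, 0.4]
tau = kendall_tau_a(full_pool, inferred)
```

A value near 1 means the cheaper labels rank systems almost the same way the full judgments do.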

preprint, 2016, arXiv

Active Discriminative Text Representation Learning

We propose a new active learning (AL) method for text classification with convolutional neural networks (CNNs). In AL, one selects the instances to be manually labeled with the aim of maximizing model performance with minimal effort. Neural models capitalize on word embeddings as representations (features), tuning these to the task at hand. We argue that AL strategies for multi-layered neural models should focus on selecting instances that most affect the embedding space (i.e., induce discriminative word representations). This is in contrast to traditional AL approaches (e.g., entropy-based uncertainty sampling), which specify higher level objectives. We propose a simple approach for sentence classification that selects instances containing words whose embeddings are likely to be updated with the greatest magnitude, thereby rapidly learning discriminative, task-specific embeddings. We extend this approach to document classification by jointly considering: (1) the expected changes to the constituent word representations; and (2) the model's current overall uncertainty regarding the instance. The relative emphasis placed on these criteria is governed by a stochastic process that favors selecting instances likely to improve representations at the outset of learning, and then shifts toward general uncertainty sampling as AL progresses. Empirical results show that our method outperforms baseline AL approaches on both sentence and document classification tasks. We also show that, as expected, the method quickly learns discriminative word embeddings. To the best of our knowledge, this is the first work on AL addressing neural models for text classification.

preprint, 2016, arXiv

Crowdsourcing Information Extraction for Biomedical Systematic Reviews

Information extraction is a critical step in the practice of conducting biomedical systematic literature reviews. Extracted structured data can be aggregated via methods such as statistical meta-analysis. Typically highly trained domain experts extract data for systematic reviews. The high expense of conducting biomedical systematic reviews has motivated researchers to explore lower cost methods that achieve similar rigor without compromising quality. Crowdsourcing represents one such promising approach. In this work-in-progress study, we designed a crowdsourcing task for biomedical information extraction. We briefly report the iterative design process and the results of two pilot testings. We found that giving more concrete examples in the task instruction can help workers better understand the task, especially for concepts that are abstract and confusing. We found a few workers completed most of the work, and our payment level appeared more attractive to workers from low-income countries. In the future, we will further evaluate our results with reference to gold standard extractions, thus assessing the feasibility of tasking crowd workers with extracting biomedical intervention information for systematic reviews.

preprint, 2014, arXiv

Bullseye: Structured Passage Retrieval and Document Highlighting for Scholarly Search

We present the Bullseye system for scholarly search. Given a collection of research papers, Bullseye: 1) identifies relevant passages using any off-the-shelf algorithm; 2) automatically detects document structure and restricts retrieved passages to user-specified sections; and 3) highlights those passages for each PDF document retrieved. We evaluate Bullseye with regard to three aspects: system effectiveness, user effectiveness, and user effort. In a system-blind evaluation, users were asked to compare passage retrieval using Bullseye vs. a baseline which ignores document structure, in regard to four types of graded assessments. Results show modest improvement in system effectiveness while both user effectiveness and user effort show substantial improvement. Users also report very strong demand for passage highlighting in scholarly search across both systems considered.

preprint, 2014, arXiv

TurKPF: TurKontrol as a Particle Filter

TurKontrol, an algorithm presented in (Dai et al. 2010), uses a POMDP to model and control an iterative workflow for crowdsourced work. Here, TurKontrol is re-implemented as "TurKPF," which uses a Particle Filter to reduce computation time and memory usage. Most importantly, in our experimental environment with default parameter settings, the action is chosen nearly instantaneously. Through a series of experiments we see that TurKPF and TurKontrol perform similarly.
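For readers unfamiliar with particle filters, the sketch below shows one generic bootstrap update: weight each particle by the likelihood of an observation, then resample. It is not TurKPF's actual model. The latent "quality" in [0, 1], the vote likelihood, and all names are hypothetical choices made only to keep the example small.

```python
import random

def pf_update(particles, observation, likelihood, rng):
    """One bootstrap particle-filter step: weight by likelihood, then resample."""
    weights = [likelihood(observation, p) for p in particles]
    if sum(weights) == 0:
        # Observation is impossible under every particle; keep the old belief.
        return list(particles)
    return rng.choices(particles, weights=weights, k=len(particles))

def vote_likelihood(vote, quality):
    """Hypothetical observation model: a worker votes 1 with probability `quality`."""
    return quality if vote == 1 else 1.0 - quality

rng = random.Random(0)
belief = [rng.random() for _ in range(2000)]  # uniform prior over quality
prior_mean = sum(belief) / len(belief)
for vote in [1, 1, 1, 0, 1]:
    belief = pf_update(belief, vote, vote_likelihood, rng)
posterior_mean = sum(belief) / len(belief)
```

The belief is just a list of samples, so each update costs time linear in the number of particles, which is the kind of saving a particle filter offers over maintaining an exact belief state.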

preprint, 2013, arXiv

Beyond AMT: An Analysis of Crowd Work Platforms

While Amazon's Mechanical Turk (AMT) helped launch the paid crowd work industry eight years ago, many new vendors now offer a range of alternative models. Despite this, little crowd work research has explored other platforms. Such near-exclusive focus risks letting AMT's particular vagaries and limitations overly shape our understanding of crowd work and the research questions and directions being pursued. To address this, we present a cross-platform content analysis of seven crowd work platforms. We begin by reviewing how AMT assumptions and limitations have influenced prior research. Next, we formulate key criteria for characterizing and differentiating crowd work platforms. Our analysis of platforms contrasts them with AMT, informing both methodology of use and directions for future research. Our cross-platform analysis represents the only such study by researchers for researchers, intended to further enrich the diversity of research on crowd work and accelerate progress.

preprint, 2013, arXiv

Crowdsourced Task Routing via Matrix Factorization

We describe methods to predict a crowd worker's accuracy on new tasks based on his accuracy on past tasks. Such prediction provides a foundation for identifying the best workers to route work to in order to maximize accuracy on the new task. Our key insight is to model similarity of past tasks to the target task such that past task accuracies can be optimally integrated to predict target task accuracy. We describe two matrix factorization (MF) approaches from collaborative filtering which not only exploit such task similarity, but are known to be robust to sparse data. Experiments on synthetic and real-world datasets provide feasibility assessment and comparative evaluation of MF approaches vs. two baseline methods. Across a range of data scales and task similarity conditions, we evaluate: 1) prediction error over all workers; and 2) how well each method predicts the best workers to use for each task. Results show the benefit of task routing over random assignment, the strength of probabilistic MF over baseline methods, and the robustness of methods under different conditions.
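The general idea of predicting unobserved worker-task accuracies by matrix factorization can be sketched as follows. This is a basic regularized factorization trained by stochastic gradient descent, not the specific probabilistic MF variants evaluated in the paper; the accuracy values, hyperparameters, and function names are all made up for illustration.

```python
import random

def factorize(observed, n_workers, n_tasks, rank=2, lr=0.05, reg=0.01,
              epochs=2000, seed=0):
    """Fit worker and task factors so their dot products match observed accuracies."""
    rng = random.Random(seed)
    W = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_workers)]
    T = [[rng.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_tasks)]
    for _ in range(epochs):
        for (w, t), acc in observed.items():
            err = acc - sum(a * b for a, b in zip(W[w], T[t]))
            for k in range(rank):
                wk, tk = W[w][k], T[t][k]
                # Gradient step on squared error with L2 regularization.
                W[w][k] += lr * (err * tk - reg * wk)
                T[t][k] += lr * (err * wk - reg * tk)
    return W, T

def predict(W, T, worker, task):
    """Predicted accuracy of `worker` on `task`."""
    return sum(a * b for a, b in zip(W[worker], T[task]))

# Hypothetical (worker, task) -> past accuracy; some pairs are unobserved.
observed = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.5,
            (1, 2): 0.4, (2, 0): 0.9, (2, 2): 0.7}
W, T = factorize(observed, n_workers=3, n_tasks=3)
# Route task 2 to whichever worker has the highest predicted accuracy.
best_worker = max(range(3), key=lambda w: predict(W, T, w, 2))
```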

preprint, 2012, arXiv

Crowdsourcing for Usability Testing

While usability evaluation is critical to designing usable websites, traditional usability testing can be both expensive and time consuming. The advent of crowdsourcing platforms such as Amazon Mechanical Turk and CrowdFlower offers an intriguing new avenue for performing remote usability testing with potentially many users, quick turn-around, and significant cost savings. To investigate the potential of such crowdsourced usability testing, we conducted two similar (though not completely parallel) usability studies which evaluated a graduate school's website: one via a traditional usability lab setting, and the other using crowdsourcing. While we find crowdsourcing exhibits some notable limitations in comparison to the traditional lab environment, its applicability and value for usability testing are clearly evidenced. We discuss both methodological differences for crowdsourced usability testing, as well as empirical contrasts to results from more traditional, face-to-face usability testing.

preprint, 2012, arXiv

Dating Texts without Explicit Temporal Cues

This paper tackles temporal resolution of documents, such as determining when a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous works, we rely solely on temporal cues implicit in the text. We consider both document-likelihood and divergence based techniques and several smoothing methods for both of them. Our best model predicts the mid-point of individuals' lives with a median error of 22 years and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present day. We also show that this approach works well when training on such biographies and predicting dates both for non-biographical Wikipedia pages about specific years (500 B.C. to 2010 A.D.) and for publication dates of short stories (1798 to 2008). Together, our work shows that, even in the absence of temporal extraction resources, it is possible to achieve remarkable temporal locality across a diverse set of texts.
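The document-likelihood idea can be sketched with one unigram language model per time bin: a text is assigned to the bin whose model gives it the highest log-likelihood. The sketch assumes add-one (Laplace) smoothing and whitespace tokenization, neither of which is confirmed by the abstract, and the toy training texts are invented.

```python
import math
from collections import Counter

def train_bins(docs_by_bin):
    """Build unigram counts per time bin and the shared vocabulary size."""
    vocab = {w for docs in docs_by_bin.values() for d in docs for w in d.split()}
    models = {}
    for time_bin, docs in docs_by_bin.items():
        counts = Counter(w for d in docs for w in d.split())
        models[time_bin] = (counts, sum(counts.values()))
    return models, len(vocab)

def predict_bin(text, models, vocab_size):
    """Return the time bin whose smoothed unigram model best explains the text."""
    def log_likelihood(time_bin):
        counts, total = models[time_bin]
        # Add-one smoothing; the extra +1 reserves mass for unseen words.
        return sum(math.log((counts[w] + 1) / (total + vocab_size + 1))
                   for w in text.split())
    return max(models, key=log_likelihood)

# Invented toy corpus with two time bins.
training = {1800: ["carriage horse telegraph"], 2000: ["internet phone computer"]}
models, vocab_size = train_bins(training)
guess = predict_bin("horse carriage", models, vocab_size)
```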

preprint, 2012, arXiv

Evaluating Classifiers Without Expert Labels

This paper considers the challenge of evaluating a set of classifiers, as done in shared task evaluations like the KDD Cup or NIST TREC, without expert labels. While expert labels provide the traditional cornerstone for evaluating statistical learners, limited or expensive access to experts represents a practical bottleneck. Instead, we seek methodology for estimating performance of the classifiers which is more scalable than expert labeling yet preserves high correlation with evaluation based on expert labels. We consider both: 1) using only labels automatically generated by the classifiers (blind evaluation); and 2) using labels obtained via crowdsourcing. While crowdsourcing methods are lauded for scalability, using such data for evaluation raises serious concerns given the prevalence of label noise. In regard to blind evaluation, two broad strategies are investigated: combine & score and score & combine. Combine & score methods infer a single pseudo-gold label set by aggregating classifier labels; classifiers are then evaluated based on this single pseudo-gold label set. On the other hand, score & combine methods: 1) sample multiple label sets from classifier outputs, 2) evaluate classifiers on each label set, and 3) average classifier performance across label sets. When additional crowd labels are also collected, we investigate two alternative avenues for exploiting them: 1) direct evaluation of classifiers; or 2) supervision of combine & score methods. To assess generality of our techniques, classifier performance is measured using four common classification metrics, with statistical significance tests. Finally, we measure both score and rank correlations between estimated classifier performance vs. actual performance according to expert judgments. Rigorous evaluation of classifiers from the TREC 2011 Crowdsourcing Track shows reliable evaluation can be achieved without reliance on expert labels.
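The two blind-evaluation strategies can be contrasted in a few lines. In this sketch, combine & score uses a simple majority vote as the aggregator, and score & combine samples each item's label from a randomly chosen classifier. Both choices, along with accuracy as the metric and the toy outputs, are assumptions for illustration rather than the paper's exact procedures.

```python
import random
from collections import Counter

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def combine_and_score(outputs):
    """Aggregate labels into one pseudo-gold set (majority vote), then score."""
    n_items = len(next(iter(outputs.values())))
    pseudo_gold = [
        Counter(labels[i] for labels in outputs.values()).most_common(1)[0][0]
        for i in range(n_items)
    ]
    return {name: accuracy(labels, pseudo_gold) for name, labels in outputs.items()}

def score_and_combine(outputs, n_samples=100, seed=0):
    """Sample many label sets, score on each, and average the scores."""
    rng = random.Random(seed)
    names = list(outputs)
    n_items = len(outputs[names[0]])
    totals = dict.fromkeys(names, 0.0)
    for _ in range(n_samples):
        # Each item's label is drawn from a randomly chosen classifier.
        sampled = [outputs[rng.choice(names)][i] for i in range(n_items)]
        for name in names:
            totals[name] += accuracy(outputs[name], sampled)
    return {name: totals[name] / n_samples for name in names}

# Invented binary outputs from three classifiers on four items.
outputs = {"a": [1, 1, 0, 0], "b": [1, 1, 0, 1], "c": [1, 0, 0, 0]}
cs_scores = combine_and_score(outputs)
sc_scores = score_and_combine(outputs)
```

In both cases a classifier scores well by agreeing with its peers, which is why the abstract checks the estimates against expert-based evaluation.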