Source author record

Jeffrey P. Bigham

Jeffrey P. Bigham appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Human-Computer Interaction Computation and Language eess.AS Sound Artificial Intelligence Machine Learning physics.soc-ph Social and Information Networks

Catalog footprint

What is connected

12works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agreement. Rubrics that ask for an overall or \emph{holistic} judgment - for example, rating the ``quality'' of an essay - may be inconsistently interpreted due to the complexity or subjectivity of the criteria. Conversely, rubrics can ask for \emph{analytic} judgments, which decompose assessment criteria - for example, ``quality'' into ``fluency'' and ``organization''. While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment. Designing and deploying reliable autoraters requires understanding not just the relationship between human and autorater annotations but how that relationship changes as holistic or analytic judgments are elicited. The results indicate that rubric edits providing representative examples and additional context, and reducing positional bias in the rubric increased human-autorater agreement, while higher rubric complexity and conservative aggregation methods tended to decrease it. The findings from the automatic essay scoring and instruction-following evaluation domains suggest that practitioners should carefully analyze domain- and rubric-specific performance to move towards higher human-autorater agreement.

preprint2023arXiv

Screen Correspondence: Mapping Interchangeable Elements between UIs

Understanding user interface (UI) functionality is a useful yet challenging task for both machines and people. In this paper, we investigate a machine learning approach for screen correspondence, which allows reasoning about UIs by mapping their elements onto previously encountered examples with known functionality and properties. We describe and implement a model that incorporates element semantics, appearance, and text to support correspondence computation without requiring any labeled examples. Through a comprehensive performance evaluation, we show that our approach improves upon baselines by incorporating multi-modal properties of UIs. Finally, we show three example applications where screen correspondence facilitates better UI understanding for humans and machines: (i) instructional overlay generation, (ii) semantic UI element search, and (iii) automated interface testing.

preprint2022arXiv

Nonverbal Sound Detection for Disordered Speech

Voice assistants have become an essential tool for people with various disabilities because they enable complex phone- or tablet-based interactions without the need for fine-grained motor control, such as with touchscreens. However, these systems are not tuned for the unique characteristics of individuals with speech disorders, including many of those who have a motor-speech disorder, are deaf or hard of hearing, have a severe stutter, or are minimally verbal. We introduce an alternative voice-based input system which relies on sound event detection using fifteen nonverbal mouth sounds like "pop," "click," or "eh." This system was designed to work regardless of ones' speech abilities and allows full access to existing technology. In this paper, we describe the design of a dataset, model considerations for real-world deployment, and efforts towards model personalization. Our fully-supervised model achieves segment-level precision and recall of 88.6% and 88.4% on an internal dataset of 710 adults, while achieving 0.31 false positives per hour on aggressors such as speech. Five-shot personalization enables satisfactory performance in 84.5% of cases where the generic model fails.

preprint2022arXiv

Reflow: Automatically Improving Touch Interactions in Mobile Applications through Pixel-based Refinements

Touch is the primary way that users interact with smartphones. However, building mobile user interfaces where touch interactions work well for all users is a difficult problem, because users have different abilities and preferences. We propose a system, Reflow, which automatically applies small, personalized UI adaptations, called refinements -- to mobile app screens to improve touch efficiency. Reflow uses a pixel-based strategy to work with existing applications, and improves touch efficiency while minimally disrupting the design intent of the original application. Our system optimizes a UI by (i) extracting its layout from its screenshot, (ii) refining its layout, and (iii) re-rendering the UI to reflect these modifications. We conducted a user study with 10 participants and a heuristic evaluation with 6 experts and found that applications optimized by Reflow led to, on average, 9% faster selection time with minimal layout disruption. The results demonstrate that Reflow's refinements useful UI adaptations to improve touch interactions.

preprint2022arXiv

Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation

Target-guided response generation enables dialogue systems to smoothly transition a conversation from a dialogue context toward a target sentence. Such control is useful for designing dialogue systems that direct a conversation toward specific goals, such as creating non-obtrusive recommendations or introducing new topics in the conversation. In this paper, we introduce a new technique for target-guided response generation, which first finds a bridging path of commonsense knowledge concepts between the source and the target, and then uses the identified bridging path to generate transition responses. Additionally, we propose techniques to re-purpose existing dialogue datasets for target-guided generation. Experiments reveal that the proposed techniques outperform various baselines on this task. Finally, we observe that the existing automated metrics for this task correlate poorly with human judgement ratings. We propose a novel evaluation metric that we demonstrate is more reliable for target-guided response evaluation. Our work generally enables dialogue system designers to exercise more control over the conversations that their systems produce.

preprint2021arXiv

Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels

Many accessibility features available on mobile platforms require applications (apps) to provide complete and accurate metadata describing user interface (UI) components. Unfortunately, many apps do not provide sufficient metadata for accessibility features to work as expected. In this paper, we explore inferring accessibility metadata for mobile apps from their pixels, as the visual interfaces often best reflect an app's full functionality. We trained a robust, fast, memory-efficient, on-device model to detect UI elements using a dataset of 77,637 screens (from 4,068 iPhone apps) that we collected and annotated. To further improve UI detections and add semantic information, we introduced heuristics (e.g., UI grouping and ordering) and additional models (e.g., recognize UI content, state, interactivity). We built Screen Recognition to generate accessibility metadata to augment iOS VoiceOver. In a study with 9 screen reader users, we validated that our approach improves the accessibility of existing mobile apps, enabling even previously inaccessible apps to be used.

preprint2021arXiv

SEP-28k: A Dataset for Stuttering Event Detection From Podcasts With People Who Stutter

The ability to automatically detect stuttering events in speech could help speech pathologists track an individual's fluency over time or help improve speech recognition systems for people with atypical speech patterns. Despite increasing interest in this area, existing public datasets are too small to build generalizable dysfluency detection systems and lack sufficient annotations. In this work, we introduce Stuttering Events in Podcasts (SEP-28k), a dataset containing over 28k clips labeled with five event types including blocks, prolongations, sound repetitions, word repetitions, and interjections. Audio comes from public podcasts largely consisting of people who stutter interviewing other people who stutter. We benchmark a set of acoustic models on SEP-28k and the public FluencyBank dataset and highlight how simply increasing the amount of training data improves relative detection performance by 28\% and 24\% F1 on each. Annotations from over 32k clips across both datasets will be publicly released.

preprint2020arXiv

Extracting Structured Data from Physician-Patient Conversations By Predicting Noteworthy Utterances

Despite diverse efforts to mine various modalities of medical data, the conversations between physicians and patients at the time of care remain an untapped source of insights. In this paper, we leverage this data to extract structured information that might assist physicians with post-visit documentation in electronic health records, potentially lightening the clerical burden. In this exploratory study, we describe a new dataset consisting of conversation transcripts, post-visit summaries, corresponding supporting evidence (in the transcript), and structured labels. We focus on the tasks of recognizing relevant diagnoses and abnormalities in the review of organ systems (RoS). One methodological challenge is that the conversations are long (around 1500 words), making it difficult for modern deep-learning models to use them as input. To address this challenge, we extract noteworthy utterances---parts of the conversation likely to be cited as evidence supporting some summary sentence. We find that by first filtering for (predicted) noteworthy utterances, we can significantly boost predictive performance for recognizing both diagnoses and RoS abnormalities.

preprint2019arXiv

Predicting risk of dyslexia with an online gamified test

Dyslexia is a specific learning disorder related to school failure. Detection is both crucial and challenging, especially in languages with transparent orthographies, such as Spanish. To make detecting dyslexia easier, we designed an online gamified test and a predictive machine learning model. In a study with more than 3,600 participants, our model correctly detected over 80% of the participants with dyslexia. To check the robustness of the method we tested our method using a new data set with over 1,300 participants with age customized tests in a different environment -- a tablet instead of a desktop computer -- reaching a recall of over 72% for the class with dyslexia for children 9 years old or older. Our work shows that dyslexia can be screened using a machine learning approach. An online screening tool based on our methods has already been used by more than 200,000 people.

preprint2015arXiv

WearWrite: Orchestrating the Crowd to Complete Complex Tasks from Wearables (We Wrote This Paper on a Watch)

In this paper we introduce a paradigm for completing complex tasks from wearable devices by leveraging crowdsourcing, and demonstrate its validity for academic writing. We explore this paradigm using a collaborative authoring system, called WearWrite, which is designed to enable authors and crowd workers to work together using an Android smartwatch and Google Docs to produce academic papers, including this one. WearWrite allows expert authors who do not have access to large devices to contribute bits of expertise and big picture direction from their watch, while freeing them of the obligation of integrating their contributions into the overall document. Crowd workers on desktop computers actually write the document. We used this approach to write several simple papers, and found it was effective at producing reasonable drafts. However, the workers often needed more structure and the authors more context. WearWrite addresses these issues by focusing workers on specific tasks and providing select context to authors on the watch. We demonstrate the system's feasibility by writing this paper using it.

preprint2014arXiv

Tuning the Diversity of Open-Ended Responses from the Crowd

Crowdsourcing can solve problems that current fully automated systems cannot. Its effectiveness depends on the reliability, accuracy, and speed of the crowd workers that drive it. These objectives are frequently at odds with one another. For instance, how much time should workers be given to discover and propose new solutions versus deliberate over those currently proposed? How do we determine if discovering a new answer is appropriate at all? And how do we manage workers who lack the expertise or attention needed to provide useful input to a given task? We present a mechanism that uses distinct payoffs for three possible worker actions---propose,vote, or abstain---to provide workers with the necessary incentives to guarantee an effective (or even optimal) balance between searching for new answers, assessing those currently available, and, when they have insufficient expertise or insight for the task at hand, abstaining. We provide a novel game theoretic analysis for this mechanism and test it experimentally on an image---labeling problem and show that it allows a system to reliably control the balance betweendiscovering new answers and converging to existing ones.

preprint2012arXiv

Crowd Memory: Learning in the Collective

Crowd algorithms often assume workers are inexperienced and thus fail to adapt as workers in the crowd learn a task. These assumptions fundamentally limit the types of tasks that systems based on such algorithms can handle. This paper explores how the crowd learns and remembers over time in the context of human computation, and how more realistic assumptions of worker experience may be used when designing new systems. We first demonstrate that the crowd can recall information over time and discuss possible implications of crowd memory in the design of crowd algorithms. We then explore crowd learning during a continuous control task. Recent systems are able to disguise dynamic groups of workers as crowd agents to support continuous tasks, but have not yet considered how such agents are able to learn over time. We show, using a real-time gaming setting, that crowd agents can learn over time, and `remember' by passing strategies from one generation of workers to the next, despite high turnover rates in the workers comprising them. We conclude with a discussion of future research directions for crowd memory and learning.

Jeffrey P. Bigham

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

Screen Correspondence: Mapping Interchangeable Elements between UIs

Nonverbal Sound Detection for Disordered Speech

Reflow: Automatically Improving Touch Interactions in Mobile Applications through Pixel-based Refinements

Target-Guided Dialogue Response Generation Using Commonsense and Data Augmentation

Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels

SEP-28k: A Dataset for Stuttering Event Detection From Podcasts With People Who Stutter

Extracting Structured Data from Physician-Patient Conversations By Predicting Noteworthy Utterances

Predicting risk of dyslexia with an online gamified test

WearWrite: Orchestrating the Crowd to Complete Complex Tasks from Wearables (We Wrote This Paper on a Watch)

Tuning the Diversity of Open-Ended Responses from the Crowd

Crowd Memory: Learning in the Collective