Researcher profile

Daniel McDuff

Daniel McDuff contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2026arXiv

Evaluating the Utility of Personal Health Records in Personalized Health AI

Patient-managed Personal Health Records (PHRs) promises to empower patients to better understand their health; but information in the record is complex, potentially hindering insights. In this study, we assess the potential of large language models (LLMs, Gemini 3.0 Flash) to provide helpful answers to user health queries, when provided clinical data from PHRs as context. A total of 2,257 user queries were drawn from 3 different distributions to represent patient questions: shorter web search queries, longer questions derived from templates of chatbot conversations, and questions patients asked to their healthcare team (patient calls). Queries were matched with de-identified PHRs (from a pool of 1,945). Gemini responses were generated (1) without PHR context; (2) with a basic summary of demographics, conditions, and medications; (3) with full, extensive clinical notes. For evaluation, we leveraged an existing rating framework (SHARP), and developed a new framework for specific error modes when interpreting PHRs. Evaluation was performed using autoraters for the full set, and with clinician ratings for a subset (n=95), with both sets of raters knowing the full PHR context. We see significant improvements in the helpfulness of answers to all question types with PHR data (p < 0.001, paired t-test). We also observe potential gains in safety, accuracy, relevance and personalization of answers. Our PHR evaluation framework further identifies gaps in LLM understanding of particular aspects of complex PHRs, such as temporal disorientation, and rare but meaningful confabulations. These results suggest potential for PHR data to help people with a wide range of user needs; and provide a framework for monitoring for gaps in LLM answers based on PHR context. This study motivates further work to assess and realize potential benefits to users from understanding their health records.

preprint2026arXiv

SymptomAI: Toward a Conversational AI Agent for Everyday Symptom Assessment

Language models excel at diagnostic assessments on curated medical case-studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.56, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies which conduct a dedicated symptom interview that elicit additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.

preprint2026arXiv

TeamBench: Evaluating Agent Coordination under Enforced Role Separation

Agent systems often decompose a task across multiple roles, but these roles are typically specified by prompts rather than enforced by access controls. Without enforcement, a team pass rate can mask whether agents actually coordinated or whether one role effectively did another role's work. We present TeamBench, a benchmark with 851 task templates and 931 seeded instances for evaluating agent coordination under operating system-enforced role separation. TeamBench separates specification access, workspace editing, and final certification across Planner, Executor, and Verifier roles, so that no role can read the full requirements, modify the workspace, and certify the final answer. Prompt-only and sandbox-enforced teams reach statistically indistinguishable pass rates, but prompt-only runs produce 3.6 times more cases where the verifier attempts to edit the executor's code. Verifiers approve 49% of submissions that fail the deterministic grader, and removing the verifier improves mean partial score in the ablation. Team value is also conditional. Teams benefit when single agents struggle, but hurt when single agents already perform well. A 40-session human study under the same role separation shows that our benchmark exposes interaction patterns that pass rate misses. Solo participants work through the task directly, human participants paired with agents often collapse into quick approval, and human teams spend more effort coordinating missing information across roles.

preprint2026arXiv

Uneven Evolution of Cognition Across Generations of Generative AI Models

The pursuit of artificial general intelligence necessitates robust methods for evaluating the cognitive capabilities of models beyond narrow task performance. Here, we introduce a psychometric framework to assess the cognitive profiles of generative AI, comparing them to human norms and tracking their evolution across generations. Initial evaluation of leading multimodal models using tasks adapted from the Wechsler Adult Intelligence Scale revealed a profoundly uneven cognitive architecture: near-ceiling performance in verbal comprehension and working memory (>$98^{\text{th}}$ percentile) contrasted with near-floor performance in perceptual reasoning (<$1^{\text{st}}$ percentile). To track developmental trajectories beyond human-normed limits, we developed the Artificial Intelligence Quotient (AIQ) Benchmark and applied it to six generations and two model families, revealing significant but asymmetric performance gains. Notably, we uncovered a sharp dissociation between modalities; abstract quantitative reasoning matured far more rapidly when presented linguistically compared to a visually analogous format, indicating an architectural bias towards language-based symbolic manipulation. While abstract visual reasoning improved, visual-perceptual organization remained largely stagnant. Collectively, these findings demonstrate that the cognitive abilities of generative models are evolving unevenly, suggesting that scaling and optimization approaches to AGI development alone may be insufficient to overcome fundamental architectural limitations in achieving balanced, human-like general intelligence.

preprint2026arXiv

WavesFM: Hierarchical Representation Learning for Longitudinal Wearable Sensor Waveforms

Wearable sensors enable the continuous acquisition of high-resolution physiological waveforms, such as photoplethysmography and accelerometry, under free-living conditions. However, inferring health-related phenotypes from these signals presents significant challenges due to high sampling frequencies, multimodal dependencies, and extreme sequence lengths (e.g., weeks of recordings), compounded by a scarcity of ground-truth labels. To address these challenges, existing self-supervised learning (SSL) methodologies typically follow two paradigms: (1) learning rich morphological representations from short waveform segments while collapsing longitudinal dynamics through simple aggregation, or (2) modeling behavioral patterns from coarse, hand-crafted features (e.g. heart rate, step counts) spanning longer horizons but foregoing subtle, predictive signatures in raw waveforms. To bridge this gap, we propose WavesFM, a foundation model utilizing a two-stage SSL framework for longitudinal physiological data. Specifically, we decompose the learning problem into two stages: first, a segment-level encoder is pretrained to extract local embeddings from short waveforms; subsequently, a temporal encoder is trained to model the sequence of these embeddings across a multi-day horizon. This hierarchical approach overcomes the computational complexity of high-resolution, long-sequence data, allowing the overall model to capture both local signal semantics and the complex circadian and inter-day variations governing physiological dynamics. Pretrained on over 6.8M hours (N=324k individuals) of recordings for the first stage and 5.3M hours (N=10k) for the second stage, WavesFM demonstrates superior performance across 58 diverse tasks spanning demographics, lifestyle, health conditions, and medications.

preprint2022arXiv

COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems

Learning representations that generalize across tasks and domains is challenging yet necessary for autonomous systems. Although task-driven approaches are appealing, designing models specific to each application can be difficult in the face of limited data, especially when dealing with highly variable multimodal input spaces arising from different tasks in different environments.We introduce the first general-purpose pretraining pipeline, COntrastive Multimodal Pretraining for AutonomouS Systems (COMPASS), to overcome the limitations of task-specific models and existing pretraining approaches. COMPASS constructs a multimodal graph by considering the essential information for autonomous systems and the properties of different modalities. Through this graph, multimodal signals are connected and mapped into two factorized spatio-temporal latent spaces: a &#34;motion pattern space&#34; and a &#34;current state space.&#34; By learning from multimodal correspondences in each latent space, COMPASS creates state representations that models necessary information such as temporal dynamics, geometry, and semantics. We pretrain COMPASS on a large-scale multimodal simulation dataset TartanAir \cite{tartanair2020iros} and evaluate it on drone navigation, vehicle racing, and visual odometry tasks. The experiments indicate that COMPASS can tackle all three scenarios and can also generalize to unseen environments and real-world data.

preprint2022arXiv

DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents

Creating presentation materials requires complex multimodal reasoning skills to summarize key concepts and arrange them in a logical and visually pleasing manner. Can machines learn to emulate this laborious process? We present a novel task and approach for document-to-slide generation. Solving this involves document summarization, image and text retrieval, slide structure and layout prediction to arrange key elements in a form suitable for presentation. We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner. Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides. To help accelerate research in this domain, we release a dataset about 6K paired documents and slide decks used in our experiments. We show that our approach outperforms strong baselines and produces slides with rich content and aligned imagery.

preprint2022arXiv

Federated Remote Physiological Measurement with Imperfect Data

The growing need for technology that supports remote healthcare is being acutely highlighted by an aging population and the COVID-19 pandemic. In health-related machine learning applications the ability to learn predictive models without data leaving a private device is attractive, especially when these data might contain features (e.g., photographs or videos of the body) that make identifying a subject trivial and/or the training data volume is large (e.g., uncompressed video). Camera-based remote physiological sensing facilitates scalable and low-cost measurement, but is a prime example of a task that involves analysing high bit-rate videos containing identifiable images and sensitive health information. Federated learning enables privacy-preserving decentralized training which has several properties beneficial for camera-based sensing. We develop the first mobile federated learning camera-based sensing system and show that it can perform competitively with traditional state-of-the-art supervised approaches. However, in the presence of corrupted data (e.g., video or label noise) from a few devices the performance of weight averaging quickly degrades. To address this, we leverage knowledge about the expected noise profile within the video to intelligently adjust how the model weights are averaged on the server. Our results show that this significantly improves upon the robustness of models even when the signal-to-noise ratio is low

preprint2022arXiv

Learning Higher-Order Dynamics in Video-Based Cardiac Measurement

Computer vision methods typically optimize for first-order dynamics (e.g., optical flow). However, in many cases the properties of interest are subtle variations in higher-order changes, such as acceleration. This is true in the cardiac pulse, where the second derivative can be used as an indicator of blood pressure and arterial disease. Recent developments in camera-based vital sign measurement have shown that cardiac measurements can be recovered with impressive accuracy from videos; however, most of the research has focused on extracting summary statistics such as heart rate. Less emphasis has been put on the accuracy of waveform morphology that is necessary for many clinically meaningful assessments. In this work, we provide evidence that higher-order dynamics are better estimated by neural models when explicitly optimized for in the loss function. Furthermore, adding second-derivative inputs also improves performance when estimating second-order dynamics. We illustrate this, by showing that incorporating the second derivative of both the input frames and the target vital sign signals into the training procedure, models are better able to estimate left ventricle ejection time (LVET) intervals.

preprint2022arXiv

MobilePhys: Personalized Mobile Camera-Based Contactless Physiological Sensing

Camera-based contactless photoplethysmography refers to a set of popular techniques for contactless physiological measurement. The current state-of-the-art neural models are typically trained in a supervised manner using videos accompanied by gold standard physiological measurements. However, they often generalize poorly out-of-domain examples (i.e., videos that are unlike those in the training set). Personalizing models can help improve model generalizability, but many personalization techniques still require some gold standard data. To help alleviate this dependency, in this paper, we present a novel mobile sensing system called MobilePhys, the first mobile personalized remote physiological sensing system, that leverages both front and rear cameras on a smartphone to generate high-quality self-supervised labels for training personalized contactless camera-based PPG models. To evaluate the robustness of MobilePhys, we conducted a user study with 39 participants who completed a set of tasks under different mobile devices, lighting conditions/intensities, motion tasks, and skin types. Our results show that MobilePhys significantly outperforms the state-of-the-art on-device supervised training and few-shot adaptation methods. Through extensive user studies, we further examine how does MobilePhys perform in complex real-world settings. We envision that calibrated or personalized camera-based contactless PPG models generated from our proposed dual-camera mobile sensing system will open the door for numerous future applications such as smart mirrors, fitness and mobile health applications.

preprint2022arXiv

SCAMPS: Synthetics for Camera Measurement of Physiological Signals

The use of cameras and computational algorithms for noninvasive, low-cost and scalable measurement of physiological (e.g., cardiac and pulmonary) vital signs is very attractive. However, diverse data representing a range of environments, body motions, illumination conditions and physiological states is laborious, time consuming and expensive to obtain. Synthetic data have proven a valuable tool in several areas of machine learning, yet are not widely available for camera measurement of physiological states. Synthetic data offer &#34;perfect&#34; labels (e.g., without noise and with precise synchronization), labels that may not be possible to obtain otherwise (e.g., precise pixel level segmentation maps) and provide a high degree of control over variation and diversity in the dataset. We present SCAMPS, a dataset of synthetics containing 2,800 videos (1.68M frames) with aligned cardiac and respiratory signals and facial action intensities. The RGB frames are provided alongside segmentation maps. We provide precise descriptive statistics about the underlying waveforms, including inter-beat interval, heart rate variability, and pulse arrival time. Finally, we present baseline results training on these synthetic data and testing on real-world datasets to illustrate generalizability.

preprint2021arXiv

AffectiveSpotlight: Facilitating the Communication of Affective Responses from Audience Members during Online Presentations

The ability to monitor audience reactions is critical when delivering presentations. However, current videoconferencing platforms offer limited solutions to support this. This work leverages recent advances in affect sensing to capture and facilitate communication of relevant audience signals. Using an exploratory survey (N = 175), we assessed the most relevant audience responses such as confusion, engagement, and head-nods. We then implemented AffectiveSpotlight, a Microsoft Teams bot that analyzes facial responses and head gestures of audience members and dynamically spotlights the most expressive ones. In a within-subjects study with 14 groups (N = 117), we observed that the system made presenters significantly more aware of their audience, speak for a longer period of time, and self-assess the quality of their talk more similarly to the audience members, compared to two control conditions (randomly-selected spotlight and default platform UI). We provide design recommendations for future affective interfaces for online presentations based on feedback from the study.

preprint2021arXiv

MetaPhys: Few-Shot Adaptation for Non-Contact Physiological Measurement

There are large individual differences in physiological processes, making designing personalized health sensing algorithms challenging. Existing machine learning systems struggle to generalize well to unseen subjects or contexts and can often contain problematic biases. Video-based physiological measurement is not an exception. Therefore, learning personalized or customized models from a small number of unlabeled samples is very attractive as it would allow fast calibrations to improve generalization and help correct biases. In this paper, we present a novel meta-learning approach called MetaPhys for personalized video-based cardiac measurement for contactless pulse and heart rate monitoring. Our method uses only 18-seconds of video for customization and works effectively in both supervised and unsupervised manners. We evaluate our proposed approach on two benchmark datasets and demonstrate superior performance in cross-dataset evaluation with substantial reductions (42% to 44%) in errors compared with state-of-the-art approaches. We have also demonstrated our proposed method significantly helps reduce the bias in skin type.

preprint2021arXiv

Multi-Task Temporal Shift Attention Networks for On-Device Contactless Vitals Measurement

Telehealth and remote health monitoring have become increasingly important during the SARS-CoV-2 pandemic and it is widely expected that this will have a lasting impact on healthcare practices. These tools can help reduce the risk of exposing patients and medical staff to infection, make healthcare services more accessible, and allow providers to see more patients. However, objective measurement of vital signs is challenging without direct contact with a patient. We present a video-based and on-device optical cardiopulmonary vital sign measurement approach. It leverages a novel multi-task temporal shift convolutional attention network (MTTS-CAN) and enables real-time cardiovascular and respiratory measurements on mobile platforms. We evaluate our system on an Advanced RISC Machine (ARM) CPU and achieve state-of-the-art accuracy while running at over 150 frames per second which enables real-time applications. Systematic experimentation on large benchmark datasets reveals that our approach leads to substantial (20%-50%) reductions in error and generalizes well across datasets.