Source author record

Matthew McDermott

Matthew McDermott appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Computation and Language Computer Vision cs.CY Databases Digital Libraries Information Retrieval Robotics

Catalog footprint

What is connected

3works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.

preprint2022arXiv

Enhanced Laser-Scan Matching with Online Error Estimation for Highway and Tunnel Driving

Lidar data can be used to generate point clouds for the navigation of autonomous vehicles or mobile robotics platforms. Scan matching, the process of estimating the rigid transformation that best aligns two point clouds, is the basis for lidar odometry, a form of dead reckoning. Lidar odometry is particularly useful when absolute sensors, like GPS, are not available. Here we propose the Iterative Closest Ellipsoidal Transform (ICET), a scan matching algorithm which provides two novel improvements over the current state-of-the-art Normal Distributions Transform (NDT). Like NDT, ICET decomposes lidar data into voxels and fits a Gaussian distribution to the points within each voxel. The first innovation of ICET reduces geometric ambiguity along large flat surfaces by suppressing the solution along those directions. The second innovation of ICET is to infer the output error covariance associated with the position and orientation transformation between successive point clouds; the error covariance is particularly useful when ICET is incorporated into a state-estimation routine such as an extended Kalman filter. We constructed a simulation to compare the performance of ICET and NDT in 2D space both with and without geometric ambiguity and found that ICET produces superior estimates while accurately predicting solution accuracy.

preprint2020arXiv

Hurtful Words: Quantifying Biases in Clinical Contextual Word Embeddings

In this work, we examine the extent to which embeddings may encode marginalized populations differently, and how this may lead to a perpetuation of biases and worsened performance on clinical tasks. We pretrain deep embedding models (BERT) on medical notes from the MIMIC-III hospital dataset, and quantify potential disparities using two approaches. First, we identify dangerous latent relationships that are captured by the contextual word embeddings using a fill-in-the-blank method with text from real clinical notes and a log probability bias score quantification. Second, we evaluate performance gaps across different definitions of fairness on over 50 downstream clinical prediction tasks that include detection of acute and chronic conditions. We find that classifiers trained from BERT representations exhibit statistically significant differences in performance, often favoring the majority group with regards to gender, language, ethnicity, and insurance status. Finally, we explore shortcomings of using adversarial debiasing to obfuscate subgroup information in contextual word embeddings, and recommend best practices for such deep embedding models in clinical settings.