Researcher profile

Oisin Mac Aodha

Oisin Mac Aodha contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2026arXiv

Deep learning-based ecological analysis of camera trap images is impacted by training data quality and quantity

Large image collections generated from camera traps offer valuable insights into species richness, occupancy, and activity patterns, significantly aiding biodiversity monitoring. However, the manual processing of these datasets is time-consuming, hindering analytical processes. To address this, deep neural networks have been widely adopted to automate image labelling, but the impact of classification error on key ecological metrics remains unclear. Here, we analyse data from camera trap collections in an African savannah (82,300 labelled images, 47 species) and an Asian sub-tropical dry forest (40,308 labelled images, 29 species) to compare ecological metrics derived from expert-generated species identifications with those generated by deep learning classification models. We specifically assess the impact of deep learning model architecture, proportion of label noise in the training data, and the size of the training dataset on three key ecological metrics: species richness, occupancy, and activity patterns. We found that predictions of species richness derived from deep neural networks closely match those calculated from expert labels and remained resilient to up to 10% noise in the training dataset (mis-labelled images) and a 50% reduction in the training dataset size. We found that our choice of deep learning model architecture (ResNet vs ConvNext-T) or depth (ResNet18, 50, 101) did not impact predicted ecological metrics. In contrast, species-specific metrics were more sensitive; less common and visually similar species were disproportionately affected by a reduction in deep neural network accuracy, with consequences for occupancy and diel activity pattern estimates. To ensure the reliability of their findings, practitioners should prioritize creating large, clean training sets and account for class imbalance across species over exploring numerous deep learning model architectures.

preprint2026arXiv

MotionPhysics: Learnable Motion Distillation for Text-Guided Simulation

Accurately simulating existing 3D objects and a wide variety of materials often demands expert knowledge and time-consuming physical parameter tuning to achieve the desired dynamic behavior. We introduce MotionPhysics, an end-to-end differentiable framework that infers plausible physical parameters from a user-provided natural language prompt for a chosen 3D scene of interest, removing the need for guidance from ground-truth trajectories or annotated videos. Our approach first utilizes a multimodal large language model to estimate material parameter values, which are constrained to lie within plausible ranges. We further propose a learnable motion distillation loss that extracts robust motion priors from pretrained video diffusion models while minimizing appearance and geometry inductive biases to guide the simulation. We evaluate MotionPhysics across more than thirty scenarios, including real-world, human-designed, and AI-generated 3D objects, spanning a wide range of materials such as elastic solids, metals, foams, sand, and both Newtonian and non-Newtonian fluids. We demonstrate that MotionPhysics produces visually realistic dynamic simulations guided by natural language, surpassing the state of the art while automatically determining physically plausible parameters. The code and project page are available at: https://wangmiaowei.github.io/MotionPhysics.github.io/.

preprint2022arXiv

Demystifying Unsupervised Semantic Correspondence Estimation

We explore semantic correspondence estimation through the lens of unsupervised learning. We thoroughly evaluate several recently proposed unsupervised methods across multiple challenging datasets using a standardized evaluation protocol where we vary factors such as the backbone architecture, the pre-training strategy, and the pre-training and finetuning datasets. To better understand the failure modes of these methods, and in order to provide a clearer path for improvement, we provide a new diagnostic framework along with a new performance metric that is better suited to the semantic matching task. Finally, we introduce a new unsupervised correspondence approach which utilizes the strength of pre-trained features while encouraging better matches during training. This results in significantly better matching performance compared to current state-of-the-art methods.

preprint2022arXiv

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. While our community has made great strides in fine-grained visual categorization on images, the counterparts in audio and video fine-grained categorization are relatively unexplored. To encourage advancements in this space, we have carefully constructed the SSW60 dataset to enable researchers to experiment with classifying the same set of categories in three different modalities: images, audio, and video. The dataset covers 60 species of birds and is comprised of images from existing datasets, and brand new, expert-curated audio and video datasets. We thoroughly benchmark audiovisual classification performance and modality fusion experiments through the use of state-of-the-art transformer methods. Our findings show that performance of audiovisual fusion methods is better than using exclusively image or audio based methods for the task of video classification. We also present interesting modality transfer experiments, enabled by the unique construction of SSW60 to encompass three different modalities. We hope the SSW60 dataset and accompanying baselines spur research in this fascinating area.

preprint2022arXiv

On Label Granularity and Object Localization

Weakly supervised object localization (WSOL) aims to learn representations that encode object location using only image-level category labels. However, many objects can be labeled at different levels of granularity. Is it an animal, a bird, or a great horned owl? Which image-level labels should we use? In this paper we study the role of label granularity in WSOL. To facilitate this investigation we introduce iNatLoc500, a new large-scale fine-grained benchmark dataset for WSOL. Surprisingly, we find that choosing the right training label granularity provides a much larger performance boost than choosing the best WSOL algorithm. We also show that changing the label granularity can significantly improve data efficiency.

preprint2022arXiv

Visual Knowledge Tracing

Each year, thousands of people learn new visual categorization tasks -- radiologists learn to recognize tumors, birdwatchers learn to distinguish similar species, and crowd workers learn how to annotate valuable data for applications like autonomous driving. As humans learn, their brain updates the visual features it extracts and attend to, which ultimately informs their final classification decisions. In this work, we propose a novel task of tracing the evolving classification behavior of human learners as they engage in challenging visual classification tasks. We propose models that jointly extract the visual features used by learners as well as predicting the classification functions they utilize. We collect three challenging new datasets from real human learners in order to evaluate the performance of different visual knowledge tracing methods. Our results show that our recurrent models are able to predict the classification behavior of human learners on three challenging medical image and species identification tasks.

preprint2022arXiv

When Does Contrastive Visual Representation Learning Work?

Recent self-supervised representation learning techniques have largely closed the gap between supervised and unsupervised learning on ImageNet classification. While the particulars of pretraining on ImageNet are now relatively well understood, the field still lacks widely accepted best practices for replicating this success on other datasets. As a first step in this direction, we study contrastive self-supervised learning on four diverse large-scale datasets. By looking through the lenses of data quantity, data domain, data quality, and task granularity, we provide new insights into the necessary conditions for successful self-supervised learning. Our key findings include observations such as: (i) the benefit of additional pretraining data beyond 500k images is modest, (ii) adding pretraining images from another domain does not lead to more general representations, (iii) corrupted pretraining images have a disparate impact on supervised and self-supervised pretraining, and (iv) contrastive learning lags far behind supervised learning on fine-grained visual classification tasks.

preprint2020arXiv

Geocoding of trees from street addresses and street-level images

We introduce an approach for updating older tree inventories with geographic coordinates using street-level panorama images and a global optimization framework for tree instance matching. Geolocations of trees in inventories until the early 2000s where recorded using street addresses whereas newer inventories use GPS. Our method retrofits older inventories with geographic coordinates to allow connecting them with newer inventories to facilitate long-term studies on tree mortality etc. What makes this problem challenging is the different number of trees per street address, the heterogeneous appearance of different tree instances in the images, ambiguous tree positions if viewed from multiple images and occlusions. To solve this assignment problem, we (i) detect trees in Google street-view panoramas using deep learning, (ii) combine multi-view detections per tree into a single representation, (iii) and match detected trees with given trees per street address with a global optimization approach. Experiments for > 50000 trees in 5 cities in California, USA, show that we are able to assign geographic coordinates to 38 % of the street trees, which is a good starting point for long-term studies on the ecosystem services value of street trees at large scale.

preprint2020arXiv

Learning Stereo from Single Images

Supervised deep networks are among the best methods for finding correspondences in stereo image pairs. Like all supervised approaches, these networks require ground truth data during training. However, collecting large quantities of accurate dense correspondence data is very challenging. We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. Inspired by recent progress in monocular depth estimation, we generate plausible disparity maps from single images. In turn, we use those flawed disparity maps in a carefully designed pipeline to generate stereo training pairs. Training in this manner makes it possible to convert any collection of single RGB images into stereo training data. This results in a significant reduction in human effort, with no need to collect real depths or to hand-design synthetic data. We can consequently train a stereo matching network from scratch on datasets like COCO, which were previously hard to exploit for stereo. Through extensive experiments we show that our approach outperforms stereo networks trained with standard synthetic datasets, when evaluated on KITTI, ETH3D, and Middlebury.