Researcher profile

Sarah Ostadabbas

Sarah Ostadabbas contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
13works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

13 published item(s)

preprint2026arXiv

PanoWorld: Geometry-Consistent Panoramic Video World Modeling

We present PanoWorld, a panoramic video world model that generates geometry-consistent 360$\degree$ video from a single image and a caption. Existing panoramic video methods optimize primarily for visual realism and do not explicitly constrain the underlying 3D scene state, producing outputs that appear plausible yet exhibit inconsistent depth, broken correspondences, and implausible motion across the spherical surface. We address this gap by framing panoramic video generation as a geometry- and dynamics-consistent latent state modeling problem rather than pure visual synthesis. Building on a pre-trained perspective video world model, we introduce two lightweight regularizers: a depth consistency loss against pseudo ground-truth panoramic depth, and a trajectory consistency loss that supervises the 3D world-frame positions of tracked points across time. We further apply spherical-geometry-aware adaptation to the conditioning and positional encoding. We additionally introduce PanoGeo, a unified geometry-aware panoramic video dataset with consistent depth, trajectory, and prompt annotations across diverse real and synthetic sources, used for both training and stratified evaluation. Experiments show that PanoWorld improves geometric consistency over prior panoramic generation methods while maintaining competitive visual realism, establishing that panoramic video generation must be treated as a geometric modeling problem to support the holistic spatial understanding requirements of embodied AI applications. Code is available at https://github.com/ostadabbas/PanoWorld.

preprint2026arXiv

PhyGround: Benchmarking Physical Reasoning in Generative World Models

Generative world models are increasingly used for video generation, where learned simulators are expected to capture the physical rules that govern real-world dynamics. However, evaluating whether generated videos actually follow these rules remains challenging. Existing physics-focused video benchmarks have made important progress, but they still face three key challenges, including the coarse evaluation frameworks that hide law-specific failures, response biases and fatigue that undermine the validity of annotation judgments, and automated evaluators that are insufficiently physics-aware or difficult to audit. To address those challenges, we introduce PhyGround, a criteria-grounded benchmark for evaluating physical reasoning in video generation. The benchmark contains 250 curated prompts, each augmented with an expected physical outcome, and a taxonomy of 13 physical laws across solid-body mechanics, fluid dynamics, and optics. Each law is operationalized through observable sub-questions to enable per-law diagnostics. We evaluate eight modern video generation models through a large-scale, quality-controlled human study, grounded on social science lab experiment design. A total of 459 annotators provided 5,796 complete annotations and over 37.4K fine-grained labels; after quality control, the retained annotations exhibited high split-half model-ranking correlations (Spearman's rho > 0.90). To support reproducible automated evaluation, we release PhyJudge-9B, an open physics-specialized VLM judge. PhyJudge-9B achieves substantially lower aggregate relative bias than Gemini-3.1-Pro (3.3% vs. 16.6%). We release prompts, human annotations, model checkpoints, and evaluation code on the project page https://phyground.github.io/.

preprint2022arXiv

Adapted Human Pose: Monocular 3D Human Pose Estimation with Zero Real 3D Pose Data

The ultimate goal for an inference model is to be robust and functional in real life applications. However, training vs. test data domain gaps often negatively affect model performance. This issue is especially critical for the monocular 3D human pose estimation problem, in which 3D human data is often collected in a controlled lab setting. In this paper, we focus on alleviating the negative effect of domain shift in both appearance and pose space for 3D human pose estimation by presenting our adapted human pose (AHuP) approach. AHuP is built upon two key components: (1) semantically aware adaptation (SAA) for the cross-domain feature space adaptation, and (2) skeletal pose adaptation (SPA) for the pose space adaptation which takes only limited information from the target domain. By using zero real 3D human pose data, one of our adapted synthetic models shows comparable performance with the SOTA pose estimation models trained with large scale real 3D human datasets. The proposed SPA can be also employed independently as a light-weighted head to improve existing SOTA models in a novel context. A new 3D scan-based synthetic human dataset called ScanAva+ is also going to be publicly released with this work.

preprint2022arXiv

Computer Vision to the Rescue: Infant Postural Symmetry Estimation from Incongruent Annotations

Bilateral postural symmetry plays a key role as a potential risk marker for autism spectrum disorder (ASD) and as a symptom of congenital muscular torticollis (CMT) in infants, but current methods of assessing symmetry require laborious clinical expert assessments. In this paper, we develop a computer vision based infant symmetry assessment system, leveraging 3D human pose estimation for infants. Evaluation and calibration of our system against ground truth assessments is complicated by our findings from a survey of human ratings of angle and symmetry, that such ratings exhibit low inter-rater reliability. To rectify this, we develop a Bayesian estimator of the ground truth derived from a probabilistic graphical model of fallible human raters. We show that the 3D infant pose estimation model can achieve 68% area under the receiver operating characteristic curve performance in predicting the Bayesian aggregate labels, compared to only 61% from a 2D infant pose estimation model and 60% from a 3D adult pose estimation model, highlighting the importance of 3D poses and infant domain knowledge in assessing infant body symmetry. Our survey analysis also suggests that human ratings are susceptible to higher levels of bias and inconsistency, and hence our final 3D pose-based symmetry assessment system is calibrated but not directly supervised by Bayesian aggregate human ratings, yielding higher levels of consistency and lower levels of inter-limb assessment bias.

preprint2022arXiv

InfAnFace: Bridging the infant-adult domain gap in facial landmark estimation in the wild

We lay the groundwork for research in the algorithmic comprehension of infant faces, in anticipation of applications from healthcare to psychology, especially in the early prediction of developmental disorders. Specifically, we introduce the first-ever dataset of infant faces annotated with facial landmark coordinates and pose attributes, demonstrate the inadequacies of existing facial landmark estimation algorithms in the infant domain, and train new state-of-the-art models that significantly improve upon those algorithms using domain adaptation techniques. We touch on the closely related task of facial detection for infants, and also on a challenging case study of infrared baby monitor images gathered by our lab as part of in-field research into the aforementioned developmental issues.

preprint2022arXiv

Interpreting Face Inference Models using Hierarchical Network Dissection

This paper presents Hierarchical Network Dissection, a general pipeline to interpret the internal representation of face-centric inference models. Using a probabilistic formulation, our pipeline pairs units of the model with concepts in our "Face Dictionary", a collection of facial concepts with corresponding sample images. Our pipeline is inspired by Network Dissection, a popular interpretability model for object-centric and scene-centric models. However, our formulation allows to deal with two important challenges of face-centric models that Network Dissection cannot address: (1) spacial overlap of concepts: there are different facial concepts that simultaneously occur in the same region of the image, like "nose" (facial part) and "pointy nose" (facial attribute); and (2) global concepts: there are units with affinity to concepts that do not refer to specific locations of the face (e.g. apparent age). We use Hierarchical Network Dissection to dissect different face-centric inference models trained on widely-used facial datasets. The results show models trained for different tasks learned different internal representations. Furthermore, the interpretability results can reveal some biases in the training data and some interesting characteristics of the face-centric inference tasks. Finally, we conduct controlled experiments on biased data to showcase the potential of Hierarchical Network Dissection for bias discovery. The results illustrate how Hierarchical Network Dissection can be used to discover and quantify bias in the training data that is also encoded in the model.

preprint2022arXiv

Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation

3D Human body pose and shape estimation within a temporal sequence can be quite critical for understanding human behavior. Despite the significant progress in human pose estimation in the recent years, which are often based on single images or videos, human motion estimation on live stream videos is still a rarely-touched area considering its special requirements for real-time output and temporal consistency. To address this problem, we present a temporally embedded 3D human body pose and shape estimation (TePose) method to improve the accuracy and temporal consistency of pose estimation in live stream videos. TePose uses previous predictions as a bridge to feedback the error for better estimation in the current frame and to learn the correspondence between data frames and predictions in the history. A multi-scale spatio-temporal graph convolutional network is presented as the motion discriminator for adversarial training using datasets without any 3D labeling. We propose a sequential data loading strategy to meet the special start-to-end data processing requirement of live stream. We demonstrate the importance of each proposed module with extensive experiments. The results show the effectiveness of TePose on widely-used human pose benchmarks with state-of-the-art performance.

preprint2022arXiv

Pressure Eye: In-bed Contact Pressure Estimation via Contact-less Imaging

Computer vision has achieved great success in interpreting semantic meanings from images, yet estimating underlying (non-visual) physical properties of an object is often limited to their bulk values rather than reconstructing a dense map. In this work, we present our pressure eye (PEye) approach to estimate contact pressure between a human body and the surface she is lying on with high resolution from vision signals directly. PEye approach could ultimately enable the prediction and early detection of pressure ulcers in bed-bound patients, that currently depends on the use of expensive pressure mats. Our PEye network is configured in a dual encoding shared decoding form to fuse visual cues and some relevant physical parameters in order to reconstruct high resolution pressure maps (PMs). We also present a pixel-wise resampling approach based on Naive Bayes assumption to further enhance the PM regression performance. A percentage of correct sensing (PCS) tailored for sensing estimation accuracy evaluation is also proposed which provides another perspective for performance evaluation under varying error tolerances. We tested our approach via a series of extensive experiments using multimodal sensing technologies to collect data from 102 subjects while lying on a bed. The individual's high resolution contact pressure data could be estimated from their RGB or long wavelength infrared (LWIR) images with 91.8% and 91.2% estimation accuracies in $PCS_{efs0.1}$ criteria, superior to state-of-the-art methods in the related image regression/translation tasks.

preprint2022arXiv

Prior-Aware Synthetic Data to the Rescue: Animal Pose Estimation with Very Limited Real Data

Accurately annotated image datasets are essential components for studying animal behaviors from their poses. Compared to the number of species we know and may exist, the existing labeled pose datasets cover only a small portion of them, while building comprehensive large-scale datasets is prohibitively expensive. Here, we present a very data efficient strategy targeted for pose estimation in quadrupeds that requires only a small amount of real images from the target animal. It is confirmed that fine-tuning a backbone network with pretrained weights on generic image datasets such as ImageNet can mitigate the high demand for target animal pose data and shorten the training time by learning the the prior knowledge of object segmentation and keypoint estimation in advance. However, when faced with serious data scarcity (i.e., $<10^2$ real images), the model performance stays unsatisfactory, particularly for limbs with considerable flexibility and several comparable parts. We therefore introduce a prior-aware synthetic animal data generation pipeline called PASyn to augment the animal pose data essential for robust pose estimation. PASyn generates a probabilistically-valid synthetic pose dataset, SynAP, through training a variational generative model on several animated 3D animal models. In addition, a style transfer strategy is utilized to blend the synthetic animal image into the real backgrounds. We evaluate the improvement made by our approach with three popular backbone networks and test their pose estimation accuracy on publicly available animal pose images as well as collected from real animals in a zoo.

preprint2020arXiv

Deep Markov Spatio-Temporal Factorization

We introduce deep Markov spatio-temporal factorization (DMSTF), a generative model for dynamical analysis of spatio-temporal data. Like other factor analysis methods, DMSTF approximates high dimensional data by a product between time dependent weights and spatially dependent factors. These weights and factors are in turn represented in terms of lower dimensional latents inferred using stochastic variational inference. The innovation in DMSTF is that we parameterize weights in terms of a deep Markovian prior extendable with a discrete latent, which is able to characterize nonlinear multimodal temporal dynamics, and perform multidimensional time series forecasting. DMSTF learns a low dimensional spatial latent to generatively parameterize spatial factors or their functional forms in order to accommodate high spatial dimensionality. We parameterize the corresponding variational distribution using a bidirectional recurrent network in the low-level latent representations. This results in a flexible family of hierarchical deep generative factor analysis models that can be extended to perform time series clustering or perform factor analysis in the presence of a control signal. Our experiments, which include simulated and real-world data, demonstrate that DMSTF outperforms related methodologies in terms of predictive performance for unseen data, reveals meaningful clusters in the data, and performs forecasting in a variety of domains with potentially nonlinear temporal transitions.

preprint2020arXiv

Deep Switching Auto-Regressive Factorization:Application to Time Series Forecasting

We introduce deep switching auto-regressive factorization (DSARF), a deep generative model for spatio-temporal data with the capability to unravel recurring patterns in the data and perform robust short- and long-term predictions. Similar to other factor analysis methods, DSARF approximates high dimensional data by a product between time dependent weights and spatially dependent factors. These weights and factors are in turn represented in terms of lower dimensional latent variables that are inferred using stochastic variational inference. DSARF is different from the state-of-the-art techniques in that it parameterizes the weights in terms of a deep switching vector auto-regressive likelihood governed with a Markovian prior, which is able to capture the non-linear inter-dependencies among weights to characterize multimodal temporal dynamics. This results in a flexible hierarchical deep generative factor analysis model that can be extended to (i) provide a collection of potentially interpretable states abstracted from the process dynamics, and (ii) perform short- and long-term vector time series prediction in a complex multi-relational setting. Our extensive experiments, which include simulated data and real data from a wide range of applications such as climate change, weather forecasting, traffic, infectious disease spread and nonlinear physical systems attest the superior performance of DSARF in terms of long- and short-term prediction error, when compared with the state-of-the-art methods.

preprint2020arXiv

G-LBM:Generative Low-dimensional Background Model Estimation from Video Sequences

In this paper, we propose a computationally tractable and theoretically supported non-linear low-dimensional generative model to represent real-world data in the presence of noise and sparse outliers. The non-linear low-dimensional manifold discovery of data is done through describing a joint distribution over observations, and their low-dimensional representations (i.e. manifold coordinates). Our model, called generative low-dimensional background model (G-LBM) admits variational operations on the distribution of the manifold coordinates and simultaneously generates a low-rank structure of the latent manifold given the data. Therefore, our probabilistic model contains the intuition of the non-probabilistic low-dimensional manifold learning. G-LBM selects the intrinsic dimensionality of the underling manifold of the observations, and its probabilistic nature models the noise in the observation data. G-LBM has direct application in the background scenes model estimation from video sequences and we have evaluated its performance on SBMnet-2016 and BMC2012 datasets, where it achieved a performance higher or comparable to other state-of-the-art methods while being agnostic to the background scenes in videos. Besides, in challenges such as camera jitter and background motion, G-LBM is able to robustly estimate the background by effectively modeling the uncertainties in video observations in these scenarios.

preprint2020arXiv

Simultaneously-Collected Multimodal Lying Pose Dataset: Towards In-Bed Human Pose Monitoring under Adverse Vision Conditions

Computer vision (CV) has achieved great success in interpreting semantic meanings from images, yet CV algorithms can be brittle for tasks with adverse vision conditions and the ones suffering from data/label pair limitation. One of this tasks is in-bed human pose estimation, which has significant values in many healthcare applications. In-bed pose monitoring in natural settings could involve complete darkness or full occlusion. Furthermore, the lack of publicly available in-bed pose datasets hinders the use of many successful pose estimation algorithms for this task. In this paper, we introduce our Simultaneously-collected multimodal Lying Pose (SLP) dataset, which includes in-bed pose images from 109 participants captured using multiple imaging modalities including RGB, long wave infrared, depth, and pressure map. We also present a physical hyper parameter tuning strategy for ground truth pose label generation under extreme conditions such as lights off and being fully covered by a sheet/blanket. SLP design is compatible with the mainstream human pose datasets, therefore, the state-of-the-art 2D pose estimation models can be trained effectively with SLP data with promising performance as high as 95% at PCKh@0.5 on a single modality. The pose estimation performance can be further improved by including additional modalities through collaboration.