Source author record

Anna Goldenberg

Anna Goldenberg appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Machine Learning; Applications; Artificial Intelligence; Computational Engineering, Finance, and Science; Computer Vision; eess.IV; Genomics; Numerical Analysis

Catalog footprint

What is connected

9 works

8 topics

4 close collaborators

preprint · 2026 · arXiv

LiBaGS: Lightweight Boundary Gap Synthesis for Targeted Synthetic Data Selection

Synthetic data is useful only when the added samples fill missing parts of the training distribution that matter for the downstream task. We introduce LiBaGS, a lightweight, generator-agnostic method for targeted synthetic training data selection. LiBaGS scores candidate synthetic samples by combining decision-boundary proximity, predictive uncertainty, real-data density, and support validity, so that selected samples are both informative and likely to remain on the real data manifold. We then use a boundary-gap allocation rule that targets sparse but realistic decision-boundary neighborhoods, rather than simply adding more data or selecting only the most uncertain candidates. LiBaGS also learns when enough synthetic samples have been added through a marginal-value stopping rule, assigns softer labels near ambiguous boundaries, and uses a diversity objective to avoid redundant near-duplicate selections. Experiments show that LiBaGS improves accuracy over classical oversampling, hard augmentation, uncertainty and density ablations, and targeted-generation selection criteria.
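
As a rough illustration of the kind of scoring the abstract describes, the sketch below multiplies four per-candidate terms and keeps the top-k candidates. It is not the paper's formula: the function names, the exponential proximity term, the product combination, and the input fields (`prob_pos`, `margin`, `density`, `valid`) are all assumptions made for this example. The boundary-gap allocation, stopping rule, soft labels, and diversity objective are omitted.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction (0 when p is 0 or 1)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def candidate_score(prob_pos, margin, density, valid):
    """Hypothetical combined score in [0, 1] for one synthetic candidate.

    prob_pos: classifier probability of the positive class (binary task assumed)
    margin:   non-negative distance to the decision boundary
    density:  estimated real-data density at the candidate, scaled to [0, 1]
    valid:    True if the candidate passes a support-validity check
    """
    if not valid:
        return 0.0  # candidates outside the assumed real-data support are rejected
    proximity = math.exp(-margin)            # 1 on the boundary, decays with distance
    uncertainty = binary_entropy(prob_pos)   # 1 at p = 0.5, 0 at p = 0 or 1
    return proximity * uncertainty * density


def select_top_k(candidates, k):
    """Return the indices of the k highest-scoring candidates."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: candidate_score(**candidates[i]),
        reverse=True,
    )
    return order[:k]
```

A product is used here so that a candidate failing any one criterion scores near zero; a weighted sum would be an equally plausible assumption.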

preprint · 2022 · arXiv

Decoupling Local and Global Representations of Time Series

Real-world time series data are often generated from several sources of variation. Learning representations that capture the factors contributing to this variability enables a better understanding of the data via its underlying generative process and improves performance on downstream machine learning tasks. This paper proposes a novel generative approach for learning representations for the global and local factors of variation in time series. The local representation of each sample models non-stationarity over time with a stochastic process prior, and the global representation of the sample encodes the time-independent characteristics. To encourage decoupling between the representations, we introduce counterfactual regularization that minimizes the mutual information between the two variables. In experiments, we demonstrate successful recovery of the true local and global variability factors on simulated data, and show that representations learned using our method yield superior performance on downstream tasks on real-world datasets. We believe that the proposed way of defining representations is beneficial for data modelling and yields better insights into the complexity of real-world data.

preprint · 2022 · arXiv

NODE-GAM: Neural Generalized Additive Model for Interpretable Deep Learning

Deployment of machine learning models in real high-risk settings (e.g. healthcare) often depends not only on the model's accuracy but also on its fairness, robustness, and interpretability. Generalized Additive Models (GAMs) are a class of interpretable models with a long history of use in these high-risk domains, but they lack desirable features of deep learning such as differentiability and scalability. In this work, we propose a neural GAM (NODE-GAM) and neural GA$^2$M (NODE-GA$^2$M) that scale well and perform better than other GAMs on large datasets, while remaining interpretable compared to other ensemble and deep learning models. We demonstrate that our models find interesting patterns in the data. Lastly, we show that we improve model accuracy via self-supervised pre-training, an improvement that is not possible for non-differentiable GAMs.
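
For reference, the standard form of a GAM and of its pairwise extension GA$^2$M can be written as below. The notation ($g$ for the link function, $f_j$ for per-feature shape functions, $f_{jk}$ for pairwise interaction terms, $d$ for the number of features) is the usual textbook one and is assumed here, not taken from the abstract.

```latex
% GAM: the prediction is a sum of one-dimensional shape functions.
g\big(\mathbb{E}[y \mid x]\big) = \beta_0 + \sum_{j=1}^{d} f_j(x_j)

% GA^2M: additionally allows pairwise feature interactions.
g\big(\mathbb{E}[y \mid x]\big) = \beta_0 + \sum_{j=1}^{d} f_j(x_j)
    + \sum_{j < k} f_{jk}(x_j, x_k)
```

Because each term depends on at most one or two features, every $f_j$ and $f_{jk}$ can be plotted directly, which is the sense in which such models are called interpretable.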

preprint · 2020 · arXiv

A Comprehensive Evaluation of Multi-task Learning and Multi-task Pre-training on EHR Time-series Data

Multi-task learning (MTL) is a machine learning technique aiming to improve model performance by leveraging information across many tasks. It has been used extensively on various data modalities, including electronic health record (EHR) data. However, despite significant use on EHR data, there has been little systematic investigation of the utility of MTL across the diverse set of possible tasks and training schemes of interest in healthcare. In this work, we examine MTL across a battery of tasks on EHR time-series data. We find that while MTL does suffer from common negative transfer, we can realize significant gains via MTL pre-training combined with single-task fine-tuning. We demonstrate that these gains can be achieved in a task-independent manner and offer not only minor improvements under traditional learning, but also notable gains in a few-shot learning context, thereby suggesting this could be a scalable vehicle to offer improved performance in important healthcare contexts.

preprint · 2020 · arXiv

Using Generative Models for Pediatric wbMRI

Early detection of cancer is key to a good prognosis and requires frequent testing, especially in pediatrics. Whole-body magnetic resonance imaging (wbMRI) is an essential part of several well-established screening protocols, with screening starting in early childhood. To date, machine learning (ML) has been used on wbMRI images to stage adult cancer patients. It is not possible to use such tools in pediatrics due to the changing bone signal throughout growth, the difficulty of obtaining these images in young children due to movement and limited compliance, and the rarity of positive cases. We evaluate the quality of wbMRI images generated using generative adversarial networks (GANs) trained on wbMRI data from The Hospital for Sick Children in Toronto. We use the Fréchet Inception Distance (FID) metric, Domain Fréchet Distance (DFD), and blind tests with a radiology fellow for evaluation. We demonstrate that StyleGAN2 provides the best performance in generating wbMRI images with respect to all three metrics.
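
For readers unfamiliar with the FID metric named above, its usual definition is given below. The symbols are assumptions of this note, not of the abstract: $\mu_r, \Sigma_r$ denote the mean and covariance of Inception-network features of real images, and $\mu_g, \Sigma_g$ those of generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
    + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
    - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the generated feature distribution is closer to the real one; the value is zero when the two means and covariances coincide.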

preprint · 2016 · arXiv

Modeling trajectories of mental health: challenges and opportunities

More than two thirds of mental health problems have their onset during childhood or adolescence. Identifying children at risk for mental illness later in life and predicting the type of illness is not easy. We set out to develop a platform to define subtypes of childhood social-emotional development using longitudinal, multifactorial trait-based measures. Subtypes discovered through this study could ultimately advance psychiatric knowledge of the early behavioural signs of mental illness. To this end, we have examined two types of models: latent class mixture models (LCMMs) and Gaussian process (GP) based models. Our findings indicate that while GP models come close in accuracy of predicting future trajectories, LCMMs predict the trajectories as well in a fraction of the time. Unfortunately, neither model is currently accurate enough to lead to immediate clinical impact. The available data on the development of childhood mental health are often sparse, with only a few time points measured, and require novel methods with improved efficiency and accuracy.

preprint · 2015 · arXiv

Combining exome and gene expression datasets in one graphical model of disease to empower the discovery of disease mechanisms

Identifying genes associated with complex human diseases is one of the main challenges of human genetics and computational medicine. To answer this question, millions of genetic variants get screened to identify a few of importance. To increase the power of identifying genes associated with diseases and to account for other potential sources of protein function aberrations, we propose a novel factor-graph based model, where much of the biological knowledge is incorporated through factors and priors. Our extensive simulations show that our method has superior sensitivity and precision compared to variant-aggregating and differential expression methods. Our integrative approach was able to identify important genes in breast cancer, identifying genes that had coding aberrations in some patients and regulatory abnormalities in others, emphasizing the importance of data integration to explain the disease in a larger number of patients.

preprint · 2014 · arXiv

EquiNMF: Graph Regularized Multiview Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) methods have proved to be powerful across a wide range of real-world clustering applications. Integrating multiple types of measurements for the same objects/subjects allows us to gain a deeper understanding of the data and refine the clustering. We have developed a novel graph-regularized multiview NMF-based method for data integration called EquiNMF. The parameters for our method are set in a completely automated, data-specific, unsupervised fashion, a highly desirable property in real-world applications. We performed extensive and comprehensive experiments on multiview imaging data. We show that EquiNMF consistently outperforms other single-view NMF methods used on concatenated data and multi-view NMF methods with different types of regularizations.
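
As background for the NMF building block mentioned above, here is a minimal pure-Python sketch of plain single-view NMF using the classic multiplicative update rules for the squared Frobenius error. It is not EquiNMF: there is no graph regularization and no multiview coupling, and the helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`) and default parameters are assumptions made for this example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix V (n x m) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive random initialisation keeps the updates well defined.
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), applied element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), applied element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout. For clustering, each row of W is typically read as a soft assignment of that object to the `rank` latent components.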

preprint · 2014 · arXiv

Gradient-based Laplacian Feature Selection

Analysis of high-dimensional noisy data is essential across a variety of research fields. Feature selection techniques are designed to find the relevant feature subset that can facilitate classification or pattern detection. Traditional (supervised) feature selection methods utilize label information to guide the identification of relevant feature subsets. In this paper, however, we consider the unsupervised feature selection problem. Without the label information, it is particularly difficult to identify a small set of relevant features due to the noisy nature of real-world data, which corrupts the intrinsic structure of the data. Our Gradient-based Laplacian Feature Selection (GLFS) selects important features by minimizing the variance of the Laplacian regularized least squares regression model. With $\ell_1$ relaxation, GLFS can find a sparse subset of features that is relevant to the Laplacian manifolds. Extensive experiments on simulated data, three real-world object recognition datasets, and two computational biology datasets illustrate the power and superior performance of our approach over multiple state-of-the-art unsupervised feature selection methods. Additionally, we show that GLFS selects a sparser set of more relevant features in a supervised setting, outperforming the popular elastic net methodology.
