Source author record

Lucila Ohno-Machado

Lucila Ohno-Machado appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computation and Language, Artificial Intelligence, cs.CY, Digital Libraries, Machine Learning, physics.soc-ph, Social and Information Networks

Catalog footprint

What is connected

5 works

7 topics

4 close collaborators


Preprint, 2026, arXiv

Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi. Together, these results establish administrative claims as a scalable substrate for healthcare foundation models and show that learned representations generalize across time periods and data sources, supporting disease surveillance, expenditure forecasting, and RWE generation.

Preprint, 2016, arXiv

Prerequisites for International Exchanges of Health Information: Comparison of Australian, Austrian, Finnish, Swiss, and US Privacy Policies

Capabilities to exchange health information are critical to accelerate discovery and its diffusion to healthcare practice. However, the same ethical and legal policies that protect privacy hinder these data exchanges, and the issues accumulate when data move across geographical or organizational borders. This can be seen as one of the reasons why many health technologies and research findings are limited to very narrow domains. In this paper, we compare how using and disclosing personal data for research purposes is addressed in Australian, Austrian, Finnish, Swiss, and US policies with a focus on text data analytics. Our goal is to identify approaches and issues that enable or hinder international health information exchanges. As expected, the policies within each country are not as diverse as across countries. Most policies apply the principles of accountability and/or adequacy and are thereby fundamentally similar. The following requirements of these policies complicate re-using and re-disclosing data and even secondary data: 1) informing data subjects about the purposes of data collection and use, before the dataset is collected; 2) assurance that the subjects are no longer identifiable; and 3) destruction of data when the research activities are finished. Using storage and compute cloud services as well as other exchange technologies on the Internet without proper permissions is technically not allowed if the data are stored in another country. Both legislation and technologies are available as vehicles for overcoming these barriers. The resulting richness in information variety will contribute to the development and evaluation of new clinical hypotheses and technologies.

Preprint, 2014, arXiv

Natural Language Processing in Biomedicine: A Unified System Architecture Overview

In modern electronic medical records (EMR) much of the clinically important data - signs and symptoms, symptom severity, disease status, etc. - are not provided in structured data fields, but rather are encoded in clinician-generated narrative text. Natural language processing (NLP) provides a means of "unlocking" this important data source for applications in clinical decision support, quality assurance, and public health. This chapter provides an overview of representative NLP systems in biomedicine based on a unified architectural view. A general architecture in an NLP system consists of two main components: background knowledge that includes biomedical knowledge resources and a framework that integrates NLP tools to process text. Systems differ in both components, which we will review briefly. Additionally, challenges facing current research efforts in biomedical NLP include the paucity of large, publicly available annotated corpora, although initiatives that facilitate data sharing, system evaluation, and collaborative work between researchers in clinical NLP are starting to emerge.

Preprint, 2012, arXiv

Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses

Systems that exploit publicly available user generated content such as Twitter messages have been successful in tracking seasonal influenza. We developed a novel filtering method for Influenza-Like-Illnesses (ILI)-related messages using 587 million messages from Twitter micro-blogs. We first filtered messages based on syndrome keywords from the BioCaster Ontology, an extant knowledge model of laymen's terms. We then filtered the messages according to semantic features such as negation, hashtags, emoticons, humor and geography. The data covered 36 weeks for the US 2009 influenza season from 30th August 2009 to 8th May 2010. Results showed that our system achieved the highest Pearson correlation coefficient of 98.46% (p-value<2.2e-16), an improvement of 3.98% over the previous state-of-the-art method. The results indicate that simple NLP-based enhancements to existing approaches to mine Twitter data can increase the value of this inexpensive resource.
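The two-stage filtering and the correlation check described in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the keyword and negation lists below are hypothetical stand-ins (they are not taken from the BioCaster Ontology), the three-token negation window is an assumption, and only the negation feature is shown out of the semantic features the abstract lists. The function names `is_ili_message` and `pearson` are illustrative.

```python
import math
import re

# Hypothetical stand-ins; NOT taken from the BioCaster Ontology.
SYNDROME_KEYWORDS = {"flu", "fever", "cough"}
NEGATION_CUES = {"no", "not", "never", "without"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_ili_message(text, window=3):
    """Stage 1: keep messages containing a syndrome keyword.
    Stage 2: drop a keyword hit if a negation cue appears in the
    `window` tokens before it (the window size is an assumption)."""
    tokens = tokenize(text)
    for i, tok in enumerate(tokens):
        if tok in SYNDROME_KEYWORDS:
            preceding = tokens[max(0, i - window):i]
            if not any(w in NEGATION_CUES for w in preceding):
                return True
    return False


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)
```

In use, one would count the messages passing `is_ili_message` for each week and pass those weekly counts, together with an official weekly ILI series, to `pearson`.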

Preprint, 2012, arXiv

Predicting accurate probabilities with a ranking loss

In many real-world applications of machine learning classifiers, it is essential to predict the probability of an example belonging to a particular class. This paper proposes a simple technique for predicting probabilities based on optimizing a ranking loss, followed by isotonic regression. This semi-parametric technique offers both good ranking and regression performance, and models a richer set of probability distributions than statistical workhorses such as logistic regression. We provide experimental results that show the effectiveness of this technique on real-world applications of probability prediction.
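The second step named in this abstract, isotonic regression applied to ranking scores, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the scores are assumed to come from some model already trained with a ranking loss (not shown), the fit uses the standard pool-adjacent-violators procedure, and the names `fit_isotonic` and `predict_proba` are hypothetical.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from scores to probabilities
    using pool-adjacent-violators (PAV) on binary labels."""
    # Aggregate examples that share a score so ties form one weighted point.
    agg = {}
    for s, y in zip(scores, labels):
        total, count = agg.get(s, (0.0, 0))
        agg[s] = (total + y, count + 1)

    # Each block holds [sum_of_labels, count, max_score_in_block].
    blocks = []
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # Pool adjacent blocks while their means violate monotonicity.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            t, c, hi = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = hi

    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def predict_proba(model, score):
    """Map a ranking score to a probability using the fitted step function."""
    thresholds, values = model
    idx = bisect_left(thresholds, score)
    return values[min(idx, len(values) - 1)]


# Toy data: scores from a hypothetical ranker and their binary labels.
model = fit_isotonic([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 1, 1, 1])
```

Because isotonic regression only requires the mapping to be monotone, it preserves the ordering learned by the ranker while turning scores into calibrated probability estimates.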
