Source author record

Helen Chen

Helen Chen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning; Artificial Intelligence; Computation and Language; Other Quantitative Biology; physics.flu-dyn; physics.med-ph; Social and Information Networks

Catalog footprint

What is connected

5 works

7 topics

4 close collaborators


preprint, 2026, arXiv

SynQP: A Framework and Metrics for Evaluating the Quality and Privacy Risk of Synthetic Data

The use of synthetic data in health applications raises privacy concerns, yet the lack of open frameworks for privacy evaluations has slowed its adoption. A major challenge is the absence of accessible benchmark datasets for evaluating privacy risks, due to difficulties in acquiring sensitive data. To address this, we introduce SynQP, an open framework for benchmarking privacy in synthetic data generation (SDG) using simulated sensitive data, ensuring that original data remains confidential. We also highlight the need for privacy metrics that fairly account for the probabilistic nature of machine learning models. As a demonstration, we use SynQP to benchmark CTGAN and propose a new identity disclosure risk metric that offers a more accurate estimation of privacy risks compared to existing approaches. Our work provides a critical tool for improving the transparency and reliability of privacy evaluations, enabling safer use of synthetic data in health-related applications. In our quality evaluations, non-private models achieved near-perfect machine-learning efficacy (≥ 0.97). Our privacy assessments (Table II) reveal that differential privacy (DP) consistently lowers both identity disclosure risk (SD-IDR) and membership-inference attack risk (SD-MIA), with all DP-augmented models staying below the 0.09 regulatory threshold. Code available at https://github.com/CAN-SYNH/SynQP

preprint, 2022, arXiv

ICDBigBird: A Contextual Embedding Model for ICD Code Classification

The International Classification of Diseases (ICD) system is the international standard for classifying diseases and procedures during a healthcare encounter and is widely used for healthcare reporting and management purposes. Assigning correct codes for clinical procedures is important for clinical, operational, and financial decision-making in healthcare. Contextual word embedding models have achieved state-of-the-art results in multiple NLP tasks. However, these models have yet to achieve state-of-the-art results in the ICD classification task, since one of their main disadvantages is that they can only process documents that contain a small number of tokens, which is rarely the case with real patient notes. In this paper, we introduce ICDBigBird, a BigBird-based model that can integrate a Graph Convolutional Network (GCN), which takes advantage of the relations between ICD codes to create 'enriched' representations of their embeddings, with a BigBird contextual model that can process larger documents. Our experiments on a real-world clinical dataset demonstrate the effectiveness of our BigBird-based model on the ICD classification task, as it outperforms the previous state-of-the-art models.
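The abstract does not specify how the GCN over ICD codes is built, so the following is only a minimal sketch of one generic graph-convolution step, assuming the commonly used symmetric-normalized propagation rule and a hypothetical adjacency matrix that marks related codes. The function names, the adjacency definition, and the weight matrix are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adjacency: Matrix, features: Matrix, weights: Matrix) -> Matrix:
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 . H . W).

    adjacency: hypothetical code-to-code relation matrix (assumption).
    features:  one embedding row per ICD code.
    weights:   layer weight matrix (learned in practice, fixed here).
    """
    n = len(adjacency)
    # Add self-loops so each code keeps part of its own embedding.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization by the square roots of the node degrees.
    norm = [
        [a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]
    hidden = matmul(matmul(norm, features), weights)
    # ReLU non-linearity.
    return [[max(0.0, value) for value in row] for row in hidden]
```

With two mutually related codes and identity features and weights, each output row becomes an equal blend of both code embeddings, which is the sense in which neighbouring codes "enrich" one another in this sketch.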

preprint, 2022, arXiv

LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution

Lexical substitution is the task of generating meaningful substitutes for a word in a given textual context. Contextual word embedding models have achieved state-of-the-art results in the lexical substitution task by relying on contextual information extracted from the replaced word within the sentence. However, such models do not take into account structured knowledge that exists in external lexical databases. We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models that can identify highly accurate substitute candidates. This is achieved by combining contextual information with knowledge from structured lexical resources. Our approach involves: (i) introducing a novel mix-up embedding strategy in the creation of the input embedding of the target word by linearly interpolating between the target input embedding and the average embedding of its probable synonyms; (ii) considering the similarity of the sentence-definition embeddings of the target word and its proposed candidates; and (iii) calculating the effect of each substitution on the semantics of the sentence through a fine-tuned sentence similarity model. Our experiments show that LexSubCon outperforms previous state-of-the-art methods on the LS07 and CoInCo benchmark datasets, which are widely used for lexical substitution tasks.
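The mix-up step in (i) can be illustrated with a short sketch. This is a minimal example of linear interpolation between a target embedding and the mean of its synonym embeddings. The function name, the interpolation weight `lam`, and the choice to weight the target by `lam` are assumptions made for illustration; the abstract does not state them.

```python
from typing import List, Sequence


def mixup_embedding(
    target: Sequence[float],
    synonym_embeddings: Sequence[Sequence[float]],
    lam: float = 0.5,
) -> List[float]:
    """Return lam * target + (1 - lam) * mean(synonym_embeddings).

    lam is a hypothetical interpolation weight in [0, 1] (assumption).
    If no synonyms are available, the target embedding is returned unchanged.
    """
    if not synonym_embeddings:
        return list(target)
    count = len(synonym_embeddings)
    # Average the synonym embeddings dimension by dimension.
    synonym_mean = [
        sum(vec[i] for vec in synonym_embeddings) / count
        for i in range(len(target))
    ]
    # Linear interpolation between the target and the synonym average.
    return [lam * t + (1.0 - lam) * s for t, s in zip(target, synonym_mean)]
```

For example, `mixup_embedding([1.0, 0.0], [[0.0, 1.0], [0.0, 3.0]], lam=0.5)` averages the two synonym vectors to `[0.0, 2.0]` and then blends that equally with the target.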

preprint, 2021, arXiv

What social media told about us in the time of COVID-19: a scoping review

With the onset of the COVID-19 pandemic, social media has rapidly become a crucial communication tool for information generation, dissemination, and consumption. In this scoping review, we selected and examined peer-reviewed empirical studies relating to COVID-19 and social media during the first outbreak, from November 2019 until May 2020. From an analysis of 81 studies, we identified six overarching public health themes concerning the role of online social platforms and COVID-19. These themes focused on: (i) surveying public attitudes, (ii) identifying infodemics, (iii) assessing mental health, (iv) detecting or predicting COVID-19 cases, (v) analyzing government responses to the pandemic, and (vi) evaluating the quality of health information in prevention education videos. Furthermore, our review highlights the paucity of studies on the application of machine learning to social media data related to COVID-19 and a lack of studies documenting real-time surveillance developed with social media data on COVID-19. For COVID-19, social media can play a crucial role in disseminating health information as well as tackling infodemics and misinformation.

preprint, 2020, arXiv

In situ Measurement of Airborne Particle Concentration in a Real Dental Office: Implications for Disease Transmission

Recent guidelines by WHO recommend delaying non-essential oral health care amid the COVID-19 pandemic and call for research on aerosols generated during dental procedures. Thus, this study aims to assess the mechanisms of dental aerosol dispersion in dental offices and to provide recommendations based on a quantitative study to minimize infection transmission in dental offices. The spread and removal of aerosol particles generated from dental procedures in a dental office are measured near the source and at the corner of the office. We studied the effects of air purification (on/off), door condition (open/closed), and particle size on the temporal concentration distribution of particles. The results show that in the worst-case scenario it takes 95 min for 0.5 µm particles to settle, and that larger particles settle in a shorter time. The indoor air purifier tested shortened the removal time by a factor of at least 6.3 compared with the scenario in which the air purifier was off. Airborne particles may be transported from the source to the rest of the room, even when the particle concentrations in the generation zone return to the background level. These results are expected to be valuable to related policy making and technology development for infectious disease control in dental offices and similar built environments.