Researcher profile

Sarvesh Soni

Sarvesh Soni contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - UnverifiedVerification L1Unclaimed author
3works
0followers
3topics
3close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2026arXiv

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for task like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.

preprint2020arXiv

An Evaluation of Two Commercial Deep Learning-Based Information Retrieval Systems for COVID-19 Literature

The COVID-19 pandemic has resulted in a tremendous need for access to the latest scientific information, primarily through the use of text mining and search tools. This has led to both corpora for biomedical articles related to COVID-19 (such as the CORD-19 corpus (Wang et al., 2020)) as well as search engines to query such data. While most research in search engines is performed in the academic field of information retrieval (IR), most academic search engines$\unicode{x2013}$though rigorously evaluated$\unicode{x2013}$are sparsely utilized, while major commercial web search engines (e.g., Google, Bing) dominate. This relates to COVID-19 because it can be expected that commercial search engines deployed for the pandemic will gain much higher traction than those produced in academic labs, and thus leads to questions about the empirical performance of these search tools. This paper seeks to empirically evaluate two such commercial search engines for COVID-19, produced by Google and Amazon, in comparison to the more academic prototypes evaluated in the context of the TREC-COVID track (Roberts et al., 2020). We performed several steps to reduce bias in the available manual judgments in order to ensure a fair comparison of the two systems with those submitted to TREC-COVID. We find that the top-performing system from TREC-COVID on bpref metric performed the best among the different systems evaluated in this study on all the metrics. This has implications for developing biomedical retrieval systems for future health crises as well as trust in popular health search engines.

preprint2020arXiv

Patient Cohort Retrieval using Transformer Language Models

We apply deep learning-based language models to the task of patient cohort retrieval (CR) with the aim to assess their efficacy. The task of CR requires the extraction of relevant documents from the electronic health records (EHRs) on the basis of a given query. Given the recent advancements in the field of document retrieval, we map the task of CR to a document retrieval task and apply various deep neural models implemented for the general domain tasks. In this paper, we propose a framework for retrieving patient cohorts using neural language models without the need of explicit feature engineering and domain expertise. We find that a majority of our models outperform the BM25 baseline method on various evaluation metrics.