Source author record

Yu Fan

Yu Fan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Artificial Intelligence, Computation and Language, Applications, eess.IV, Machine Learning, physics.soc-ph, Quantitative Methods, Social and Information Networks

Catalog footprint

What is connected

4 works

8 topics

4 close collaborators

Preprint, 2026, arXiv

Can Reasoning Help Large Language Models Capture Human Annotator Disagreement?

Variation in human annotation (i.e., disagreements) is common in NLP, often reflecting important information like task subjectivity and sample ambiguity. Modeling this variation is important for applications that are sensitive to such information. Although RLVR-style reasoning (Reinforcement Learning with Verifiable Rewards) has improved Large Language Model (LLM) performance on many tasks, it remains unclear whether such reasoning enables LLMs to capture informative variation in human annotation. In this work, we evaluate the influence of different reasoning settings on LLM disagreement modeling. We systematically evaluate each reasoning setting across model sizes, distribution expression methods, and steering methods, resulting in 60 experimental setups across 3 tasks. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling, while naive Chain-of-Thought (CoT) reasoning improves the performance of RLHF LLMs (RL from human feedback). These findings underscore the potential risk of replacing human annotators with reasoning LLMs, especially when disagreements are important.

Preprint, 2026, arXiv

pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs

PDFs are the second-most used document type on the internet (after HTML). Yet, existing QA datasets commonly start from text sources or only address specific domains. In this paper, we present pdfQA, a multi-domain 2K human-annotated (real-pdfQA) and 2K synthetic dataset (syn-pdfQA) differentiating QA pairs in ten complexity dimensions (e.g., file type, source modality, source position, answer type). We apply and evaluate quality and difficulty filters on both datasets, obtaining valid and challenging QA pairs. We answer the questions with open-source LLMs, revealing existing challenges that correlate with our complexity dimensions. pdfQA presents a basis for end-to-end QA pipeline evaluation, testing diverse skill sets and local optimizations (e.g., in information retrieval or parsing).

Preprint, 2020, arXiv

Histopathological imaging features- versus molecular measurements-based cancer prognosis modeling

For most if not all cancers, prognosis is of significant importance, and extensive modeling research has been conducted. With the genetic nature of cancer, in the past two decades, multiple types of molecular data (such as gene expressions and DNA mutations) have been explored. More recently, histopathological imaging data, which is routinely collected in biopsy, has been shown to be informative for modeling prognosis. In this study, using the TCGA LUAD and LUSC data as a showcase, we examine and compare modeling lung cancer overall survival using gene expressions versus histopathological imaging features. High-dimensional regularization methods are adopted for estimation and selection. Our analysis shows that gene expressions have slightly better prognostic performance. In addition, most of the gene expressions are found to be weakly correlated with imaging features. It is expected that this study can provide some insight into utilizing the two types of important data in cancer prognosis modeling and into lung cancer overall survival.

Preprint, 2012, arXiv

Learning Continuous-Time Social Network Dynamics

We demonstrate that a number of sociology models for social network dynamics can be viewed as continuous-time Bayesian networks (CTBNs). A sampling-based approximate inference method for CTBNs can be used as the basis of an expectation-maximization procedure that achieves better accuracy in estimating the parameters of the model than the standard method of moments algorithm from the sociology literature. We extend the existing social network models to allow for indirect and asynchronous observations of the links. A Markov chain Monte Carlo sampling algorithm for this new model permits estimation and inference. We provide results on both a synthetic network (for verification) and real social network data.