Researcher profile

Yuhui Zhang

Yuhui Zhang contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
7 works
0 followers
9 topics
4 close collaborators

Published work

7 published items

preprint | 2026 | arXiv

Anisotropic Modality Align

Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further propose the principle of anisotropic modality gap alignment: effective modality alignment should align with the target-modality distribution while preserving the semantic structure of the source modality. Guided by this principle, we propose an anisotropic geometric correction framework, AnisoAlign, for unpaired modality alignment. This framework leverages the internal geometric prior of the target modality and performs bounded correction on source-modality representations, thereby constructing substitute representations in the target modality. Experiments confirm its benefits in both geometric diagnostics and text-only MLLM training. Overall, this work recasts the modality gap from an empirical observation into a correctable, structured geometric phenomenon and provides a new representation alignment perspective for training multimodal models with unimodal data.

preprint | 2026 | arXiv

RadDiff: Describing Differences in Radiology Image Sets with Natural Language

Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning to describe clinically meaningful differences between paired radiology studies. RadDiff builds on a proposer-ranker framework from VisDiff, and incorporates four innovations inspired by real diagnostic workflows: (1) medical knowledge injection through domain-adapted vision-language models; (2) multimodal reasoning that integrates images with their clinical reports; (3) iterative hypothesis refinement across multiple reasoning rounds; and (4) targeted visual search that localizes and zooms in on salient regions to capture subtle findings. To evaluate RadDiff, we construct RadDiffBench, a challenging benchmark comprising 57 expert-validated radiology study pairs with ground-truth difference descriptions. On RadDiffBench, RadDiff achieves 47% accuracy, and 50% accuracy when guided by ground-truth reports, significantly outperforming the general-domain VisDiff baseline. We further demonstrate RadDiff's versatility across diverse clinical tasks, including COVID-19 phenotype comparison, racial subgroup analysis, and discovery of survival-related imaging features. Together, RadDiff and RadDiffBench provide the first method-and-benchmark foundation for systematically uncovering meaningful differences in radiological data.

preprint | 2022 | arXiv

On the Opportunities and Risks of Foundation Models

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

preprint | 2021 | arXiv

Collision dominated, ballistic, and viscous regimes of terahertz plasmonic detection by graphene

The terahertz detection performance and operating regimes of graphene plasmonic field-effect transistors (FETs) were investigated by a hydrodynamic model. Continuous wave detection simulations showed that the graphene response sensitivity is similar to that of other materials including Si, InGaAs, GaN, and diamond-based FETs. However, the pulse detection results indicated a very short response time, which favors rapid, high-sensitivity detection. Analysis of the mobility dependence of the response time revealed the same detection regimes as in traditional semiconductor materials, i.e. the non-resonant (collision dominated) regime, the resonant ballistic regime, and the viscous regime. When the kinematic viscosity (ν) is above a certain critical viscosity value, ν_NR, the plasmonic FET always operates in the viscous non-resonant regime regardless of channel length (L). In this regime, the response time rises monotonically with the increase of L. When ν < ν_NR, the plasmonic resonance can be reached in a certain range of L (i.e. the resonant window). Within this window, the carrier transport is ballistic. For a sufficiently short channel, the graphene devices would always operate in the non-resonant regime regardless of the field-effect mobility, corresponding to another viscous regime. The above work mapped the operating regimes of graphene plasmonic FETs, and demonstrated the significance of the viscous effects for the graphene plasmonic detection. These results could be used for the extraction of the temperature dependences of viscosity in graphene.

preprint | 2020 | arXiv

Biomedical and Clinical English Model Packages in the Stanza Python NLP Library

We introduce biomedical and clinical English model packages for the Stanza Python NLP library. These packages offer accurate syntactic analysis and named entity recognition capabilities for biomedical and clinical text, by combining Stanza's fully neural architecture with a wide variety of open datasets as well as large-scale unsupervised biomedical and clinical text data. We show via extensive experiments that our packages achieve syntactic analysis and named entity recognition performance that is on par with or surpasses state-of-the-art results. We further show that these models do not compromise speed compared to existing toolkits when GPU acceleration is available, and are made easy to download and use with Stanza's Python interface. A demonstration of our packages is available at: http://stanza.run/bio.

preprint | 2020 | arXiv

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza.

preprint | 2020 | arXiv

Ultrashort Pulse Detection and Response Time Analysis Using Plasma-wave Terahertz Field Effect Transistors

We report on the response characteristics of plasmonic terahertz field-effect transistors (TeraFETs) fed with femtosecond and picosecond pulses. Varying the pulse width (t_pw) from 10^-15 s to 10^-10 s under a constant input power condition revealed two distinctive pulse detection modes. In the short pulse mode (t_pw << L/s, where L is the gated channel length and s is the plasma velocity), the source-to-drain voltage response is a sharp pulse with an oscillatory decay, preceded by a delay time on the order of L/s. The plasma wave travels along the channel like a shallow water wave with a relatively narrow wave packet. In the long pulse mode (t_pw > L/s), the response profile has two oscillatory decay processes and the propagation of the plasma wave is analogous to an oscillating rod with one side fixed. The ultimate response time in the long pulse mode is significantly higher than that under the short pulse conditions. The detection conditions under the long pulse mode are close to the step response condition, and the response time conforms well to the analytical theory for the step function response. The simulated waveform agrees well with the measured pulse response. Our results show that measurements of the pulse response enable the extraction of material parameters (including the effective mass, kinematic viscosity and momentum relaxation time).
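
The two detection modes in the abstract above are separated by comparing the pulse width t_pw with the transit time L/s. The following is a minimal Python sketch of that comparison only, not the authors' simulation. The function names, the factor of 10 used to interpret "<<", the "intermediate" label, and the example channel length and plasma velocity are all illustrative assumptions that do not come from the paper.

```python
def transit_time(channel_length_m: float, plasma_velocity_m_per_s: float) -> float:
    """Return L/s, the time a plasma wave needs to cross the gated channel."""
    if channel_length_m <= 0 or plasma_velocity_m_per_s <= 0:
        raise ValueError("channel length and plasma velocity must be positive")
    return channel_length_m / plasma_velocity_m_per_s


def pulse_mode(t_pw_s: float, channel_length_m: float,
               plasma_velocity_m_per_s: float, short_margin: float = 10.0) -> str:
    """Classify a pulse width against L/s.

    short_margin is an assumed numeric reading of "<<": the pulse counts as
    short when t_pw is at least short_margin times smaller than L/s.
    """
    tau = transit_time(channel_length_m, plasma_velocity_m_per_s)
    if t_pw_s * short_margin <= tau:
        return "short"          # t_pw << L/s
    if t_pw_s > tau:
        return "long"           # t_pw > L/s
    return "intermediate"       # between the two regimes named in the abstract


if __name__ == "__main__":
    # Hypothetical device values chosen only to exercise the function.
    L = 100e-9   # gated channel length in metres (assumed)
    s = 1e6      # plasma velocity in metres per second (assumed)
    for t_pw in (1e-15, 5e-14, 1e-10):
        print(f"t_pw = {t_pw:.0e} s -> {pulse_mode(t_pw, L, s)} pulse mode")
```

With these assumed values L/s is about 10^-13 s, so the two ends of the 10^-15 s to 10^-10 s sweep fall in the short and long pulse modes respectively.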