Researcher profile

Shengyuan Liu

Shengyuan Liu contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 17 - Unverified | Verification L1 | Unclaimed author
4 works
0 followers
7 topics
4 close collaborators

Published work

4 published items

preprint | 2026 | arXiv

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.
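The abstract's point that models with similar pass rates can diverge in overall completion can be illustrated with a small sketch. The abstract does not state the pass rule, so the threshold `PASS_THRESHOLD`, the per-task scores, and the function names below are hypothetical and chosen only for illustration; this is not the benchmark's actual grader.

```python
from statistics import mean

# Hypothetical pass threshold: the abstract does not specify the shared pass rule.
PASS_THRESHOLD = 0.8


def pass_rate(scores, threshold=PASS_THRESHOLD):
    """Fraction of tasks whose completion score meets the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


def mean_completion(scores):
    """Average completion score across all tasks, passed or not."""
    return mean(scores)


# Invented per-task completion scores for two imaginary models.
model_a = [1.0, 1.0, 0.0, 0.0]  # passes two tasks, makes no progress on the rest
model_b = [0.9, 0.9, 0.7, 0.7]  # passes two tasks, nearly completes the rest

for name, scores in (("model_a", model_a), ("model_b", model_b)):
    print(name, f"pass_rate={pass_rate(scores):.2f}",
          f"mean_completion={mean_completion(scores):.2f}")
```

Under these assumed numbers both models share the same pass rate, yet their mean completion differs, which is why a single leaderboard rank can hide meaningful differences.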

preprint | 2026 | arXiv

Unveiling and Bridging the Functional Perception Gap in MLLMs: Atomic Visual Alignment and Hierarchical Evaluation via PET-Bench

While Multimodal Large Language Models (MLLMs) have demonstrated remarkable proficiency in tasks such as abnormality detection and report generation for anatomical modalities, their capability in functional imaging remains largely unexplored. In this work, we identify and quantify a fundamental functional perception gap: the inability of current vision encoders to decode functional tracer biodistribution independent of morphological priors. Identifying Positron Emission Tomography (PET) as the quintessential modality to investigate this disconnect, we introduce PET-Bench, the first large-scale functional imaging benchmark comprising 52,308 hierarchical QA pairs from 9,732 multi-site, multi-tracer PET studies. Extensive evaluation of 19 state-of-the-art MLLMs reveals a critical safety hazard termed the Chain-of-Thought (CoT) hallucination trap. We observe that standard CoT prompting, widely considered to enhance reasoning, paradoxically decouples linguistic generation from visual evidence in PET, producing clinically fluent but factually ungrounded diagnoses. To resolve this, we propose Atomic Visual Alignment (AVA), a simple fine-tuning strategy that enforces the mastery of low-level functional perception prior to high-level diagnostic reasoning. Our results demonstrate that AVA effectively bridges the perception gap, transforming CoT from a source of hallucination into a robust inference tool and improving diagnostic accuracy by up to 14.83%. Code and data are available at https://github.com/yezanting/PET-Bench.

preprint | 2022 | arXiv

ALMA Survey of Orion Planck Galactic Cold Clumps (ALMASOP): How do dense core properties affect the multiplicity of protostars?

During the transition phase from a prestellar to a protostellar cloud core, one or several protostars can form within a single gas core. The detailed physical processes of this transition, however, still remain unclear. We present 1.3 mm dust continuum and molecular line observations with the Atacama Large Millimeter/submillimeter Array (ALMA) toward 43 protostellar cores in the Orion Molecular Cloud Complex ($λ$ Orionis, Orion B, and Orion A) with an angular resolution of $\sim$ 0.35" ($\sim$ 140 au). In total, we detect 13 binary/multiple systems. We derive an overall multiplicity frequency (MF) of 28$\%$ $\pm$ 4$\%$ and a companion star fraction (CSF) of 51$\%$ $\pm$ 6$\%$, over a separation range of 300-8900 au. The median separation of companions is about 2100 au. The occurrence of stellar multiplicity may depend on the physical characteristics of the dense cores. Notably, those containing binary/multiple systems tend to show higher gas density and Mach number than cores forming single stars. The integral-shaped filament (ISF) of Orion A giant molecular cloud (GMC), which has the highest gas density and hosts high-mass star formation in its central region (the Orion Nebula cluster), shows the highest MF and CSF among the Orion GMCs. In contrast, the $λ$ Orionis Giant Molecular Cloud (GMC) has a lower MF and CSF than the Orion B and Orion A GMCs, indicating that feedback from HII regions may suppress the formation of multiple systems. We also find that the protostars comprising a binary/multiple system are usually at different evolutionary stages.
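The abstract reports a multiplicity frequency (MF) and a companion star fraction (CSF) without defining them. A minimal sketch follows, assuming the definitions commonly used in the multiplicity literature: MF = (B + T + Q) / (S + B + T + Q) and CSF = (B + 2T + 3Q) / (S + B + T + Q), where S, B, T and Q are the numbers of single, binary, triple and quadruple systems. The abstract does not give the paper's exact system counts, so the counts in the example are hypothetical and do not reproduce the reported 28% and 51%.

```python
def multiplicity_stats(singles, binaries, triples, quadruples=0):
    """Return (MF, CSF) from counts of systems grouped by number of members.

    Assumes the conventional definitions:
      MF  = (B + T + Q) / (S + B + T + Q)
      CSF = (B + 2T + 3Q) / (S + B + T + Q)
    """
    systems = singles + binaries + triples + quadruples
    if systems == 0:
        raise ValueError("at least one system is required")
    mf = (binaries + triples + quadruples) / systems
    csf = (binaries + 2 * triples + 3 * quadruples) / systems
    return mf, csf


# Hypothetical counts for illustration only, not the survey's data.
mf, csf = multiplicity_stats(singles=30, binaries=8, triples=2)
print(f"MF = {mf:.1%}, CSF = {csf:.1%}")
```

CSF counts every companion, so it is never smaller than MF and exceeds it as soon as any system has more than two members.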

preprint | 2022 | arXiv

Build Smart Grids on Artificial Intelligence -- A Real-world Example

Power grid data are growing rapidly with the deployment of various sensors. The big data in power grids creates huge opportunities for applying artificial intelligence technologies to improve resilience and reliability. This paper introduces multiple real-world applications based on artificial intelligence to improve power grid situational awareness and resilience. These applications include event identification, inertia estimation, event location and magnitude estimation, data authentication, control, and stability assessment. These applications operate on a real-world system called FNET-GridEye, which is a wide-area measurement network and arguably the world's largest cyber-physical system that collects power grid big data. These applications showed much better performance than conventional approaches and accomplished new tasks that are impossible to realize using conventional technologies. These encouraging results demonstrate that combining power grid big data and artificial intelligence can uncover and capture the non-linear correlation between power grid data and its stability indices, and will potentially enable many advanced applications that can significantly improve power grid resilience.