Source author record

Siyuan Liu

Siyuan Liu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Artificial Intelligence Machine Learning Computer Vision Databases eess.IV eess.SY Systems and Control Biomolecules Computational Engineering, Finance, and Science cond-mat.mes-hall Cryptography and Security cs.CY physics.chem-ph physics.comp-ph

Catalog footprint

What is connected

17works

14topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Bridging Quantum Mechanics to Organic Liquid Properties via a Universal Force Field

Molecular dynamics (MD) simulations are essential tools for unraveling atomistic insights into the structure and dynamics of condensed-phase systems. However, the universal and accurate prediction of macroscopic properties from ab initio calculations remains a significant challenge, often hindered by the trade-off between computational cost and simulation accuracy. Here, we present ByteFF-Pol, a graph neural network (GNN)-parameterized polarizable force field, trained exclusively on high-level quantum mechanics (QM) data. Leveraging physically-motivated force field forms and training strategies, ByteFF-Pol exhibits exceptional performance in predicting thermodynamic and transport properties for a wide range of small-molecule liquids and electrolytes, outperforming state-of-the-art (SOTA) classical and machine learning force fields. The zero-shot prediction capability of ByteFF-Pol bridges the gap between microscopic QM calculations and macroscopic liquid properties, enabling the exploration of previously intractable chemical spaces. This advancement holds transformative potential for applications such as electrolyte design and custom-tailored solvent, representing a pivotal step toward data-driven materials discovery.

preprint2026arXiv

Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

Data selection during supervised fine-tuning (SFT) can critically change the behavior of large language models (LLMs). Although existing work has studied the effect of selecting data based on heuristics such as perplexity, difficulty, or length, the reported findings are often inconsistent or context-dependent. In this work, we systematically study the role of data difficulty in fine-tuning from both empirical and theoretical perspectives, and find that there is no universally optimal difficulty level; rather, its effectiveness depends on the dataset size. We show that for a fixed data budget, there exists an optimal data difficulty for SFT, and that this optimal difficulty shifts toward harder data as the data budget increases. To explain this phenomenon, we conduct controlled synthetic experiments that reveal a simple underlying mechanism: the interplay between the (in-distribution) generalization gap and the extrapolation gap. We further support this mechanism through a theoretical analysis using PAC-Bayesian generalization bounds. Overall, our results clarify how data size and difficulty jointly affect the trade-off between generalization and extrapolation in SFT, providing guidance for difficulty-based data selection under certain model and data conditions.

preprint2026arXiv

SimRPD: Optimizing Recruitment Proactive Dialogue Agents through Simulator-Based Data Evaluation and Selection

Task-oriented proactive dialogue agents play a pivotal role in recruitment, particularly for steering conversations towards specific business outcomes, such as acquiring social-media contacts for private-channel conversion. Although supervised fine-tuning and reinforcement learning have proven effective for training such agents, their performance is heavily constrained by the scarcity of high-quality, goal-oriented domain-specific training data. To address this challenge, we propose SimRPD, a three-stage framework for training recruitment proactive dialogue agents. First, we develop a high-fidelity user simulator to synthesize large-scale conversational data through multi-turn online dialogue. Then we introduce a multi-dimensional evaluation framework based on Chain-of-Intention (CoI) to comprehensively assess the simulator and effectively select high-quality data, incorporating both global-level and instance-level metrics. Finally, we train the recruitment proactive dialogue agent on the selected dataset. Experiments in a real-world recruitment scenario demonstrate that SimRPD outperforms existing simulator-based data selection strategies, highlighting its practical value for industrial deployment and its potential applicability to other business-oriented dialogue scenarios.

preprint2026arXiv

SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation

Deploying massive diffusion models for real-time, infinite-duration, audio-driven avatar generation presents a significant engineering challenge, primarily due to the conflict between computational load and strict latency constraints. Existing approaches often compromise visual fidelity by enforcing strictly unidirectional attention mechanisms or reducing model capacity. To address this problem, we introduce \textbf{SoulX-FlashTalk}, a 14B-parameter framework optimized for high-fidelity real-time streaming. Diverging from conventional unidirectional paradigms, we use a \textbf{Self-correcting Bidirectional Distillation} strategy that retains bidirectional attention within video chunks. This design preserves critical spatiotemporal correlations, significantly enhancing motion coherence and visual detail. To ensure stability during infinite generation, we incorporate a \textbf{Multi-step Retrospective Self-Correction Mechanism}, enabling the model to autonomously recover from accumulated errors and preventing collapse. Furthermore, we engineered a full-stack inference acceleration suite incorporating hybrid sequence parallelism, Parallel VAE, and kernel-level optimizations. Extensive evaluations confirm that SoulX-FlashTalk is the first 14B-scale system to achieve a \textbf{sub-second start-up latency (0.87s)} while reaching a real-time throughput of \textbf{32 FPS}, setting a new standard for high-fidelity interactive digital human synthesis.

preprint2026arXiv

What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding

Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks, yet their ability to generalize across varying environments remains a under-examined concern. Current evaluation paradigms predominantly rely on trajectory-based metrics that measure task success, while failing to assess whether agents possess a grounded, transferable model of the environment. To address this gap, we propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding. We instantiate this paradigm in T2QBench, a suite comprising 30 environments and 1,967 grounded QA pairs across multiple difficulty levels. Our extensive experiments reveal that task success is often a poor proxy for environment understanding, and that current memory machanism can not effectively help agents acquire a grounded model of the environment. These findings identify proactive exploration and fine-grained state representation as primary bottlenecks, offering a robust foundation for developing more generalizable autonomous agents.

preprint2023arXiv

Brain Tissue Segmentation Across the Human Lifespan via Supervised Contrastive Learning

Automatic segmentation of brain MR images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF) is critical for tissue volumetric analysis and cortical surface reconstruction. Due to dramatic structural and appearance changes associated with developmental and aging processes, existing brain tissue segmentation methods are only viable for specific age groups. Consequently, methods developed for one age group may fail for another. In this paper, we make the first attempt to segment brain tissues across the entire human lifespan (0-100 years of age) using a unified deep learning model. To overcome the challenges related to structural variability underpinned by biological processes, intensity inhomogeneity, motion artifacts, scanner-induced differences, and acquisition protocols, we propose to use contrastive learning to improve the quality of feature representations in a latent space for effective lifespan tissue segmentation. We compared our approach with commonly used segmentation methods on a large-scale dataset of 2,464 MR images. Experimental results show that our model accurately segments brain tissues across the lifespan and outperforms existing methods.

preprint2022arXiv

Improving the estimation of directional area scattering factor (DASF) from canopy reflectance: theoretical basis and validation

Directional area scattering factor (DASF) is a critical canopy structural parameter for vegetation monitoring. It provides an efficient tool for decoupling of canopy structure and leaf optics from canopy reflectance. Current standard approach to estimate DASF from canopy bidirectional reflectance factor (BRF) is based on the assumption that in the weakly absorbing 710 to 790 nm spectral interval, leaf scattering does not change much with the concentration of dry matter and thus its variation can be neglected. This results in biased estimates of DASF and consequently leads to uncertainty in DASF-related applications. This study proposes a new approach to account for variations in concentrations of this biochemical constituent, which additionally uses the canopy BRF at 2260 nm. In silico analysis of the proposed approach suggests significant increase in accuracy over the standard technique by a relative root mean square error (rRMSE) of 49% and 34% for one- and three dimensional scenes, respectively. When compared with indoor multi-angular hyperspectral measurements reported in literature, the mean absolute error has reduced by 68% for needle leaf and 20% for broadleaf canopies. Thus, the proposed DASF estimation approach outperforms the current one and can be used more reliably in DASF-related applications, such as vegetation monitoring of functional traits, dynamics, and radiation budget.

preprint2022arXiv

On Understanding the Influence of Controllable Factors with a Feature Attribution Algorithm: a Medical Case Study

Feature attribution XAI algorithms enable their users to gain insight into the underlying patterns of large datasets through their feature importance calculation. Existing feature attribution algorithms treat all features in a dataset homogeneously, which may lead to misinterpretation of consequences of changing feature values. In this work, we consider partitioning features into controllable and uncontrollable parts and propose the Controllable fActor Feature Attribution (CAFA) approach to compute the relative importance of controllable features. We carried out experiments applying CAFA to two existing datasets and our own COVID-19 non-pharmaceutical control measures dataset. Experimental results show that with CAFA, we are able to exclude influences from uncontrollable features in our explanation while keeping the full dataset for prediction.

preprint2022arXiv

Secure-by-Construction Synthesis of Cyber-Physical Systems

Correct-by-construction synthesis is a cornerstone of the confluence of formal methods and control theory towards designing safety-critical systems. Instead of following the time-tested, albeit laborious (re)design-verify-validate loop, correct-by-construction methodology advocates the use of continual refinements of formal requirements -- connected by chains of formal proofs -- to build a system that assures the correctness by design. A remarkable progress has been made in scaling the scope of applicability of correct-by-construction synthesis -- with a focus on cyber-physical systems that tie discrete-event control with continuous environment -- to enlarge control systems by combining symbolic approaches with principled state-space reduction techniques. Unfortunately, in the security-critical control systems, the security properties are verified ex post facto the design process in a way that undermines the correct-by-construction paradigm. We posit that, to truly realize the dream of correct-by-construction synthesis for security-critical systems, security considerations must take center-stage with the safety considerations. Moreover, catalyzed by the recent progress on the opacity sub-classes of security properties and the notion of hyperproperties capable of combining security with safety properties, we believe that the time is ripe for the research community to holistically target the challenge of secure-by-construction synthesis. This paper details our vision by highlighting the recent progress and open challenges that may serve as bricks for providing a solid foundation for secure-by-construction synthesis of cyber-physical systems.

preprint2022arXiv

Strain-driven valley states and phase transitions in Janus VSiGeN4 monolayer

The interplay between topology and valley degree of freedom has attracted much interest because it can realize new phenomena and applications. Here, based on first-principles calculations, we demonstrate intrinsically valley-polarized quantum anomalous Hall effect in monolayer ferrovalley material: Janus VSiGeN4, of which the edge states are chiral-spin-valley locking. Besides, a small tensile or compressive strain can drive phase transition in the material from valley-polarized quantum anomalous Hall state to half-valley-metal state. With the increase of the strain, the material turns into ferrovalley semiconductor with valley anomalous Hall effect. The origin of phase transition is sequent band inversion of V d orbital at K valley. Moreover, we find that phase transition causes the sign reversal of Berry curvature and induces different polarized light absorption in different valley states. Our work provides an ideal material platform for practical applications and experimental exploration of the interplay between topology, spintronics, and valleytronics.

preprint2021arXiv

Compositional synthesis of almost maximally permissible safety controllers

In this work, we present a compositional safety controller synthesis approach for the class of discrete-time linear control systems. Here, we leverage a state-of-the-art result on the computation of robust controlled invariant sets. To tackle the complexity of controller synthesis over complex interconnected systems, this paper introduces a decentralized controller synthesis scheme. Rather than treating the interconnected system as a whole, we first design local safety controllers for each subsystem separately to enforce local safety properties, with polytopic state and input constraints as well as bounded disturbance set. Then, by composing the local controllers, the interconnected system is guaranteed to satisfy the overall safety specification. Finally, we provide a vehicular platooning example to illustrate the effectiveness of the proposed approach by solving the overall safety controller synthesis problem by computing less complex local safety controllers for subsystems and then composing them.

preprint2021arXiv

Exploring the Regulatory Function of the N-terminal Domain of SARS-CoV-2 Spike Protein Through Molecular Dynamics Simulation

SARS-CoV-2 is what has caused the COVID-19 pandemic. Early viral infection is mediated by the SARS-CoV-2 homo-trimeric Spike (S) protein with its receptor binding domains (RBDs) in the receptor-accessible state. We performed molecular dynamics simulation on the S protein with a focus on the function of its N-terminal domains (NTDs). Our study reveals that the NTD acts as a "wedge" and plays a crucial regulatory role in the conformational changes of the S protein. The complete RBD structural transition is allowed only when the neighboring NTD that typically prohibits the RBD's movements as a wedge detaches and swings away. Based on this NTD "wedge" model, we propose that the NTD-RBD interface should be a potential drug target.

preprint2021arXiv

Towards Enhancing Database Education: Natural Language Generation Meets Query Execution Plans

The database systems course is offered as part of an undergraduate computer science degree program in many major universities. A key learning goal of learners taking such a course is to understand how SQL queries are processed in a RDBMS in practice. Since a query execution plan (QEP) describes the execution steps of a query, learners can acquire the understanding by perusing the QEPs generated by a RDBMS. Unfortunately, in practice, it is often daunting for a learner to comprehend these QEPs containing vendor-specific implementation details, hindering her learning process. In this paper, we present a novel, end-to-end, generic system called lantern that generates a natural language description of a qep to facilitate understanding of the query execution steps. It takes as input an SQL query and its QEP, and generates a natural language description of the execution strategy deployed by the underlying RDBMS. Specifically, it deploys a declarative framework called pool that enables subject matter experts to efficiently create and maintain natural language descriptions of physical operators used in QEPs. A rule-based framework called RULE-LANTERN is proposed that exploits pool to generate natural language descriptions of QEPs. Despite the high accuracy of RULE-LANTERN, our engagement with learners reveal that, consistent with existing psychology theories, perusing such rule-based descriptions lead to boredom due to repetitive statements across different QEPs. To address this issue, we present a novel deep learning-based language generation framework called NEURAL-LANTERN that infuses language variability in the generated description by exploiting a set of paraphrasing tools and word embedding. Our experimental study with real learners shows the effectiveness of lantern in facilitating comprehension of QEPs.

preprint2020arXiv

An Investigation of COVID-19 Spreading Factors with Explainable AI Techniques

Since COVID-19 was first identified in December 2019, various public health interventions have been implemented across the world. As different measures are implemented at different countries at different times, we conduct an assessment of the relative effectiveness of the measures implemented in 18 countries and regions using data from 22/01/2020 to 02/04/2020. We compute the top one and two measures that are most effective for the countries and regions studied during the period. Two Explainable AI techniques, SHAP and ECPI, are used in our study; such that we construct (machine learning) models for predicting the instantaneous reproduction number ($R_t$) and use the models as surrogates to the real world and inputs that the greatest influence to our models are seen as measures that are most effective. Across-the-board, city lockdown and contact tracing are the two most effective measures. For ensuring $R_t<1$, public wearing face masks is also important. Mass testing alone is not the most effective measure although when paired with other measures, it can be effective. Warm temperature helps for reducing the transmission.

preprint2020arXiv

Explainable AI for Classification using Probabilistic Logic Inference

The overarching goal of Explainable AI is to develop systems that not only exhibit intelligent behaviours, but also are able to explain their rationale and reveal insights. In explainable machine learning, methods that produce a high level of prediction accuracy as well as transparent explanations are valuable. In this work, we present an explainable classification method. Our method works by first constructing a symbolic Knowledge Base from the training data, and then performing probabilistic inferences on such Knowledge Base with linear programming. Our approach achieves a level of learning performance comparable to that of traditional classifiers such as random forests, support vector machines and neural networks. It identifies decisive features that are responsible for a classification as explanations and produces results similar to the ones found by SHAP, a state of the art Shapley Value based method. Our algorithms perform well on a range of synthetic and non-synthetic data sets.

preprint2020arXiv

Real-Time Quality Assessment of Pediatric MRI via Semi-Supervised Deep Nonlocal Residual Neural Networks

In this paper, we introduce an image quality assessment (IQA) method for pediatric T1- and T2-weighted MR images. IQA is first performed slice-wise using a nonlocal residual neural network (NR-Net) and then volume-wise by agglomerating the slice QA results using random forest. Our method requires only a small amount of quality-annotated images for training and is designed to be robust to annotation noise that might occur due to rater errors and the inevitable mix of good and bad slices in an image volume. Using a small set of quality-assessed images, we pre-train NR-Net to annotate each image slice with an initial quality rating (i.e., pass, questionable, fail), which we then refine by semi-supervised learning and iterative self-training. Experimental results demonstrate that our method, trained using only samples of modest size, exhibit great generalizability, capable of real-time (milliseconds per volume) large-scale IQA with near-perfect accuracy.

preprint2014arXiv

ALID: Scalable Dominant Cluster Detection

Detecting dominant clusters is important in many analytic applications. The state-of-the-art methods find dense subgraphs on the affinity graph as the dominant clusters. However, the time and space complexity of those methods are dominated by the construction of the affinity graph, which is quadratic with respect to the number of data points, and thus impractical on large data sets. To tackle the challenge, in this paper, we apply Evolutionary Game Theory (EGT) and develop a scalable algorithm, Approximate Localized Infection Immunization Dynamics (ALID). The major idea is to perform Localized Infection Immunization Dynamics (LID) to find dense subgraph within local range of the affinity graph. LID is further scaled up with guaranteed high efficiency and detection quality by an estimated Region of Interest (ROI) and a carefully designed Candidate Infective Vertex Search method (CIVS). ALID only constructs small local affinity graphs and has a time complexity of O(C(a^*+ δ)n) and a space complexity of O(a^*(a^*+ δ)), where a^* is the size of the largest dominant cluster and C << n and δ << n are small constants. We demonstrate by extensive experiments on both synthetic data and real world data that ALID achieves state-of-the-art detection quality with much lower time and space cost on single machine. We also demonstrate the encouraging parallelization performance of ALID by implementing the Parallel ALID (PALID) on Apache Spark. PALID processes 50 million SIFT data points in 2.29 hours, achieving a speedup ratio of 7.51 with 8 executors.

Siyuan Liu

What is connected

Connect this record

See the researcher in context

Building this map preview

17 published item(s)

Bridging Quantum Mechanics to Organic Liquid Properties via a Universal Force Field

Data Difficulty and the Generalization--Extrapolation Tradeoff in LLM Fine-Tuning

SimRPD: Optimizing Recruitment Proactive Dialogue Agents through Simulator-Based Data Evaluation and Selection

SoulX-FlashTalk: Real-Time Infinite Streaming of Audio-Driven Avatars via Self-Correcting Bidirectional Distillation

What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding

Brain Tissue Segmentation Across the Human Lifespan via Supervised Contrastive Learning

Improving the estimation of directional area scattering factor (DASF) from canopy reflectance: theoretical basis and validation

On Understanding the Influence of Controllable Factors with a Feature Attribution Algorithm: a Medical Case Study

Secure-by-Construction Synthesis of Cyber-Physical Systems

Strain-driven valley states and phase transitions in Janus VSiGeN4 monolayer

Compositional synthesis of almost maximally permissible safety controllers

Exploring the Regulatory Function of the N-terminal Domain of SARS-CoV-2 Spike Protein Through Molecular Dynamics Simulation

Towards Enhancing Database Education: Natural Language Generation Meets Query Execution Plans

An Investigation of COVID-19 Spreading Factors with Explainable AI Techniques

Explainable AI for Classification using Probabilistic Logic Inference

Real-Time Quality Assessment of Pediatric MRI via Semi-Supervised Deep Nonlocal Residual Neural Networks

ALID: Scalable Dominant Cluster Detection