Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
17works
0followers
16topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

17 published item(s)

preprint2026arXiv

Inverse Knowledge Search over Verifiable Reasoning: Synthesizing a Scientific Encyclopedia from a Long Chains-of-Thought Knowledge Base

Most scientific materials compress reasoning, presenting conclusions while omitting the derivational chains that justify them. This compression hinders verification by lacking explicit, step-wise justifications and inhibits cross-domain links by collapsing the very pathways that establish the logical and causal connections between concepts. We introduce a scalable framework that decompresses scientific reasoning, constructing a verifiable Long Chain-of-Thought (LCoT) knowledge base and projecting it into an emergent encyclopedia, SciencePedia. Our pipeline operationalizes an endpoint-driven, reductionist strategy: a Socratic agent, guided by a curriculum of around 200 courses, generates approximately 3 million first-principles questions. To ensure high fidelity, multiple independent solver models generate LCoTs, which are then rigorously filtered by prompt sanitization and cross-model answer consensus, retaining only those with verifiable endpoints. This verified corpus powers the Brainstorm Search Engine, which performs inverse knowledge search -- retrieving diverse, first-principles derivations that culminate in a target concept. This engine, in turn, feeds the Plato synthesizer, which narrates these verified chains into coherent articles. The initial SciencePedia comprises approximately 200,000 fine-grained entries spanning mathematics, physics, chemistry, biology, engineering, and computation. In evaluations across six disciplines, Plato-synthesized articles (conditioned on retrieved LCoTs) exhibit substantially higher knowledge-point density and significantly lower factual error rates than an equally-prompted baseline without retrieval (as judged by an external LLM). Built on this verifiable LCoT knowledge base, this reasoning-centric approach enables trustworthy, cross-domain scientific synthesis at scale and establishes the foundation for an ever-expanding encyclopedia.

preprint2026arXiv

Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

With the continuous advancement of reasoning abilities in Large Language Models (LLMs), their application to scientific reasoning tasks has gained significant research attention. Current research primarily emphasizes boosting LLMs' performance on scientific QA benchmarks by training on larger, more comprehensive datasets with extended reasoning chains. However, these approaches neglect the essence of the scientific reasoning process -- logicality, which is the rational foundation to ensure the validity of reasoning steps leading to reliable conclusions. In this work, we make the first systematic investigation into the internal logicality underlying LLM scientific reasoning, and develop a scientific logicality-enriched methodology, including a set of assessment criteria and data sampling methods for logicality-guided training, to improve the logical faithfulness as well as task performance. Further, we take physics, characterized by its diverse logical structures and formalisms, as an exemplar discipline to practise the above methodology. For data construction, we extract scientific problems from academic literature and sample a high-quality dataset exhibiting strong logicality. Experiments based on three different backbone LLMs reveal that: 1) the training data we constructed can effectively improve the scientific logicality in LLM reasoning; and 2) the enriched scientific logicality plays a critical role in solving scientific problems. Code is available at \href{https://github.com/ScienceOne-AI/PhysLogic}{https://github.com/ScienceOne-AI/PhysLogic}.

preprint2026arXiv

Structural Dilemmas and Developmental Pathways of Legal Argument Mining in the Era of Artificial Intelligence

Against the backdrop of rapid advances in artificial intelligence, legal argument mining has emerged as an important research area linking legal texts with intelligent analysis, carrying significant theoretical and practical implications. Existing studies have primarily developed along three dimensions: data, technology, and theory. At the data level, raw legal texts and annotated corpora constitute the foundational resources. At the technological level, research paradigms have evolved from rule-based systems and traditional machine learning to large language models (LLMs). At the theoretical level, argumentation theory and legal dogmatics provide important references for modeling argumentation structures. However, despite ongoing progress, the overall development of legal argument mining remains relatively slow. Building on a systematic review of existing research, this study conducts an in-depth analysis and finds that this is due not only to data scarcity or technical limitations, but more fundamentally to the lack of a structured representational approach that reconciles theoretical expressiveness with computational feasibility. Specifically, this challenge manifests in dilemmas in data standardization, obstacles to effective modeling, and limitations in domain adaptation. In response, the study proposes several key directions for future research. It aims to provide a reframing of key problems and a pathway for future development in legal argument mining, while leaving specific models and implementation schemes for further investigation.

preprint2026arXiv

Superconductivity in Electron Liquids: Precision Many-Body Treatment of Coulomb Interaction

More than a century after discovery, the theory of conventional superconductivity remains incomplete. While the importance of electron-phonon coupling is understood, a controlled first-principles treatment of Coulomb interaction is lacking. Current ab initio calculations of superconductivity rely on a phenomenological downfolding approximation, replacing Coulomb interaction with a repulsive pseudopotential μ*, and leaving ambiguities in electron-phonon coupling with dynamical Coulomb interactions unresolved. We address this via an effective field theory approach, integrating out high-energy electronic degrees of freedom using variational Diagrammatic Monte Carlo. Applied to the uniform electron gas, this establishes a microscopic procedure to implement downfolding, define the pseudopotential, and express dynamical Coulomb effects on electron-phonon coupling via the electron vertex function. We find the bare pseudopotential significantly larger than conventional values. This yields improved pseudopotential estimates in simple metals and tests density functional perturbation theory accuracy for effective electron-phonon coupling. We present an ab initio workflow computing superconducting Tc from the anomalous vertex's precursory Cooper flow. This infers Tc from normal state calculations, enabling reliable estimates of very low Tc (including near quantum phase transitions) beyond conventional reach. Validating our approach on simple metals without empirical tuning, we resolve long-standing discrepancies and predict a pressure-induced transition in Al from superconducting to non-superconducting above ~60GPa. We propose ambient-pressure Mg and Na are proximal to a similar critical point. Our work establishes a controlled ab initio framework for electron-phonon superconductivity beyond the weak-correlation limit, paving the way for reliable Tc calculations and novel material design.

preprint2025arXiv

How Would Oblivious Memory Boost Graph Analytics on Trusted Processors?

Trusted processors provide a way to perform joint computations while preserving data privacy. To overcome the performance degradation caused by data-oblivious algorithms to prevent information leakage, we explore the benefits of oblivious memory (OM) integrated in processors, to which the accesses are unobservable by adversaries. We focus on graph analytics, an important application vulnerable to access-pattern attacks. With a co-design between storage structure and algorithms, our prototype system is 100x faster than baselines given an OM sized around the per-core cache which can be implemented on existing processors with negligible overhead. This gives insights into equipping trusted processors with OM.

preprint2022arXiv

A Scalable Actor-based Programming System for PGAS Runtimes

The PGAS model is well suited for executing irregular applications on cluster-based systems, due to its efficient support for short, one-sided messages. However, there are currently two major limitations faced by PGAS applications. The first relates to scalability: despite the availability of APIs that support non-blocking operations in special cases, many PGAS operations on remote locations are synchronous by default, which can lead to long-latency stalls and poor scalability. The second relates to productivity: while it is simpler for the developer to express all communications at a fine-grained granularity that is natural to the application, experiments have shown that such a natural expression results in performance that is 20x slower than more efficient but less productive code that requires manual message aggregation and termination detection. In this paper, we introduce a new programming system for PGAS applications, in which point-to-point remote operations can be expressed as fine-grained asynchronous actor messages. In this approach, the programmer does not need to worry about programming complexities related to message aggregation and termination detection. Our approach can also be viewed as extending the classical Bulk Synchronous Parallelism model with fine-grained asynchronous communications within a phase or superstep. We believe that our approach offers a desirable point in the productivity-performance space for PGAS applications, with more scalable performance and higher productivity relative to past approaches. Specifically, for seven irregular mini-applications from the Bale benchmark suite executed using 2048 cores in the NERSC Cori system, our approach shows geometric mean performance improvements of >=20x relative to standard PGAS versions (UPC and OpenSHMEM) while maintaining comparable productivity to those versions.

preprint2022arXiv

Collaboration Equilibrium in Federated Learning

Federated learning (FL) refers to the paradigm of learning models over a collaborative research network involving multiple clients without sacrificing privacy. Recently, there have been rising concerns on the distributional discrepancies across different clients, which could even cause counterproductive consequences when collaborating with others. While it is not necessarily that collaborating with all clients will achieve the best performance, in this paper, we study a rational collaboration called ``collaboration equilibrium'' (CE), where smaller collaboration coalitions are formed. Each client collaborates with certain members who maximally improve the model learning and isolates the others who make little contribution. We propose the concept of benefit graph which describes how each client can benefit from collaborating with other clients and advance a Pareto optimization approach to identify the optimal collaborators. Then we theoretically prove that we can reach a CE from the benefit graph through an iterative graph operation. Our framework provides a new way of setting up collaborations in a research network. Experiments on both synthetic and real world data sets are provided to demonstrate the effectiveness of our method.

preprint2022arXiv

Global Convergence Analysis of Deep Linear Networks with A One-neuron Layer

In this paper, we follow Eftekhari's work to give a non-local convergence analysis of deep linear networks. Specifically, we consider optimizing deep linear networks which have a layer with one neuron under quadratic loss. We describe the convergent point of trajectories with arbitrary starting point under gradient flow, including the paths which converge to one of the saddle points or the original point. We also show specific convergence rates of trajectories that converge to the global minimizer by stages. To achieve these results, this paper mainly extends the machinery in Eftekhari's work to provably identify the rank-stable set and the global minimizer convergent set. We also give specific examples to show the necessity of our definitions. Crucially, as far as we know, our results appear to be the first to give a non-local global analysis of linear neural networks from arbitrary initialized points, rather than the lazy training regime which has dominated the literature of neural networks, and restricted benign initialization in Eftekhari's work. We also note that extending our results to general linear networks without one hidden neuron assumption remains a challenging open problem.

preprint2022arXiv

Principal Amalgamation Analysis for Microbiome Data

In recent years microbiome studies have become increasingly prevalent and large-scale. Through high-throughput sequencing technologies and well-established analytical pipelines, relative abundance data of operational taxonomic units and their associated taxonomic structures are routinely produced. Since such data can be extremely sparse and high dimensional, there is often a genuine need for dimension reduction to facilitate data visualization and downstream statistical analysis. We propose Principal Amalgamation Analysis (PAA), a novel amalgamation-based and taxonomy-guided dimension reduction paradigm for microbiome data. Our approach aims to aggregate the compositions into a smaller number of principal compositions, guided by the available taxonomic structure, by minimizing a properly measured loss of information. The choice of the loss function is flexible and can be based on familiar diversity indices for preserving either within-sample or between-sample diversity in the data. To enable scalable computation, we develop a hierarchical PAA algorithm to trace the entire trajectory of successive simple amalgamations. Visualization tools including dendrogram, scree plot, and ordination plot are developed. The effectiveness of PAA is demonstrated using gut microbiome data from a preterm infant study and an HIV infection study.

preprint2022arXiv

Superconductivity in the Uniform Electron Gas: Irrelevance of Kohn-Luttinger Mechanism

We study the Cooper instability in jellium model in the controlled regime of small to intermediate values of the Coulomb parameter $r_s \leq 2$. We confirm that superconductivity naturally emerges from purely repulsive interactions described by the Kukkonen-Overhauser vertex function. By employing the implicit renormalization approach we reveal that even in the small-$r_s$ limit, the dominant mechanism behind Cooper instability is based on dynamic screening of the Coulomb interaction--accurately captured by the random phase approximation, whereas the Kohn-Luttinger contribution is negligibly small and, thus, not relevant.

preprint2022arXiv

Where to find lossless metals?

Hypothetical metals having optical absorption losses as low as those of the transparent insulators, if found, could revolutionize optoelectronics. We perform the first high-throughput search for lossless metals among all known inorganic materials in the databases of over 100,000 entries. The 381 candidates are identified -- having well-isolated partially-filled bands -- and are analyzed by defining the figures of merit and classifying their real-space conductive connectivity. The existing experimental evidence of most candidates being insulating, instead of conducting, is due to the limitation of current density functional theory in predicting narrow-band metals that are unstable against magnetism, structural distortion, or electron-electron interactions. We propose future research directions including conductive oxides, intercalating layered materials, and compressing these false-metal candidates under high pressures into eventual lossless metals.

preprint2021arXiv

Insights into the Electron-Electron Interaction from Quantum Monte Carlo Calculations

The effective electron-electron interaction in the electron gas depends on both the density and spin local field factors. Variational Diagrammatic Quantum Monte Carlo calculations of the spin local field factor are reported and used to quantitatively present the full spin-dependent, electron-electron interaction. Together with the charge local field factor from previous Diffusion Quantum Monte Carlo calculations, we obtain the complete form of the effective electron-electron interaction in the uniform three-dimensional electron gas. Very simple quadratic formulas are presented for the local field factors that quantitatively produce all of the response functions of the electron gas at metallic densities. Exchange and correlation become increasingly important at low densities. At the compressibility divergence at rs = 5.25, both the direct (screened Coulomb) term and the charge-dependent exchange term in the electron-electron interaction at q=0 are separately divergent. However, due to large cancellations, their difference is finite, well behaved, and much smaller than either term separately. As a result, the spin contribution to the electron-electron interaction becomes an important factor. The static electron-electron interaction is repulsive as a function of density but is less repulsive for electrons with parallel spins. The effect of allowing a deformable, rather than rigid, positive background is shown to be as quantitatively important as exchange and correlation. As a simple concrete example, the electron-electron interaction is calculated using the measured bulk modulus of the alkali metals with a linear phonon dispersion. The net electron-electron interaction in lithium is attractive for wave vectors $0-2k_F$, which suggests superconductivity, and is mostly repulsive for the other alkali metals.

preprint2021arXiv

Pursuing Sources of Heterogeneity in Modeling Clustered Population

Researchers often have to deal with heterogeneous population with mixed regression relationships, increasingly so in the era of data explosion. In such problems, when there are many candidate predictors, it is not only of interest to identify the predictors that are associated with the outcome, but also to distinguish the true sources of heterogeneity, i.e., to identify the predictors that have different effects among the clusters and thus are the true contributors to the formation of the clusters. We clarify the concepts of the source of heterogeneity that account for potential scale differences of the clusters and propose a regularized finite mixture effects regression to achieve heterogeneity pursuit and feature selection simultaneously. As the name suggests, the problem is formulated under an effects-model parameterization, in which the cluster labels are missing and the effect of each predictor on the outcome is decomposed to a common effect term and a set of cluster-specific terms. A constrained sparse estimation of these effects leads to the identification of both the variables with common effects and those with heterogeneous effects. We propose an efficient algorithm and show that our approach can achieve both estimation and selection consistency. Simulation studies further demonstrate the effectiveness of our method under various practical scenarios. Three applications are presented, namely, an imaging genetics study for linking genetic factors and brain neuroimaging traits in Alzheimer's disease, a public health study for exploring the association between suicide risk among adolescents and their school district characteristics, and a sport analytics study for understanding how the salary levels of baseball players are associated with their performance and contractual status.

preprint2020arXiv

Multivariate Functional Regression via Nested Reduced-Rank Regularization

We propose a nested reduced-rank regression (NRRR) approach in fitting regression model with multivariate functional responses and predictors, to achieve tailored dimension reduction and facilitate interpretation/visualization of the resulting functional model. Our approach is based on a two-level low-rank structure imposed on the functional regression surfaces. A global low-rank structure identifies a small set of latent principal functional responses and predictors that drives the underlying regression association. A local low-rank structure then controls the complexity and smoothness of the association between the principal functional responses and predictors. Through a basis expansion approach, the functional problem boils down to an interesting integrated matrix approximation task, where the blocks or submatrices of an integrated low-rank matrix share some common row space and/or column space. An iterative algorithm with convergence guarantee is developed. We establish the consistency of NRRR and also show through non-asymptotic analysis that it can achieve at least a comparable error rate to that of the reduced-rank regression. Simulation studies demonstrate the effectiveness of NRRR. We apply NRRR in an electricity demand problem, to relate the trajectories of the daily electricity consumption with those of the daily temperatures.

preprint2020arXiv

Multivariate Log-Contrast Regression with Sub-Compositional Predictors: Testing the Association Between Preterm Infants' Gut Microbiome and Neurobehavioral Outcomes

The so-called gut-brain axis has stimulated extensive research on microbiomes. One focus is to assess the association between certain clinical outcomes and the relative abundances of gut microbes, which can be presented as sub-compositional data in conformity with the taxonomic hierarchy of bacteria. Motivated by a study for identifying the microbes in the gut microbiome of preterm infants that impact their later neurobehavioral outcomes, we formulate a constrained integrative multi-view regression, where the neurobehavioral scores form multivariate response, the sub-compositional microbiome data form multi-view feature matrices, and a set of linear constraints on their corresponding sub-coefficient matrices ensures the conformity to the simplex geometry. To enable joint selection and inference of sub-compositions/views, we assume all the sub-coefficient matrices are possibly of low-rank, i.e., the outcomes are associated with the microbiome through different sets of latent sub-compositional factors from different taxa. We propose a scaled composite nuclear norm penalization approach for model estimation and develop a hypothesis testing procedure through de-biasing to assess the significance of different views. Simulation studies confirm the effectiveness of the proposed procedure. In the preterm infant study, the identified microbes are mostly consistent with existing studies and biological understandings. Our approach supports that stressful early life experiences imprint gut microbiome through the regulation of the gut-brain axis.

preprint2020arXiv

Statistically Guided Divide-and-Conquer for Sparse Factorization of Large Matrix

The sparse factorization of a large matrix is fundamental in modern statistical learning. In particular, the sparse singular value decomposition and its variants have been utilized in multivariate regression, factor analysis, biclustering, vector time series modeling, among others. The appeal of this factorization is owing to its power in discovering a highly-interpretable latent association network, either between samples and variables or between responses and predictors. However, many existing methods are either ad hoc without a general performance guarantee, or are computationally intensive, rendering them unsuitable for large-scale studies. We formulate the statistical problem as a sparse factor regression and tackle it with a divide-and-conquer approach. In the first stage of division, we consider both sequential and parallel approaches for simplifying the task into a set of co-sparse unit-rank estimation (CURE) problems, and establish the statistical underpinnings of these commonly-adopted and yet poorly understood deflation methods. In the second stage of division, we innovate a contended stagewise learning technique, consisting of a sequence of simple incremental updates, to efficiently trace out the whole solution paths of CURE. Our algorithm has a much lower computational complexity than alternating convex search, and the choice of the step size enables a flexible and principled tradeoff between statistical accuracy and computational efficiency. Our work is among the first to enable stagewise learning for non-convex problems, and the idea can be applicable in many multi-convex problems. Extensive simulation studies and an application in genetics demonstrate the effectiveness and scalability of our approach.

preprint2019arXiv

Quantum-to-classical correspondence in two-dimensional Heisenberg models

The quantum-to-classical correspondence (QCC) in spin models is a puzzling phenomenon where the static susceptibility of a quantum system agrees with its classical-system counterpart, at a different corresponding temperature, within the systematic error at a sub-percent level. We employ the bold diagrammatic Monte Carlo method to explore the universality of QCC by considering three different two-dimensional spin-1/2 Heisenberg models. In particular, we reveal the existence of QCC in two-parametric models.