Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
14 works
0 followers
20 topics
4 close collaborators

Published work

14 published items

preprint 2026 arXiv

Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks further exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), where an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus, and that DCI opens a broader interface-design space for agentic search.
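
To make the idea of a clue conjunction over a raw corpus concrete, here is a minimal sketch in Python. It is not the agent or tooling from the paper: the function name `search_corpus`, the `.txt` file layout, and the toy documents are all assumptions chosen for illustration. It only shows the kind of exact, grep-style filtering that a fixed top-k similarity interface does not expose.

```python
import re
import tempfile
from pathlib import Path


def search_corpus(root, required_terms):
    """Return (file name, line number, line) hits from files containing every term.

    A file qualifies only if each term occurs somewhere in it (a clue
    conjunction); the hits are the lines matching at least one of the terms.
    """
    patterns = [re.compile(re.escape(t), re.IGNORECASE) for t in required_terms]
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8")
        if not all(p.search(text) for p in patterns):
            continue  # at least one clue is missing from this file
        for lineno, line in enumerate(text.splitlines(), start=1):
            if any(p.search(line) for p in patterns):
                hits.append((path.name, lineno, line.strip()))
    return hits


def demo():
    # Hypothetical three-file corpus, written to a temporary directory
    # so the example is self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        docs = {
            "a.txt": "The bridge opened in 1932.\nIt spans the harbour.\n",
            "b.txt": "The tunnel opened in 1932.\nIt runs under the river.\n",
            "c.txt": "The bridge was repainted in 1988.\n",
        }
        for name, body in docs.items():
            (Path(tmp) / name).write_text(body, encoding="utf-8")
        return search_corpus(tmp, ["bridge", "1932"])
```

Calling `demo()` keeps only the file that satisfies both clues. An agent could then read the surrounding lines of each hit and refine its next query, which is the multi-step loop the abstract describes.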

preprint 2026 arXiv

Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including texts, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs' ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Yet it remains unclear whether such biases are systematic, which data-level factors drive them, and what internal mechanisms underlie their emergence. In this paper, we present the first comprehensive study of format bias in LLMs through a three-stage empirical analysis. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage examines how key data-level factors influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates a lightweight intervention to test its effectiveness. Our results show that format bias is consistent across model families, driven by information richness, structure quality, and representation type, and is closely associated with attention imbalance within the LLMs. Based on these investigations, we identify three future research directions to reduce format bias: enhancing data pre-processing through format repair and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.

preprint 2024 arXiv

Attractive and repulsive interactions in the one-dimensional swarmalator model

We study a population of swarmalators, mobile variants of phase oscillators, which run on a ring and have both attractive and repulsive interactions. This one-dimensional (1D) swarmalator model produces several collective states: the standard sync and async states as well as a splaylike "polarized" state and several unsteady states such as active bands or swirling. The model's simplicity allows us to describe some of the states analytically. The model can be considered as a toy model for real-world swarmalators such as vinegar eels and sperm, which swarm in quasi-1D geometries.

preprint 2023 arXiv

A Unified and Scalable Algorithm Framework of User-Defined Temporal $(k,\mathcal{X})$-Core Query

Querying cohesive subgraphs on temporal graphs (e.g., social networks and financial networks) under various conditions has recently attracted intensive research interest. In this paper, we study a novel Temporal $(k,\mathcal{X})$-Core Query (TXCQ) that extends a fundamental Temporal $k$-Core Query (TCQ) proposed in our conference paper by optimizing or constraining an arbitrary metric $\mathcal{X}$ of the $k$-core, such as size, engagement, interaction frequency, time span, burstiness, or periodicity. Our objective is to address specific TXCQ instances with conditions on different $\mathcal{X}$ in a unified algorithm framework that guarantees scalability. To that end, this journal paper proposes a taxonomy of the measurement $\mathcal{X}(\cdot)$ and achieves our objective using a two-phase framework when $\mathcal{X}(\cdot)$ is time-insensitive or time-monotonic. Specifically, Phase 1 still leverages the query processing algorithm of TCQ to induce all distinct $k$-cores during a given time range, and meanwhile locates the "time zones" in which the cores emerge. Then, Phase 2 conducts fast local search and $\mathcal{X}$ evaluation in each time zone with respect to the time insensitivity or monotonicity of $\mathcal{X}(\cdot)$. Two concepts, the tightest time interval and the loosest time interval, bound the time zones and dramatically reduce redundant core induction and unnecessary $\mathcal{X}$ evaluation within a zone. Our experimental results demonstrate that TXCQ can be addressed as efficiently as TCQ, which achieves the latest state-of-the-art performance, by using a general algorithm framework that leaves $\mathcal{X}(\cdot)$ as a user-defined function.
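
As a point of reference for what "inducing a $k$-core during a given time range" means, the sketch below computes the $k$-core of the graph projected onto a single time window by standard peeling. This is the naive per-window baseline, not the TCQ or TXCQ algorithm from the paper (which avoids recomputing cores for every window); the function name `temporal_k_core` and the edge-triple format are assumptions.

```python
from collections import defaultdict


def temporal_k_core(edges, k, t_start, t_end):
    """Return the vertex set of the k-core of the graph projected onto [t_start, t_end].

    `edges` is an iterable of (u, v, timestamp) triples; parallel edges at
    different timestamps collapse to one neighbour relation in the projection.
    """
    neighbours = defaultdict(set)
    for u, v, t in edges:
        if t_start <= t <= t_end and u != v:
            neighbours[u].add(v)
            neighbours[v].add(u)

    # Standard peeling: repeatedly delete vertices with fewer than k neighbours.
    queue = [v for v, ns in neighbours.items() if len(ns) < k]
    removed = set()
    while queue:
        v = queue.pop()
        if v in removed:
            continue
        removed.add(v)
        for w in neighbours[v]:
            if w in removed:
                continue
            neighbours[w].discard(v)
            if len(neighbours[w]) < k:
                queue.append(w)
    return set(neighbours) - removed


# Hypothetical toy graph: a triangle a-b-c formed over times 1..3,
# with d attached to a at time 2 and to c at time 9.
EDGES = [("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("a", "d", 2), ("c", "d", 9)]
```

On this toy input the 2-core is empty for the window [1, 2], becomes {a, b, c} for [1, 3], and grows to include d for [1, 9]. A user-defined metric such as size or time span could then be evaluated on each returned core.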

preprint 2022 arXiv

DialogLM: Pre-trained Model for Long Dialogue Understanding and Summarization

Dialogue is an essential part of human communication and cooperation. Existing research mainly focuses on short dialogue scenarios in a one-on-one fashion. However, multi-person interactions in the real world, such as meetings or interviews, are frequently over a few thousand words. There is still a lack of corresponding research and powerful tools to understand and process such long dialogues. Therefore, in this work, we present a pre-training framework for long dialogue understanding and summarization. Considering the nature of long conversations, we propose a window-based denoising approach for generative pre-training. For a dialogue, it corrupts a window of text with dialogue-inspired noise, and guides the model to reconstruct this window based on the content of the remaining conversation. Furthermore, to process longer input, we augment the model with sparse attention which is combined with conventional attention in a hybrid manner. We conduct extensive experiments on five datasets of long dialogues, covering tasks of dialogue summarization, abstractive question answering and topic segmentation. Experimentally, we show that our pre-trained model DialogLM significantly surpasses the state-of-the-art models across datasets and tasks. Source code and all the pre-trained models are available on our GitHub repository (https://github.com/microsoft/DialogLM).
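
The window-based denoising objective can be pictured with a small data-preparation sketch. The abstract does not spell out the noise types, so the two used here (speaker masking and turn shuffling), the `[MASK_SPEAKER]` token, and the function name `corrupt_window` are assumptions for illustration only; the sketch shows how a (noisy dialogue, target window) training pair could be built, not the released DialogLM code.

```python
import random


def corrupt_window(turns, start, size, seed=0):
    """Corrupt turns[start:start+size] and return (noisy_dialogue, target_window).

    `turns` is a list of (speaker, utterance) pairs. Turns outside the window
    are left untouched so the model can use them as context.
    """
    rng = random.Random(seed)
    window = turns[start:start + size]
    # Assumed noise 1: hide who is speaking inside the window.
    noisy = [("[MASK_SPEAKER]", utterance) for _, utterance in window]
    # Assumed noise 2: scramble the order of the turns inside the window.
    rng.shuffle(noisy)
    noisy_dialogue = turns[:start] + noisy + turns[start + size:]
    return noisy_dialogue, window
```

A generative model would take the noisy dialogue as input and be trained to reconstruct the returned target window.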

preprint 2022 arXiv

Learning Interaction Variables and Kernels from Observations of Agent-Based Systems

Dynamical systems across many disciplines are modeled as interacting particles or agents, with interaction rules that depend on a very small number of variables (e.g., pairwise distances or pairwise differences of phases), which are functions of the state of pairs of agents. Yet, these interaction rules can generate self-organized dynamics, with complex emergent behaviors (clustering, flocking, swarming, etc.). We propose a learning technique that, given observations of states and velocities along trajectories of the agents, yields both the variables upon which the interaction kernel depends and the interaction kernel itself, in a nonparametric fashion. This yields an effective dimension reduction which avoids the curse of dimensionality from the high-dimensional observation data (states and velocities of all the agents). We demonstrate the learning capability of our method on a variety of first-order interacting systems.

preprint 2022 arXiv

On the sparsity of LASSO minimizers in sparse data recovery

We present a detailed analysis of the unconstrained $\ell_1$-weighted LASSO method for recovery of sparse data from its observation by randomly generated matrices, satisfying the Restricted Isometry Property (RIP) with constant $\delta<1$, and subject to negligible measurement and compressibility errors. We prove that if the data is $k$-sparse, then the size of the support of the LASSO minimizer, $s$, maintains a comparable sparsity, $s\leq C_\delta k$. For example, if $\delta=0.7$ then $s< 11k$ and a slightly smaller $\delta=0.4$ yields $s< 4k$. We also derive new $\ell_2/\ell_1$ error bounds which highlight precise dependence on $k$ and on the LASSO parameter $\lambda$, before the error is driven below the scale of negligible measurement and compressibility errors.

preprint 2021 arXiv

Learning Interaction Kernels for Agent Systems on Riemannian Manifolds

Interacting agent and particle systems are extensively used to model complex phenomena in science and engineering. We consider the problem of learning interaction kernels in these dynamical systems constrained to evolve on Riemannian manifolds from given trajectory data. The models we consider are based on interaction kernels depending on pairwise Riemannian distances between agents, with agents interacting locally along the direction of the shortest geodesic connecting them. We show that our estimators converge at a rate that is independent of the dimension of the state space, and derive bounds on the trajectory estimation error, on the manifold, between the observed and estimated dynamics. We demonstrate the performance of our estimator on two classical first order interacting systems: Opinion Dynamics and a Predator-Swarm system, with each system constrained on two prototypical manifolds, the $2$-dimensional sphere and the Poincaré disk model of hyperbolic space.

preprint 2020 arXiv

Constructions of regular sparse anti-magic squares

Graph labeling is a well-known and intensively investigated problem in graph theory. Sparse anti-magic squares are useful in constructing vertex-magic labelings for graphs. For positive integers $n,d$ with $d<n$, an $n\times n$ array $A$ based on $\{0,1,\cdots,nd\}$ is called a sparse anti-magic square of order $n$ with density $d$, denoted by SAMS$(n,d)$, if each element of $\{1,2,\cdots,nd\}$ occurs in exactly one entry of $A$, and its row-sums, column-sums and two main diagonal sums constitute a set of $2n+2$ consecutive integers. An SAMS$(n,d)$ is called regular if there are exactly $d$ positive entries in each row, each column and each main diagonal. In this paper, we investigate the existence of regular sparse anti-magic squares of order $n\equiv1,5\pmod 6$, and it is proved that for any $n\equiv1,5\pmod 6$, there exists a regular SAMS$(n,d)$ if and only if $2\leq d\leq n-1$.
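
The two definitions above translate directly into a checker. The sketch below is only a verifier for the stated conditions, not a construction from the paper; the function names and the small order-3 example are assumptions (order 3 lies outside the $n\equiv1,5\pmod 6$ case the paper treats and is used purely because it is small enough to check by hand).

```python
def line_sums(square):
    """Row sums, column sums and the two main-diagonal sums of a square array."""
    n = len(square)
    rows = [sum(row) for row in square]
    cols = [sum(square[i][j] for i in range(n)) for j in range(n)]
    diag = sum(square[i][i] for i in range(n))
    anti = sum(square[i][n - 1 - i] for i in range(n))
    return rows + cols + [diag, anti]


def is_sams(square, d):
    """Check the SAMS(n, d) conditions: entries and consecutive line sums."""
    n = len(square)
    if any(len(row) != n for row in square):
        return False
    # Every element of {1, ..., nd} must occur exactly once; all else is 0.
    nonzero = sorted(x for row in square for x in row if x != 0)
    if nonzero != list(range(1, n * d + 1)):
        return False
    # The 2n + 2 line sums must form a run of consecutive integers.
    sums = sorted(line_sums(square))
    return sums == list(range(sums[0], sums[0] + 2 * n + 2))


def is_regular(square, d):
    """Check that each row, column and main diagonal has exactly d positive entries."""
    marks = [[1 if x > 0 else 0 for x in row] for row in square]
    return all(count == d for count in line_sums(marks))


# Hand-checked illustrative example: line sums are 4, 8, 9 (rows),
# 5, 10, 6 (columns) and 11, 7 (diagonals), i.e. the integers 4..11.
EXAMPLE = [
    [0, 4, 0],
    [3, 5, 0],
    [2, 1, 6],
]
```

Here `is_sams(EXAMPLE, 2)` holds, while `is_regular(EXAMPLE, 2)` does not, because the first row contains only one positive entry.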

preprint 2020 arXiv

Data-driven Discovery of Emergent Behaviors in Collective Dynamics

Particle- and agent-based systems are a ubiquitous modeling tool in many disciplines. We consider the fundamental problem of inferring interaction kernels from observations of trajectories of agent-based dynamical systems, in particular for collective dynamical systems exhibiting emergent behaviors with complicated interaction kernels, in a nonparametric fashion, and for kernels which are parametrized by a single unknown parameter. We extend the estimators introduced in \cite{PNASLU}, which are based on suitably regularized least squares estimators, to these larger classes of systems. We provide extensive numerical evidence that the estimators provide faithful approximations to the interaction kernels, and provide accurate predictions for trajectories started at new initial conditions, both throughout the "training" time interval in which the observations were made, and often much beyond. We demonstrate these features on prototypical systems displaying collective behaviors, ranging from opinion dynamics, flocking dynamics, self-propelling particle dynamics, and synchronized oscillator dynamics to a gravitational system. Our experiments also suggest that our estimated systems can display the same emergent behaviors as the observed systems, including those that occur at larger timescales than those used in the training data. Finally, in the case of families of systems governed by a parameterized family of interaction kernels, we introduce novel estimators that estimate the parameterized family of kernels, splitting it into a common interaction kernel and the action of parameters. We demonstrate this in the case of gravity, by learning both the "common component" $1/r^2$ and the dependency on mass, without any a priori knowledge of either one, from observations of planetary motions in our solar system.

preprint 2020 arXiv

Extractive Summarization as Text Matching

This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems. Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries (extracted from the original text) are matched in a semantic space. Notably, this paradigm shift to a semantic matching framework is well-grounded in our comprehensive analysis of the inherent gap between sentence-level and summary-level extractors based on the property of the dataset. Besides, even when instantiating the framework with a simple form of a matching model, we have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1). Experiments on the other five datasets also show the effectiveness of the matching framework. We believe the power of this matching-based summarization framework has not been fully exploited. To encourage more instantiations in the future, we have released our code, processed dataset, and generated summaries at https://github.com/maszhongming/MatchSum.
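
The gap between sentence-level and summary-level extraction can be seen on a toy input. The sketch below is not the released MatchSum code: it swaps in a bag-of-words cosine similarity as a stand-in for the learned semantic space, and the function names and the three example sentences are assumptions. It scores whole candidate summaries against the document and compares the result with picking the top-scoring sentences one at a time.

```python
import math
from collections import Counter
from itertools import combinations


def bow(text):
    """Bag-of-words vector; a crude stand-in for a learned text embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def score(document, sentences, indices):
    """Similarity between the document and the summary formed by `indices`."""
    return cosine(document, bow(" ".join(sentences[i] for i in indices)))


def sentence_level(sentences, size):
    """Pick the `size` sentences that score best individually."""
    document = bow(" ".join(sentences))
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(document, sentences, (i,)), reverse=True)
    return tuple(sorted(ranked[:size]))


def summary_level(sentences, size):
    """Pick the candidate summary (sentence subset) that best matches the document."""
    document = bow(" ".join(sentences))
    return max(combinations(range(len(sentences)), size),
               key=lambda idx: score(document, sentences, idx))


# Hypothetical three-sentence "document".
SENTENCES = ["cats chase mice", "cats chase mice daily", "stocks fell sharply"]
```

On this input the sentence-level extractor selects the two near-duplicate sentences about cats, while the summary-level matcher selects one of them together with the sentence about stocks, which covers more of the document.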

preprint 2019 arXiv

Influences of weak disorder on dynamical quantum phase transitions of anisotropic XY chain

In this paper, the effects of disorder on the dynamical quantum phase transitions (DQPTs) in the transverse-field anisotropic XY chain are studied by numerically calculating the Loschmidt echo after a quench. We obtain the formula for calculating the Loschmidt echo of the inhomogeneous system in real space. By comparing the results with those of the homogeneous chain, we find that when the quench crosses the Ising transition, small disorder causes a new critical point. As the disorder increases, more critical points of the DQPTs occur, constituting a critical region. In a quench across the anisotropic transition, the disorder causes a critical region near the critical point, and the width of the critical region increases with the disorder strength. In the case of a quench passing through two critical lines, small disorder leads the system to have three additional critical points. When the quench is in the ferromagnetic phase, large disorder causes the two critical points of the homogeneous case to become a critical region. For a quench in the paramagnetic phase, the DQPTs disappear for large disorder.

preprint 2019 arXiv

Nonparametric inference of interaction laws in systems of agents from trajectory data

Inferring the laws of interaction between particles and agents in complex dynamical systems from observational data is a fundamental challenge in a wide variety of disciplines. We propose a non-parametric statistical learning approach to estimate the governing laws of distance-based interactions, with no reference or assumption about their analytical form, from data consisting of trajectories of interacting agents. We demonstrate the effectiveness of our learning approach both by providing theoretical guarantees, and by testing the approach on a variety of prototypical systems in various disciplines. These systems include homogeneous and heterogeneous agent systems, ranging from particle systems in fundamental physics to agent-based systems modeling opinion dynamics under social influence, prey-predator dynamics, flocking and swarming, and phototaxis in cell dynamics.
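
A minimal sketch of what estimating a distance-based interaction law can look like, under stated assumptions. The abstract does not give the model, so the code assumes a one-dimensional first-order system dx_i/dt = (1/N) * sum_j phi(|x_j - x_i|) (x_j - x_i), a piecewise-constant kernel phi on fixed distance bins, and exactly observed velocities. It fits the bin values by plain least squares (no regularization), which is a simplification and not the estimator analysed in the paper; all function names are hypothetical.

```python
def design_row(positions, i, edges):
    """Regression features for agent i: one entry per distance bin."""
    n = len(positions)
    row = [0.0] * (len(edges) - 1)
    for j, xj in enumerate(positions):
        if j == i:
            continue
        r = abs(xj - positions[i])
        for k in range(len(edges) - 1):
            if edges[k] <= r < edges[k + 1]:
                row[k] += (xj - positions[i]) / n
    return row


def velocities(positions, edges, values):
    """Velocities under the assumed model with a piecewise-constant kernel."""
    return [sum(c * a for c, a in zip(values, design_row(positions, i, edges)))
            for i in range(len(positions))]


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    size = len(m)
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * size
    for r in reversed(range(size)):
        x[r] = (m[r][size] - sum(m[r][c] * x[c] for c in range(r + 1, size))) / m[r][r]
    return x


def fit_kernel(snapshots, observed, edges):
    """Least-squares estimate of the kernel value on each distance bin."""
    k = len(edges) - 1
    gram = [[0.0] * k for _ in range(k)]
    moment = [0.0] * k
    for positions, vels in zip(snapshots, observed):
        for i, v in enumerate(vels):
            row = design_row(positions, i, edges)
            for a in range(k):
                moment[a] += row[a] * v
                for b in range(k):
                    gram[a][b] += row[a] * row[b]
    return solve(gram, moment)
```

For example, with bin edges `[0.0, 1.0, 3.0]`, a true kernel of `[1.0, 0.2]`, and velocities generated by `velocities` on a couple of snapshots, `fit_kernel` recovers the two kernel values up to floating-point error. With noisy or finite-difference velocities the fit would only be approximate, and some regularization would likely be needed.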

preprint 2019 arXiv

The effects of KSEA interaction on the ground-state properties of spin chains in a transverse field

The effects of the symmetric helical interaction, known as the Kaplan, Shekhtman, Entin-Wohlman, and Aharony (KSEA) interaction, on the ground-state properties of three kinds of spin chains in a transverse field have been studied by means of correlation functions and the chiral order parameter. We find that the anisotropic transition of the $XY$ chain in a transverse field ($XY$TF) disappears because of the KSEA interaction. For the other two chains, we find that the regions of gapless chiral phases in the parameter space induced by the DM or $XZY-YZX$ type of three-site interaction shrink gradually as the strength of the KSEA interaction increases. When the KSEA strength is larger than the coefficient of the DM or $XZY-YZX$ type of three-site interaction, the gapless chiral phases also disappear.