Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
11 works
0 followers
14 topics
4 close collaborators

Published work

11 published item(s)

preprint · 2026 · arXiv

FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration

LLM-based evolution has emerged as a promising way to improve agents by refining non-parametric artifacts, but its wall-clock cost remains a major bottleneck. We identify that this cost comes from synchronized stage execution and imbalance inside each LLM-heavy stage. We present FlashEvolve, an efficient framework that replaces synchronized execution with asynchronous workers and queues, allowing different stages and steps to overlap. To handle data staleness introduced by asynchrony, FlashEvolve tracks artifact versions and applies different policies to update, discard, or patch stale artifacts. Unlike weight-space staleness in asynchronous RL, language-space staleness is inspectable and repairable: a stale artifact is not just delayed work, but readable evidence that the LLM can reflect on, revise, and turn into useful evolution signal. FlashEvolve further improves throughput and token efficiency with speculative stage completion and adaptive workflow control. On GEPA workloads, FlashEvolve improves proposal throughput by $3.5\times$ on local vLLM and $4.9\times$ on API serving over synchronous GEPA. The same design also applies to ACE and Meta-Harness.
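The version-tracking idea in this abstract can be illustrated with a minimal sketch. The abstract only says that stale artifacts are updated, discarded, or patched; the lag thresholds, the `Proposal` type, and the `resolve` function below are hypothetical names and rules chosen for illustration, not FlashEvolve's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    base_version: int  # artifact version the worker read when it started
    text: str          # proposed new artifact content


def resolve(proposal: Proposal, current_version: int, max_lag: int = 2) -> str:
    """Pick a staleness policy from the version lag (assumed rule, for illustration)."""
    lag = current_version - proposal.base_version
    if lag < 0:
        raise ValueError("proposal is based on a version that does not exist yet")
    if lag == 0:
        return "update"   # fresh: apply the proposal directly
    if lag <= max_lag:
        return "patch"    # mildly stale: revise it against the current artifact
    return "discard"      # too stale to be worth repairing
```

In an asynchronous pipeline, a check like this would run when a worker's result is dequeued, since the artifact may have advanced while the worker was busy.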

preprint · 2025 · arXiv

Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding

Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to $3.98\times$ speedup over state-of-the-art baselines across multiple hardware setups.

preprint · 2022 · arXiv

Block-Skim: Efficient Question Answering for Transformer

Transformer models have achieved promising results on natural language processing (NLP) tasks including extractive question answering (QA). Common Transformer encoders used in NLP tasks process the hidden states of all input tokens in the context paragraph throughout all layers. However, unlike other tasks such as sequence classification, answering the raised question does not necessarily need all the tokens in the context paragraph. Following this motivation, we propose Block-Skim, which learns to skim unnecessary context in higher hidden layers to improve and accelerate Transformer performance. The key idea of Block-Skim is to identify the context that must be further processed and the context that can be safely discarded early during inference. Critically, we find that such information can be sufficiently derived from the self-attention weights inside the Transformer model. We further prune the hidden states corresponding to the unnecessary positions early in lower layers, achieving significant inference-time speedup. To our surprise, we observe that models pruned in this way outperform their full-size counterparts. Block-Skim improves QA models' accuracy on different datasets and achieves a 3 times speedup on the BERT-base model.

preprint · 2022 · arXiv

Transkimmer: Transformer Learns to Layer-wise Skim

The Transformer architecture has become the de facto model for many machine learning tasks, from natural language processing to computer vision. As such, improving its computational efficiency becomes paramount. One major computational inefficiency of Transformer-based models is that they spend an identical amount of computation throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they lack effective end-to-end optimization of the discrete skimming predictor. To address the above limitations, we propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key idea in Transkimmer is to add a parameterized predictor before each layer that learns to make the skimming decision. We also propose to adopt the reparameterization trick and add a skim loss for the end-to-end training of Transkimmer. Transkimmer achieves a 10.97x average speedup on the GLUE benchmark compared with the vanilla BERT-base baseline, with less than 1% accuracy degradation.
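A reparameterization trick lets a discrete keep-or-skim choice be sampled in a way that stays differentiable during training. The sketch below uses Gumbel-softmax, one common reparameterization for categorical choices; the abstract does not name the exact variant, so treating it as Gumbel-softmax is an assumption, and `gumbel_softmax` and `skim_decision` are hypothetical helper names.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Relaxed categorical sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:          # avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)            # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def skim_decision(skim_logit, keep_logit, temperature=1.0, rng=random):
    """Return True if the token is kept for the next layer (hard argmax of the sample)."""
    probs = gumbel_softmax([skim_logit, keep_logit], temperature, rng)
    return probs[1] >= probs[0]
```

In a trained model the two logits would come from a small per-layer predictor over each token's hidden state; here they are passed in directly.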

preprint · 2021 · arXiv

Meshless Fragile Points Methods Based on Petrov-Galerkin Weak-Forms for Transient Heat Conduction Problems in Complex Anisotropic Nonhomogeneous Media

Three kinds of Fragile Points Methods based on Petrov-Galerkin weak-forms (PG-FPMs) are proposed for analyzing heat conduction problems in nonhomogeneous anisotropic media. This is a follow-up of the previous study on the original FPM based on a symmetric Galerkin weak-form. The trial function is piecewise-continuous, written as local Taylor expansions at the Fragile Points. A modified Radial Basis Function-based Differential Quadrature (RBF-DQ) method is employed for establishing the local approximation. The Dirac delta function, Heaviside step function, and the local fundamental solution of the governing equation are alternatively used as test functions. Vanishing or pure contour integral formulation in subdomains or on local boundaries can be obtained. Extensive numerical examples in 2D and 3D are provided as validations. The collocation method (PG-FPM-1) is superior in transient analysis with arbitrary point distribution and domain partition. The finite volume method (PG-FPM-2) shows the best efficiency, saving 25% to 50% of computational time compared with the Galerkin FPM. The singular solution method (PG-FPM-3) is highly efficient in steady-state analysis. Anisotropy and nonhomogeneity give rise to no difficulties in any of the methods. The proposed PG-FPM approaches represent an improvement over the original Galerkin FPM, as well as over other meshless methods in earlier literature.

preprint · 2020 · arXiv

A New Meshless "Fragile Points Method (FPM)" Based on A Galerkin Weak-Form for 2D Flexoelectric Analysis

A meshless Fragile Points Method (FPM) is presented for analyzing 2D flexoelectric problems. Local, simple, polynomial and discontinuous trial and test functions are generated with the help of a local meshless differential quadrature approximation of the first three derivatives. Interior Penalty Numerical Fluxes are employed to ensure the consistency of the method. Based on a Galerkin weak-form formulation, the present FPM leads to symmetric and sparse matrices, and avoids the difficulties of numerical integration in the previous meshfree methods. Numerical examples including isotropic and anisotropic materials with flexoelectric and piezoelectric effects are provided as validations. The present method is much simpler than the Finite Element Method, or the Element-Free Galerkin (EFG) and Meshless Local Petrov-Galerkin (MLPG) methods, and the numerical integration of the weak form is trivially simple.

preprint · 2020 · arXiv

A New Meshless "Fragile Points Method" and A Local Variational Iteration Method for General Transient Heat Conduction in Anisotropic Nonhomogeneous Media

A new and effective computational approach is presented for analyzing transient heat conduction problems. The approach consists of a meshless Fragile Points Method (FPM) for spatial discretization and a Local Variational Iteration (LVI) scheme for time discretization. Anisotropy and nonhomogeneity do not give rise to any difficulties in the present implementation. The meshless FPM is based on a Galerkin weak-form formulation and thus leads to symmetric matrices. Local, very simple, polynomial and discontinuous trial and test functions are employed. In the meshless FPM, Interior Penalty Numerical Fluxes are introduced to ensure the consistency of the method. The LVIM in the time domain is generated as a combination of the Variational Iteration Method (VIM) applied over a large time interval and numerical algorithms. A set of collocation nodes is employed in each finitely large time interval. The FPM + LVIM approach is capable of solving transient heat transfer problems in complex geometries with mixed boundary conditions, including pre-existing cracks. Numerical examples are presented in 2D and 3D domains. Both functionally graded materials and composite materials are considered. It is shown that, with suitable computational parameters, the FPM + LVIM approach is not only accurate, but also efficient, and has reliable stability under relatively large time intervals. The present methodology represents a considerable improvement to the current state of science in computational transient heat conduction in anisotropic nonhomogeneous media.

preprint · 2020 · arXiv

Accelerating Sparse DNN Models without Hardware-Support via Tile-Wise Sparsity

Network pruning can reduce the high computation cost of deep neural network (DNN) models. However, to maintain their accuracy, sparse models often carry randomly distributed weights, leading to irregular computations. Consequently, sparse models cannot achieve meaningful speedup on commodity hardware (e.g., GPUs) built for dense matrix computations. As such, prior works usually modify existing architectures or design completely new sparsity-optimized ones to exploit sparsity. We propose an algorithm-software co-designed pruning method that achieves latency speedups on existing dense architectures. Our work builds upon the insight that matrix multiplication generally breaks the large matrix into multiple smaller tiles for parallel execution. We propose a tiling-friendly "tile-wise" sparsity pattern, which maintains a regular pattern at the tile level for efficient execution but allows for irregular, arbitrary pruning at the global scale to maintain high accuracy. We implement and evaluate the sparsity pattern on GPU tensor cores, achieving a 1.95x speedup over the dense model.
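One way to read "regular at the tile level, irregular at the global scale" is sketched below: the weight matrix is cut into fixed-width column tiles, whole row segments are pruned inside each tile, and the segments to prune are chosen by a single global ranking so different tiles can lose different amounts. The L1-norm score and the function name `tile_wise_prune` are assumptions for illustration and may differ from the paper's actual criterion.

```python
def tile_wise_prune(weights, tile_width, sparsity):
    """Zero out the lowest-scoring (tile, row) segments of a 2D weight matrix.

    weights    : list of equal-length rows (list of lists of numbers)
    tile_width : number of columns per tile
    sparsity   : fraction of (tile, row) segments to prune, in [0, 1]
    """
    rows, cols = len(weights), len(weights[0])

    # Score every row segment inside every column tile by its L1 norm.
    segments = []
    for start in range(0, cols, tile_width):
        for r in range(rows):
            score = sum(abs(v) for v in weights[r][start:start + tile_width])
            segments.append((score, start, r))

    # Global ranking: tiles may end up with different numbers of pruned rows.
    segments.sort()
    n_prune = int(len(segments) * sparsity)

    pruned = [row[:] for row in weights]  # leave the input untouched
    for _, start, r in segments[:n_prune]:
        for c in range(start, min(start + tile_width, cols)):
            pruned[r][c] = 0.0
    return pruned
```

Because each tile only loses whole row segments, the surviving rows of a tile can be packed into a smaller dense block, which is what makes the pattern friendly to dense matrix hardware.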

preprint · 2020 · arXiv

Bounded-Rational Pursuit-Evasion Games

We present a framework that incorporates the idea of bounded rationality into dynamic stochastic pursuit-evasion games. The solution of a stochastic game is characterized, in general, by its (Nash) equilibria in feedback form. However, computing these Nash equilibrium strategies may require extensive computational resources. In this paper, the agents are modeled as bounded rational entities having limited computational resources. We illustrate the framework by applying it to a pursuit-evasion game between two vehicles in a stochastic wind field, where both the pursuer and the evader are bounded rational. We show how such a game may be analyzed by properly casting it as an iterative sequence of finite-state Markov Decision Processes (MDPs). Leveraging tools and algorithms from cognitive hierarchy theory ("level-$k$ thinking"), we compute the solution of the ensuing discrete game, while taking into consideration the rationality level of each agent. We also present an online algorithm for each agent to infer its opponent's rationality level.
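Level-$k$ thinking can be illustrated on a one-shot matrix game, which is far simpler than the MDP sequence the abstract describes: a level-0 player acts uniformly at random, and a level-$k$ player best-responds to an opponent assumed to be level-$(k-1)$. The payoff matrices and function names below are hypothetical and only show the recursion.

```python
def best_response(payoff, opponent_mix):
    """Index of the own action with the highest expected payoff.

    payoff[i][j] is the own payoff for own action i against opponent action j.
    """
    expected = [sum(p * q for p, q in zip(row, opponent_mix)) for row in payoff]
    return max(range(len(payoff)), key=lambda i: expected[i])


def level_k_strategy(own, other, k):
    """Mixed strategy (list of probabilities) of a level-k player.

    own[i][j]   : own payoff for own action i, opponent action j
    other[j][i] : opponent payoff for opponent action j, own action i
    """
    n = len(own)
    if k == 0:
        return [1.0 / n] * n                        # level-0: uniform random
    opponent = level_k_strategy(other, own, k - 1)  # opponent modeled one level lower
    best = best_response(own, opponent)
    return [1.0 if i == best else 0.0 for i in range(n)]
```

For example, with a pursuer payoff of `[[2, 0], [0, 1]]` and an evader payoff equal to its negated transpose, a level-1 pursuer picks action 0, while a level-2 pursuer anticipates the level-1 evader's escape and picks action 1.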

preprint · 2020 · arXiv

Towards Dynamic Pricing for Shared Mobility on Demand using Markov Decision Processes and Dynamic Programming

In a Shared Mobility on Demand Service (SMoDS), dynamic pricing plays an important role as an incentive for the empowered passenger's decision on the ride offer. Strategies for determining the dynamic tariff should be designed so that the incurred demand and supply are balanced and economic efficiency is therefore achieved. In this manuscript, we formulate a discrete-time Markov Decision Process (MDP) to determine the probability desired by the SMoDS platform corresponding to the acceptance rate of each empowered passenger at each state of the system. We use Estimated Waiting Time (EWT) as the metric for the balance between demand and supply, with the goal that EWT be regulated around a target value. We then develop a Dynamic Programming (DP) algorithm to derive the optimal policy of the MDP that regulates EWT around the target value. Computational experiments across various scenarios demonstrate that the regulation of EWT is effective. The overall demonstration is carried out offline. The MDP formulation together with the DP algorithm can be used for online determination of the dynamic tariff by integrating with our earlier works on Cumulative Prospect Theory based passenger behavioral modeling and the AltMin dynamic routing algorithm, which forms the subject of future work.
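The MDP-plus-DP idea can be sketched with finite-horizon backward induction on a toy model. Everything specific here is an assumption made for illustration: the five EWT levels, the three tariff actions, the 0.8/0.2 transition probabilities, and the squared-deviation cost are not taken from the paper.

```python
def backward_induction(states, actions, transition, cost, horizon):
    """Finite-horizon DP; returns (value at the first step, list of per-step policies)."""
    value = {s: 0.0 for s in states}   # terminal value
    policy = []
    for _ in range(horizon):
        new_value, step_policy = {}, {}
        for s in states:
            best_a, best_q = None, float("inf")
            for a in actions:
                q = cost(s) + sum(p * value[s2] for s2, p in transition(s, a).items())
                if q < best_q:
                    best_a, best_q = a, q
            new_value[s], step_policy[s] = best_q, best_a
        value = new_value
        policy.insert(0, step_policy)  # earlier time steps go to the front
    return value, policy


# Toy model (hypothetical): EWT level 0..4, target level 2.
STATES = range(5)
ACTIONS = ("lower", "hold", "raise")   # tariff adjustments
TARGET = 2


def transition(s, a):
    """Assumed dynamics: a lower tariff raises demand, so EWT tends to go up."""
    drift = {"lower": 1, "hold": 0, "raise": -1}[a]
    moved = min(max(s + drift, 0), 4)
    dist = {moved: 0.0, s: 0.0}
    dist[moved] += 0.8                 # the intended move succeeds
    dist[s] += 0.2                     # otherwise EWT stays put
    return dist


def cost(s):
    return (s - TARGET) ** 2           # penalize deviation from the target EWT


value, policy = backward_induction(STATES, ACTIONS, transition, cost, horizon=3)
```

In this toy setting the first-step policy raises the tariff when EWT is above target, lowers it when EWT is below target, and holds at the target.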