Source author record

Shahin Shayandeh

Shahin Shayandeh appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Artificial Intelligence, Computation and Language, Machine Learning, Discrete Mathematics, Multiagent Systems, Networking and Internet Architecture

Catalog footprint

What is connected

5 works

6 topics

4 close collaborators

Preprint, 2026, arXiv

Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce Helpfulness-Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. We evaluate our approach on BFCL (single-turn) and Tau2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5 to +2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.
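
The Helpfulness-Harmfulness definitions in this abstract can be computed from paired before/after outcomes. The sketch below is a minimal illustration, assuming each test case is recorded as a hypothetical `Outcome` holding two booleans (was the base agent correct on its own, and is the final answer correct after reviewer feedback); the record type and function names are not taken from the paper, and returning 0.0 for an empty denominator is an arbitrary convention. The abstract does not spell out how its benefit-to-risk ratio is computed, so the sketch stops at the two percentages.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Outcome:
    """Hypothetical per-case record: correctness before and after reviewer feedback."""
    base_correct: bool      # was the base agent's tool call correct on its own?
    reviewed_correct: bool  # is the final tool call correct after reviewer feedback?


def helpfulness(outcomes: Sequence[Outcome]) -> float:
    """Percentage of base-agent errors that the feedback corrects."""
    errors = [o for o in outcomes if not o.base_correct]
    if not errors:
        return 0.0  # arbitrary convention: no errors means nothing to fix
    fixed = sum(o.reviewed_correct for o in errors)
    return 100.0 * fixed / len(errors)


def harmfulness(outcomes: Sequence[Outcome]) -> float:
    """Percentage of correct base-agent responses that the feedback degrades."""
    correct = [o for o in outcomes if o.base_correct]
    if not correct:
        return 0.0  # arbitrary convention: no correct answers means nothing to break
    broken = sum(not o.reviewed_correct for o in correct)
    return 100.0 * broken / len(correct)


# Toy data: 4 base errors (3 fixed by feedback) and 4 correct answers (1 broken).
outcomes = ([Outcome(False, True)] * 3 + [Outcome(False, False)]
            + [Outcome(True, True)] * 3 + [Outcome(True, False)])
print(helpfulness(outcomes), harmfulness(outcomes))  # 75.0 25.0
```

Keeping the two percentages separate, rather than reporting only net accuracy, makes visible the case where a reviewer fixes many errors but also breaks many correct calls.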

Preprint, 2020, arXiv

Conversation Learner -- A Machine Teaching Tool for Building Dialog Managers for Task-Oriented Dialog Systems

Traditionally, industry solutions for building a task-oriented dialog system have relied on helping dialog authors define rule-based dialog managers, represented as dialog flows. While dialog flows are intuitively interpretable and good for simple scenarios, they lack the flexibility needed to handle complex dialogs. On the other hand, purely machine-learned models can handle complex dialogs, but they are considered to be black boxes and require large amounts of training data. In this demonstration, we showcase Conversation Learner, a machine teaching tool for building dialog managers. It combines the best of both approaches by enabling dialog authors to create a dialog flow using familiar tools, converting the dialog flow into a parametric model (e.g., neural networks), and allowing dialog authors to improve the dialog manager (i.e., the parametric model) over time by leveraging user-system dialog logs as training data through a machine teaching interface.

Preprint, 2020, arXiv

Guided Dialog Policy Learning without Adversarial Learning in the Loop

Reinforcement Learning (RL) methods have emerged as a popular choice for training an efficient and effective dialogue policy. However, these methods suffer from sparse and unstable reward signals, which a user simulator returns only when a dialogue finishes. In addition, the reward signal is manually designed by human experts, which requires domain knowledge. Recently, a number of adversarial learning methods have been proposed to learn the reward function together with the dialogue policy. However, because the dialogue policy and the reward model are updated alternately on the fly, these methods are limited to policy-gradient-based algorithms, such as REINFORCE and PPO. Moreover, the alternating training of a dialogue agent and the reward model can easily get stuck in local optima or result in mode collapse. To overcome these issues, we propose to decompose the adversarial training into two steps. First, we train the discriminator with an auxiliary dialogue generator; then we incorporate a derived reward model into a common RL method to guide the dialogue policy learning. This approach is applicable to both on-policy and off-policy RL methods. Based on our extensive experimentation, we conclude that the proposed method (1) achieves a remarkable task success rate using both on-policy and off-policy RL methods, and (2) has the potential to transfer knowledge from existing domains to a new domain.
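
The two-step decomposition in this abstract can be illustrated with a toy sketch. It assumes dialogue state-action pairs are already encoded as numeric feature vectors (a hypothetical encoding), uses plain logistic regression as a stand-in discriminator, and derives the reward as r(x) = log D(x) - log(1 - D(x)), which simplifies to the discriminator's logit. That reward form is a common choice in adversarial imitation learning and is an assumption here, not necessarily the formula used in the paper; all function names are illustrative.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]  # hypothetical feature encoding of a dialogue (state, action) pair


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(w: List[float], b: float, x: Vector) -> float:
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_discriminator(real: List[Vector], fake: List[Vector],
                        lr: float = 0.1, epochs: int = 200,
                        seed: int = 0) -> Tuple[List[float], float]:
    """Step 1: fit D(x) = sigmoid(w.x + b) to separate human dialogues (label 1)
    from samples of an auxiliary dialogue generator (label 0) with SGD on
    binary cross-entropy. The dialogue policy is not involved in this step."""
    rng = random.Random(seed)
    w = [0.0] * len(real[0])
    b = 0.0
    data = [(x, 1.0) for x in real] + [(x, 0.0) for x in fake]
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            g = sigmoid(logit(w, b, x)) - y  # gradient of BCE w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def make_reward(w: List[float], b: float) -> Callable[[Vector], float]:
    """Step 2: freeze the discriminator and expose r(x) = log D(x) - log(1 - D(x)),
    which equals the logit w.x + b (assumed reward form)."""
    frozen_w, frozen_b = list(w), b
    return lambda x: logit(frozen_w, frozen_b, x)


real = [[1.0, 0.9], [0.8, 1.1], [1.2, 1.0]]        # toy "human" features
fake = [[-1.0, -0.9], [-0.8, -1.1], [-1.2, -1.0]]  # toy auxiliary-generator features
reward = make_reward(*train_discriminator(real, fake))
print(reward([1.0, 1.0]) > 0 > reward([-1.0, -1.0]))  # True
```

Because the discriminator is trained once and then frozen, `reward` is an ordinary function of the state-action features, so an on-policy learner such as PPO or an off-policy one such as DQN can call it without the alternating updates the abstract identifies as unstable.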

Preprint, 2020, arXiv

Robust Conversational AI with Grounded Text Generation

This article presents a hybrid approach based on a Grounded Text Generation (GTG) model to building robust task bots at scale. GTG is a hybrid model which uses a large-scale Transformer neural network as its backbone, combined with symbol-manipulation modules for knowledge base inference and prior knowledge encoding, to generate responses grounded in dialog belief state and real-world knowledge for task completion. GTG is pre-trained on large amounts of raw text and human conversational data, and can be fine-tuned to complete a wide range of tasks. The hybrid approach and its variants are being developed simultaneously by multiple research teams. The primary results reported on task-oriented dialog benchmarks are very promising, demonstrating the strong potential of this approach. This article provides an overview of this progress and discusses related methods and technologies that can be incorporated for building robust conversational AI systems.

Preprint, 2011, arXiv

You Share, I Share: Network Effects and Economic Incentives in P2P File-Sharing Systems

We study the interaction between network effects and external incentives on file sharing behavior in Peer-to-Peer (P2P) networks. Many current or envisioned P2P networks reward individuals for sharing files, via financial incentives or social recognition. Peers weigh this reward against the cost of sharing incurred when others download the shared file. However, if other nearby nodes share files as well, the cost to an individual node decreases, since download requests are spread across more sharers. Such positive network sharing effects can be expected to increase the rate of peers who share files. In this paper, we formulate a natural model for the network effects of sharing behavior, which we term the "demand model." We prove that the model has desirable diminishing returns properties, meaning that the network benefit of increasing payments decreases when the payments are already high. This result holds quite generally, for submodular objective functions on the part of the network operator. In fact, we show a stronger result: the demand model leads to a "coverage process," meaning that there is a distribution over graphs such that reachability under this distribution exactly captures the joint distribution of nodes which end up sharing. The existence of such distributions has advantages in simulating and estimating the performance of the system. We establish this result via a general theorem characterizing which types of models lead to coverage processes, and also show that all coverage processes possess the desirable submodular properties. We complement our theoretical results with experiments on several real-world P2P topologies. We compare our model quantitatively against more naïve models ignoring network effects. A main outcome of the experiments is that a good incentive scheme should make the reward dependent on a node's degree in the network.
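
The abstract does not give the demand model's exact graph distribution, so the sketch below only illustrates what a coverage process is in general. It assumes a hypothetical distribution in which each directed edge is independently "live" with a fixed probability p, and treats the nodes reachable from a set of source nodes in a sampled graph as the sharers. The Monte Carlo estimate is the kind of simulation the abstract says such distributions make easy; the graph, p, and function names are illustrative assumptions, not the paper's model.

```python
import random
from collections import deque
from typing import Dict, Iterable, List, Set

Graph = Dict[int, List[int]]  # adjacency list: node -> out-neighbours


def sample_live_graph(graph: Graph, p: float, rng: random.Random) -> Graph:
    """Keep each edge independently with probability p (hypothetical edge distribution)."""
    return {u: [v for v in nbrs if rng.random() < p] for u, nbrs in graph.items()}


def reachable(graph: Graph, sources: Iterable[int]) -> Set[int]:
    """Breadth-first search: every node reachable from at least one source."""
    seen = set(sources)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def expected_sharers(graph: Graph, sources: Iterable[int], p: float,
                     samples: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected number of nodes that end up sharing."""
    rng = random.Random(seed)
    src = list(sources)
    total = 0
    for _ in range(samples):
        live = sample_live_graph(graph, p, rng)
        total += len(reachable(live, src))
    return total / samples


toy = {0: [1], 1: [2], 3: [4]}
print(expected_sharers(toy, [0], p=1.0))  # 3.0: every edge is live, so 0 reaches 1 and 2
```

Reachability-based coverage functions are submodular in every sampled graph, and therefore in expectation: adding a source to a larger set never reaches more new nodes than adding it to a subset. On the toy graph with p = 1, adding node 1 to {3} reaches two new nodes, while adding it to {0, 3} reaches none, which is the diminishing-returns behavior the abstract describes.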
