Source author record

Murtuza N. Shergadwala

Murtuza N. Shergadwala appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computation and Language, Human-Computer Interaction, Software Engineering

Catalog footprint

What is connected

2 works

3 topics

1 close collaborator

Preprint, 2026, arXiv

The Stability Trap: Evaluating the Reliability of LLM-Based Instruction Adherence Auditing

The enterprise governance of Generative AI (GenAI) in regulated sectors, such as Human Resources (HR), demands scalable yet reproducible auditing mechanisms. While Large Language Model (LLM)-as-a-Judge approaches offer scalability, their reliability in evaluating adherence of different types of system instructions remains unverified. This study asks: To what extent does the instruction type of an Application Under Test (AUT) influence the stability of judge evaluations? To address this, we introduce the Scoped Instruction Decomposition Framework to classify AUT instructions into Objective and Subjective types, isolating the factors that drive judge instability. We applied this framework to two representative HR GenAI applications, evaluating the stability of four judge architectures over variable runs. Our results reveal a "Stability Trap" characterized by a divergence between Verdict Stability and Reasoning Stability. While judges achieved near-perfect verdict agreement (>99%) for both objective and subjective evaluations, their accompanying justification traces diverged significantly. Objective instructions requiring quantitative analysis, such as word counting, exhibited reasoning stability as low as approximately 19%, driven by variances in numeric justifications. Similarly, reasoning stability for subjective instructions varied widely (35%–83%) based on evidence granularity, with feature-specific checks failing to reproduce consistent rationale. Conversely, objective instructions focusing on discrete entity extraction achieved high reasoning stability (>90%). These findings demonstrate that high verdict stability can mask fragile reasoning. Thus, we suggest that auditors scope automated evaluation protocols strictly: delegate all deterministically verifiable logic to code, while reserving LLM judges for complex semantic evaluation.
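
The abstract's closing recommendation, moving deterministically verifiable checks such as word counting out of the LLM judge and into code, might look roughly like the Python sketch below. It is an illustration only and not taken from the paper: the function names `within_word_limit` and `modal_agreement` are hypothetical, words are assumed to be whitespace-delimited, and the agreement measure (share of runs matching the most common label) is an assumed stand-in, since the abstract does not define how Verdict Stability or Reasoning Stability are computed.

```python
from collections import Counter


def within_word_limit(text: str, max_words: int) -> bool:
    """Deterministic check: count whitespace-delimited words against a limit.

    Assumption: a "word" is any whitespace-separated token.
    """
    return len(text.split()) <= max_words


def modal_agreement(labels: list[str]) -> float:
    """Share of runs whose label equals the most common label.

    Hypothetical stability proxy; the paper's own metric may differ.
    """
    if not labels:
        raise ValueError("labels must be non-empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Illustrative, made-up judge outputs over five repeated runs.
verdicts = ["pass", "pass", "pass", "pass", "pass"]
rationales = ["120 words", "118 words", "120 words", "121 words", "120 words"]

verdict_agreement = modal_agreement(verdicts)      # all five runs agree
rationale_agreement = modal_agreement(rationales)  # three of five runs agree
```

In this made-up example the verdicts agree on every run while the numeric justifications do not, which is the kind of divergence the abstract describes. The coded word-count check returns the same answer on every call for the same input.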

Preprint, 2021, arXiv

Esports Agents with a Theory of Mind: Towards Better Engagement, Education, and Engineering

The role of AI in esports is shifting from leveraging games as a testbed for improving AI algorithms to addressing the needs of the esports players such as enhancing their gaming experience, esports skills, and providing coaching. For AI to be able to effectively address such needs in esports, AI agents require a theory of mind, that is, the ability to infer players' tactics and intents. To that end, in this position paper, we argue for human-in-the-loop approaches for the discovery and computational embedding of the theory of mind within behavioral models of esports players. We discuss that such approaches can be enabled by player-centric investigations on situated cognition that will expand our understanding of the cognitive and other unobservable factors that influence esports players' behaviors. We conclude by discussing the implications of such a research direction in esports as well as broader implications in engineering design and design education.