Source author record

Tianyin Xu

Tianyin Xu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Software Engineering Artificial Intelligence eess.SY Human-Computer Interaction Systems and Control

Catalog footprint

What is connected

3works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to oversimplistic SRE tasks and are unfortunately hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities varies significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.

preprint2021arXiv

An Evolutionary Study of Configuration Design and Implementation in Cloud Systems

Many techniques were proposed for detecting software misconfigurations in cloud systems and for diagnosing unintended behavior caused by such misconfigurations. Detection and diagnosis are steps in the right direction: misconfigurations cause many costly failures and severe performance issues. But, we argue that continued focus on detection and diagnosis is symptomatic of a more serious problem: configuration design and implementation are not yet first-class software engineering endeavors in cloud systems. Little is known about how and why developers evolve configuration design and implementation, and the challenges that they face in doing so. This paper presents a source-code level study of the evolution of configuration design and implementation in cloud systems. Our goal is to understand the rationale and developer practices for revising initial configuration design/implementation decisions, especially in response to consequences of misconfigurations. To this end, we studied 1178 configuration-related commits from a 2.5 year version-control history of four large-scale, actively-maintained open-source cloud systems (HDFS, HBase, Spark, and Cassandra). We derive new insights into the software configuration engineering process. Our results motivate new techniques for proactively reducing misconfigurations by improving the configuration design and implementation process in cloud systems. We highlight a number of future research directions.

preprint2016arXiv

An HCI View of Configuration Problems

In recent years, configuration problems have drawn tremendous attention because of their increasing prevalence and their big impact on system availability. We believe that many of these problems are attributable to today's configuration interfaces that have not evolved to accommodate the enormous shift of the system administrator group. Plain text files, as the de facto configuration interfaces, assume administrators' understanding of the system under configuration. They ask administrators to directly edit the corresponding entries with little guidance or assistance. However, this assumption no longer holds for todays administrator group which has expanded greatly to include non- and semi-professional administrators. In this paper, we provide an HCI view of today's configuration problems, and articulate system configuration as a new HCI problem. Moreover, we present the top obstacles to correctly and efficiently configuring software systems, and most importantly their implications on the design and implementation of new-generation configuration interfaces.