Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/

Preprint, 2026, arXiv, open access

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, Anurag Kashyap, Jan-Lucas Uslu, Jeffrey Li, Jianbo Wu, Minghao Yan, Song Bian, Vedang Sharma, Ke Sun, Steven Dillmann, Akshay Anand, Andrew Lanpouthakoun, Bardia Koopah, Changran Hu, Etash Guha, Gabriel H. S. Dreiman, Jiacheng Zhu, Karl Krauth, Li Zhong, Niklas Muennighoff, Robert Amanfu, Shangyin Tan, Shreyas Pimpalgaonkar, Tushar Aggarwal, Xiangning Lin, Xin Lan, Xuandong Zhao, Yiqing Liang, Yuanli Wang, Zilong Wang, Changzhi Zhou, David Heineman, Hange Liu, Harsh Trivedi, John Yang, Junhong Lin, Manish Shetty, Michael Yang, Nabil Omi, Negin Raoof, Shanda Li, Terry Yue Zhuo, Wuwei Lin, Yiwei Dai, Yuxin Wang, Wenhao Chai, Shang Zhou, Dariush Wahdany, Ziyu She, Jiaming Hu, Zhikang Dong, Yuxuan Zhu, Sasha Cui, Ahson Saiyed, Arinbjörn Kolbeinsson, Jesse Hu, Christopher Michael Rytting, Ryan Marten, Yixin Wang, Alex Dimakis, Andy Konwinski, Ludwig Schmidt

Software Engineering, Artificial Intelligence

Signal facts

What is known right now

Open access, 85 authors, 2 topics

Citations: 0, Reviews: 0, Saves: 0, Code: not linked, Dataset: not linked, Institutions: 0
