Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Now that frontier LLMs have achieved gold-medal performance on the IMO, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Research-level problems are emerging as a compelling alternative: whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4%, respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Beyond standard problem solving, Soohak also introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, which identifies refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

Preprint · 2026 · arXiv · Open access

Guijin Son Seungone Kim Catherine Arnett Hyunwoo Ko Hyein Lee Hyeonah Kang Jiang Longxi Jin Yun JungYup Lee Kyungmin Lee Sam Yoosuk Kim Sang Park Seunghyeok Hong SeungJae Lee Seungyeop Yi Shinae Shin SunHye Bok Sunyoung Shin Yonghoon Ji Youngtaek Kim Hanearl Jung Akari Asai Graham Neubig Sean Welleck Youngjae Yu Akshelin R Alexander B. Ivanov Boboev Muhammadjon Chae Young Han Christian Stump Cooper R. Anderson Dmitrii Karp Dohyun Kwon Dongryung Yi DoYong Kwon Duk-Soon Oh Eunho Choi Giovanni Resta Greta Panova Huiyun Noh Hyungryul Baik Hyungsun Bae Inomov Mashrafdzhon Jeewon Kim Jeong-Rae Kim Ji Eun Lee Jiaqi Liu Jieui Kang Jimin Kim Jon-Lark Kim Joonyeong Won Junseo Yoon Junwoo Jo Kibeom Kim Kiwoon Kwon Mario Kummer Max Mercer Min Hoon Kim Minjun Kim Nahyun Lee Ng Ze-An Nicolas Libedinsky Rafał Marcin Łochowski Raphaël Lachièze-Rey Robert Auffarth Ruichen Zhang Sejin Park Seonguk Seo Shin Jaehoon Sunatullo Taewoong Eom Yeachan Park Yongseok Jang Youchan Oh Zhaoyang Wang Zoltán Kovács

Computation and Language

Signal facts

What is known right now

Open access · 76 authors · 1 topic

Citations: 0 · Reviews: 0 · Saves: 0 · Code: not linked · Dataset: not linked · Institutions: 0
