2026-05-04 Daily Report: AcademiClaw: When Students Set Challenges for AI Agents

AcademiClaw: When Students Set Challenges for AI Agents

Authors Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu

Affiliations Shanghai Jiao Tong University / GAIR / SII

Categories Evaluation / Benchmarking / Complex academic task benchmark, Task / Sequential Task / Long-horizon tasks, Application / Educational AI / Student-generated challenge tasks

License CC BY 4.0

Abstract Overview

AcademiClaw is a bilingual benchmark of 80 complex, long-horizon tasks within the OpenClaw agent ecosystem. The tasks come from university students' real academic workflows and are problems that current AI agents were unable to solve effectively. They were curated from 230 student-submitted candidates through expert review and span more than 25 professional domains, including competition-level mathematics, GPU-intensive reinforcement learning, and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task runs in an isolated Docker sandbox and is evaluated with multi-dimensional rubrics combining six verification techniques, alongside a five-category safety audit and full trajectory logging. Experiments on six frontier models show that even the best model achieves only a 55% pass rate, confirming that academic-level tasks remain substantially challenging for current agents.

Novelty

AcademiClaw is presented as the first academic-level benchmark in the OpenClaw ecosystem and the first agent benchmark whose tasks originate entirely from university students rather than researchers or annotators. Its distinctive aspects include authentic student-originated tasks from real academic workflows, bilingual coverage with natively Chinese tasks requiring culturally grounded competence, GPU-intensive tasks absent from all prior OpenClaw benchmarks, and a six-method evaluation framework paired with five-category safety auditing.

Results

Among six frontier models evaluated, Claude Opus 4.6 achieved the highest average score of 71.9 and shares the top pass rate of 55.0% with Claude Sonnet 4.6, while 23 of 80 tasks defeated all six models. Analysis reveals that cross-category difficulty variation (26.3-point gap) far exceeds cross-model variation (8.8-point gap), token consumption shows near-zero correlation with task score (r = −0.03), and three distinct behavioral phenotypes (read-first, execute-first, minimalist) emerge with divergent efficiency and safety profiles. Boundary compliance (S3) is identified as the decisive safety dimension, with a 53-point gap between the safest and least safe models.
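The three summary statistics quoted above (pass rate, tasks unsolved by every model, and the token/score correlation) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the pass threshold of 60, the model names, and the toy scores are hypothetical, since this summary does not state the benchmark's actual pass criterion or per-task data.

```python
from math import sqrt

# Hypothetical cutoff; the summary does not state the benchmark's real pass criterion.
PASS_THRESHOLD = 60.0


def pass_rate(scores, threshold=PASS_THRESHOLD):
    """Fraction of tasks whose score meets or exceeds the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def unsolved_by_all(scores_by_model, threshold=PASS_THRESHOLD):
    """Indices of tasks on which every model scores below the threshold."""
    n_tasks = len(next(iter(scores_by_model.values())))
    return [
        i
        for i in range(n_tasks)
        if all(scores[i] < threshold for scores in scores_by_model.values())
    ]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy data for two hypothetical models on four hypothetical tasks.
scores_by_model = {
    "model_a": [80.0, 40.0, 65.0, 20.0],
    "model_b": [70.0, 55.0, 30.0, 10.0],
}

if __name__ == "__main__":
    for name, scores in scores_by_model.items():
        print(name, "pass rate:", pass_rate(scores))
    print("unsolved by all:", unsolved_by_all(scores_by_model))
```

A value of `pearson_r` near zero, like the reported r = −0.03, would mean that spending more tokens on a task tells you almost nothing about the score the agent ends up with.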

Key Points

AcademiClaw contains 80 bilingual, long-horizon academic tasks curated from 230 student-submitted real-world problems across 25+ domains, including 16 tasks requiring CUDA GPU execution, making it the first GPU-inclusive benchmark in the OpenClaw ecosystem.
The benchmark evaluates agents in isolated Docker environments using task-specific multi-dimensional rubrics (3–6 scoring dimensions per task) built from six complementary verification techniques, plus five-category safety auditing and trajectory logging.
Evaluation of six frontier models reveals limited success (top pass rate 55.0%, 23 tasks unsolved by all models), strong category-dependent difficulty where cross-category variation exceeds cross-model variation, and near-zero correlation between token consumption and output quality (r = −0.03).
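To make the idea of a task-specific multi-dimensional rubric concrete, the sketch below combines per-dimension scores into a single task score. It is only one plausible aggregation, not the benchmark's documented method: the weighted average, the 0 to 100 scale, and the `Dimension` and `rubric_score` names are all hypothetical. The only detail taken from the summary is that each task has 3 to 6 scoring dimensions.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    """One rubric dimension (hypothetical structure for illustration)."""
    name: str
    weight: float  # relative importance; assumed, not from the source
    score: float   # assumed 0-100 scale


def rubric_score(dimensions):
    """Weighted average of dimension scores for a single task.

    Enforces the 3-6 dimensions per task mentioned in the summary;
    the weighted-average rule itself is an assumption.
    """
    if not 3 <= len(dimensions) <= 6:
        raise ValueError("expected between 3 and 6 rubric dimensions")
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(d.weight * d.score for d in dimensions) / total_weight


# Hypothetical rubric for one task.
example = [
    Dimension("correctness", weight=2.0, score=100.0),
    Dimension("completeness", weight=1.0, score=50.0),
    Dimension("report quality", weight=1.0, score=50.0),
]

if __name__ == "__main__":
    print(rubric_score(example))
```

Normalizing by the total weight keeps scores comparable across tasks that use different numbers of dimensions.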

References

arXiv: https://arxiv.org/abs/2605.02661v1
Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.02661v1
Hugging Face Papers: https://huggingface.co/papers/2605.02661
GitHub: https://github.com/GAIR-NLP/AcademiClaw
