AcademiClaw: When Students Set Challenges for AI Agents
Abstract Overview
AcademiClaw is a bilingual benchmark of 80 complex, long-horizon tasks within the OpenClaw agent ecosystem, sourced from university students' real academic workflows that current AI agents were unable to solve effectively. The tasks were curated from 230 student-submitted candidates through expert review and span more than 25 professional domains, including competition-level mathematics, GPU-intensive reinforcement learning, and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task runs in an isolated Docker sandbox and is evaluated with multi-dimensional rubrics combining six verification techniques, alongside a five-category safety audit and full trajectory logging. Experiments on six frontier models show that even the best achieves only a 55% pass rate, confirming that academic-level tasks remain substantially challenging for current agents.
Novelty
AcademiClaw is presented as the first academic-level benchmark in the OpenClaw ecosystem and the first agent benchmark whose tasks originate entirely from university students rather than researchers or annotators. Its distinctive aspects include authentic student-originated tasks from real academic workflows, bilingual coverage with natively Chinese tasks requiring culturally grounded competence, GPU-intensive tasks absent from all prior OpenClaw benchmarks, and a six-method evaluation framework paired with five-category safety auditing.
Results
Among the six frontier models evaluated, Claude Opus 4.6 achieved the highest average score (71.9) and shared the top pass rate of 55.0% with Claude Sonnet 4.6, while 23 of 80 tasks defeated all six models. The analysis reveals three patterns: cross-category difficulty variation (a 26.3-point gap) far exceeds cross-model variation (an 8.8-point gap); token consumption shows near-zero correlation with task score (r = −0.03); and three distinct behavioral phenotypes (read-first, execute-first, minimalist) emerge with divergent efficiency and safety profiles. Boundary compliance (S3) is identified as the decisive safety dimension, with a 53-point gap between the safest and least safe models.
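The gap and correlation statistics above can be illustrated with a minimal Python sketch. It assumes each gap is the difference between the highest and lowest group mean, and that r is a Pearson coefficient over per-run token counts and scores; the summary does not confirm either definition. The records are invented toy values, not AcademiClaw results, and the names `group_means`, `gap`, and `pearson_r` are hypothetical.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy records, NOT AcademiClaw data: (model, category, score, tokens).
RECORDS = [
    ("model_a", "math", 40.0, 100_000),
    ("model_a", "writing", 80.0, 200_000),
    ("model_b", "math", 50.0, 200_000),
    ("model_b", "writing", 90.0, 100_000),
]


def group_means(records, key_index):
    """Average the score (field 2) per model (key_index=0) or category (key_index=1)."""
    groups = {}
    for record in records:
        groups.setdefault(record[key_index], []).append(record[2])
    return {key: mean(scores) for key, scores in groups.items()}


def gap(means):
    """Assumed gap definition: highest group mean minus lowest group mean."""
    return max(means.values()) - min(means.values())


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


category_gap = gap(group_means(RECORDS, 1))
model_gap = gap(group_means(RECORDS, 0))
token_score_r = pearson_r([r[3] for r in RECORDS], [r[2] for r in RECORDS])

print(f"cross-category gap: {category_gap:.1f}")
print(f"cross-model gap:    {model_gap:.1f}")
print(f"token/score r:      {token_score_r:.2f}")
```

In this toy table the category gap is larger than the model gap and tokens are uncorrelated with score, which mirrors the shape of the reported finding without reproducing its numbers.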
Key Points
- AcademiClaw contains 80 bilingual, long-horizon academic tasks curated from 230 student-submitted real-world problems across 25+ domains. Its 16 tasks requiring CUDA GPU execution make it the first GPU-inclusive benchmark in the OpenClaw ecosystem.
- The benchmark evaluates agents in isolated Docker environments using task-specific multi-dimensional rubrics (3–6 scoring dimensions per task) built from six complementary verification techniques, plus five-category safety auditing and trajectory logging.
- Evaluation of six frontier models reveals limited success (top pass rate 55.0%, 23 tasks unsolved by all models), strong category-dependent difficulty where cross-category variation exceeds cross-model variation, and near-zero correlation between token consumption and output quality (r = −0.03).
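The rubric and pass-rate figures in the points above can be sketched in Python as follows. This is a sketch under stated assumptions, not the benchmark's scoring code: the summary gives only the number of dimensions per task (3 to 6), so the weighted-average aggregation, the 0-100 scale, and the names `Dimension`, `task_score`, `pass_rate`, and `unsolved_by_all` are all hypothetical. How a task is judged to have passed is not stated here, so pass outcomes are taken as given booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str      # hypothetical rubric dimension label
    weight: float  # assumed relative weight (not specified in the summary)
    score: float   # assumed 0-100 scale


def task_score(dimensions):
    """Assumed aggregation: weighted average over 3-6 rubric dimensions."""
    if not 3 <= len(dimensions) <= 6:
        raise ValueError("expected 3 to 6 scoring dimensions per task")
    total_weight = sum(d.weight for d in dimensions)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(d.weight * d.score for d in dimensions) / total_weight


def pass_rate(passed):
    """Percentage of tasks passed, given one boolean per task."""
    return 100.0 * sum(passed) / len(passed)


def unsolved_by_all(results):
    """Count tasks that no model passed.

    `results` maps a model name to its per-task booleans, in the same task order.
    """
    return sum(1 for outcomes in zip(*results.values()) if not any(outcomes))


# Toy rubric with invented dimension names and scores.
rubric = [
    Dimension("correctness", 2.0, 50.0),
    Dimension("completeness", 1.0, 100.0),
    Dimension("reproducibility", 1.0, 0.0),
]
print(task_score(rubric))

# A 55.0% pass rate on 80 tasks corresponds to 44 passed tasks.
print(pass_rate([True] * 44 + [False] * 36))
```

Keeping the pass decision outside `task_score` avoids baking in a threshold that the summary does not specify.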
References
- arXiv: https://arxiv.org/abs/2605.02661v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2605.02661v1
- Hugging Face Papers: https://huggingface.co/papers/2605.02661
- GitHub: https://github.com/GAIR-NLP/AcademiClaw