FuguReport

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Authors: Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang, Ming Xu, Qionglin Qiu, Runhao Fu, Shengfang Zhai, Shijian Wang, Tengfei Ma, Tianyi Wu, Weiyang Jin, Yan Wang, Yang Dai, Yao Lai, Youwei Shu, Yue Liu, Yunzhuo Hao, Yuwei Niu, Jinkai Huang, Jiayuan Zhuo, Zhennan Shen, Linyu Wu, Cihang Xie, Yuyin Zhou, Jiaheng Zhang, Zeyu Zheng, Mengkang Hu, Michael Qizhe Shieh
Affiliations: Evolvent AI
Categories: Evaluation / Benchmarking / Multi-turn multi-day task performance; Task / Interactive Agents / Coworker agent simulation; Application / Multimodal Systems / Professional scenario tasks
License: CC BY 4.0

Abstract Overview

ClawMark is a benchmark for evaluating coworker-style language agents on multi-turn workflows that unfold across multiple in-universe working days. It places agents in a dynamic, stateful sandbox with five services—filesystem, email, calendar, knowledge base, and spreadsheet—where external state can change between turns through both announced and silent updates. The benchmark emphasizes raw multimodal evidence, including images, scanned PDFs, audio, video, and spreadsheets, and uses deterministic rule-based scoring rather than LLM-based judging. The current release contains 100 tasks across 13 professional scenarios, scored by 1,537 Python checkers over post-execution service state.
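
To make rule-based scoring over post-execution service state concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the state layout, the `Checker` class, the weights, and the scoring formula (passed weight divided by total weight, with strict success meaning every checker passes) are hypothetical and are not taken from the ClawMark codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical snapshot of service state after the agent has finished.
# This layout is an assumption, not ClawMark's actual schema.
State = Dict[str, dict]


@dataclass(frozen=True)
class Checker:
    name: str
    weight: float
    check: Callable[[State], bool]  # pure function of state; no LLM call


def score_task(state: State, checkers: List[Checker]) -> dict:
    """Return an assumed weighted score (0-100) and a strict all-pass flag."""
    verdicts = {c.name: bool(c.check(state)) for c in checkers}
    total = sum(c.weight for c in checkers)
    earned = sum(c.weight for c in checkers if verdicts[c.name])
    return {
        "verdicts": verdicts,
        "weighted_score": 100.0 * earned / total,
        "task_success": all(verdicts.values()),
    }


state = {
    "email": {"sent": [{"to": "finance@example.com", "subject": "Q3 invoice"}]},
    "calendar": {"events": [{"title": "Vendor sync", "day": 2}]},
    "spreadsheet": {"budget.xlsx": {"B7": 1200}},
}

checkers = [
    Checker("invoice_email_sent", 2.0,
            lambda s: any(m["to"] == "finance@example.com" for m in s["email"]["sent"])),
    Checker("meeting_on_day_2", 1.0,
            lambda s: any(e["day"] == 2 for e in s["calendar"]["events"])),
    # The agent left the old value in the sheet, so this checker fails.
    Checker("budget_cell_updated", 1.0,
            lambda s: s["spreadsheet"]["budget.xlsx"]["B7"] == 1350),
]

result = score_task(state, checkers)
print(result["weighted_score"], result["task_success"])
```

In this toy task the agent earns three quarters of the available weight yet fails the strict criterion, which is the kind of gap between partial credit and full completion that a checker-based design can expose.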

Novelty

The paper's main novelty is combining three evaluation properties that prior benchmarks usually address separately: a multi-day task structure with 2–6 turns per task, exogenous between-turn environment mutation (both announced and silent), and full multimodal office-style evidence delivered without pre-transcription. It is also distinctive in enforcing a no-LLM-as-judge protocol: verification relies on deterministic checkers, and a release requires bit-identical verdicts across re-runs.
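
The difference between announced and silent between-turn updates can be sketched as follows. This is a toy model under stated assumptions: `Update`, `run_task`, `naive_agent`, and the state layout are hypothetical names invented here, and the real benchmark harness may work quite differently.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, dict]  # hypothetical sandbox state, keyed by service name


@dataclass
class Update:
    apply: Callable[[State], None]      # mutates the sandbox state in place
    announcement: Optional[str] = None  # None means the change is silent


def run_task(state: State, turns: List[str],
             updates: Dict[int, List[Update]],
             agent: Callable[[State, str], None]) -> State:
    """Play one turn per day, applying exogenous updates before each turn."""
    state = copy.deepcopy(state)  # leave the caller's initial state untouched
    for day, prompt in enumerate(turns, start=1):
        notes = []
        for upd in updates.get(day, []):
            upd.apply(state)
            if upd.announcement is not None:
                notes.append(upd.announcement)  # only announced changes reach the prompt
        agent(state, " ".join([prompt] + notes))
    return state


def naive_agent(state: State, prompt: str) -> None:
    # Caches the price on first sight and never re-reads the spreadsheet.
    memory = state.setdefault("agent_memory", {})
    memory.setdefault("price", state["spreadsheet"]["quote"]["price"])
    state["email"]["sent"].append({"prompt": prompt, "price": memory["price"]})


initial = {"spreadsheet": {"quote": {"price": 100}}, "email": {"sent": []}}
# A silent Day 2 change: the price moves, but nothing is added to the prompt.
updates = {2: [Update(lambda s: s["spreadsheet"]["quote"].update(price=120))]}

final = run_task(initial, ["Draft the quote.", "Send the final quote."],
                 updates, naive_agent)
print(final["spreadsheet"]["quote"]["price"], final["email"]["sent"][-1]["price"])
```

The naive agent sends the stale Day 1 price on Day 2 because nothing in its prompt signals the change; an agent would have to re-read service state each turn to catch it.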

Results

Across seven frontier agent systems, the best weighted score is 75.8 (Claude Sonnet 4.6), while the best strict Task Success reaches only 20.0% (Claude Opus 4.6), indicating that partial progress is common but end-to-end workflow completion remains rare. Turn-level analysis on the 73 three-turn tasks shows that six of seven models drop in performance after the first exogenous environment update on Day 2, and failure-mode analysis identifies silent-change detection (56.5% fail rate) and backend writeback (53.6% fail rate) as the dominant failure categories.

Key Points

  1. ClawMark evaluates agents in evolving multi-day office workflows with exogenous between-turn state changes, rather than single static episodes, spanning 100 tasks across 13 professional scenarios.
  2. The benchmark uses five stateful sandboxed services and rule-based scoring with 1,537 deterministic Python checkers (including 55 red-line constraints), avoiding LLM-as-judge evaluation and requiring bit-identical verdicts across re-runs.
  3. Empirical results show substantial headroom: no model exceeds a weighted score of 75.8, strict Task Success remains at most 20.0%, and the two dominant failure modes, silent-change detection (56.5% fail rate) and backend writeback (53.6% fail rate), highlight adaptation to exogenous state changes as a key open challenge.
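
The requirement of bit-identical verdicts across re-runs, mentioned in point 2, could be checked along the following lines. This is only a sketch: `verdict_digest`, `is_rerun_stable`, the example checkers, and the idea of hashing a canonical JSON encoding of the verdicts are assumptions made here, not the paper's actual release procedure. The red-line checker is likewise a hypothetical example of a constraint on something the agent must not do.

```python
import hashlib
import json
from typing import Callable, Dict

State = Dict[str, dict]  # hypothetical post-execution service state
Checkers = Dict[str, Callable[[State], bool]]


def verdict_digest(state: State, checkers: Checkers) -> str:
    """Hash a canonical encoding of all verdicts for one evaluation pass."""
    verdicts = {name: bool(fn(state)) for name, fn in checkers.items()}
    payload = json.dumps(verdicts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_rerun_stable(state: State, checkers: Checkers, runs: int = 3) -> bool:
    """True if every evaluation pass over the same state yields the same digest."""
    digests = {verdict_digest(state, checkers) for _ in range(runs)}
    return len(digests) == 1


state = {"email": {"sent": [{"to": "legal@example.com"}]}}

deterministic = {
    "notice_sent_to_legal": lambda s: any(
        m["to"] == "legal@example.com" for m in s["email"]["sent"]),
    # Hypothetical red-line constraint: nothing may go to an outside address.
    "redline_no_external_mail": lambda s: all(
        m["to"].endswith("@example.com") for m in s["email"]["sent"]),
}

# A checker with hidden state flips its verdict between passes and is rejected.
calls = {"n": 0}


def flaky(_state: State) -> bool:
    calls["n"] += 1
    return calls["n"] % 2 == 0


stable = is_rerun_stable(state, deterministic)
unstable = is_rerun_stable(state, {"flaky": flaky})
print(stable, unstable)
```

Checkers that are pure functions of the final state pass this test trivially, which is one reason a rule-based design is easier to reproduce than a sampled LLM judge.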
