ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
Abstract Overview
ClawMark is a benchmark for evaluating coworker-style language agents on multi-turn workflows that unfold across multiple in-universe working days. It places agents in a dynamic, stateful sandbox with five services—filesystem, email, calendar, knowledge base, and spreadsheet—where external state can change between turns through both announced and silent updates. The benchmark emphasizes raw multimodal evidence, including images, scanned PDFs, audio, video, and spreadsheets, and uses deterministic rule-based scoring rather than LLM-based judging. The current release contains 100 tasks across 13 professional scenarios, scored by 1,537 Python checkers over post-execution service state.
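To make "rule-based scoring over post-execution service state" concrete, here is a minimal sketch of what a deterministic checker could look like. The state layout, the checker names, and the example values (`INV-001`, `finance@example.com`) are all hypothetical and are not taken from the ClawMark release; the only point is that each checker is a pure function of the final state and returns a boolean.

```python
from typing import Callable, Dict

# Hypothetical snapshot of the sandbox after the agent has finished:
# one entry per service, each holding plain Python data.
State = Dict[str, dict]
Checker = Callable[[State], bool]


def check_invoice_logged(state: State) -> bool:
    """Pass only if the spreadsheet holds a row for INV-001 with amount 1200."""
    rows = state.get("spreadsheet", {}).get("rows", [])
    return any(
        r.get("invoice") == "INV-001" and r.get("amount") == 1200 for r in rows
    )


def check_reply_sent(state: State) -> bool:
    """Pass only if some sent email is addressed to the finance mailbox."""
    sent = state.get("email", {}).get("sent", [])
    return any(m.get("to") == "finance@example.com" for m in sent)


def run_checkers(state: State, checkers: Dict[str, Checker]) -> Dict[str, bool]:
    """Run every checker against the same final state, in name order."""
    return {name: checkers[name](state) for name in sorted(checkers)}


# Illustrative final state: the row was written, but no email was sent.
final_state: State = {
    "spreadsheet": {"rows": [{"invoice": "INV-001", "amount": 1200}]},
    "email": {"sent": []},
}

verdicts = run_checkers(
    final_state,
    {"invoice_logged": check_invoice_logged, "reply_sent": check_reply_sent},
)
print(verdicts)
```

Because the checkers read only the final state and involve no model calls or randomness, running them twice on the same snapshot gives the same verdicts.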
Novelty
The paper's main novelty is the combination of three evaluation properties that prior benchmarks usually treat separately: a multi-day task structure with 2–6 turns per task, exogenous between-turn environment mutation (both announced and silent), and full multimodal office-style evidence delivered without pre-transcription. It is also distinctive in enforcing a no-LLM-as-judge protocol: verification is done by deterministic checkers, and checker verdicts must be bit-identical across re-runs as a condition for release.
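The distinction between announced and silent updates can be illustrated with a small sketch. It assumes, purely for illustration, that service state can be flattened into a key-value snapshot and that the set of announced keys is known; the keys and values below are invented, and this is not how ClawMark itself represents or injects changes.

```python
from typing import Any, Dict, Set

Snapshot = Dict[str, Any]

# Sentinel so that a missing key is never confused with a stored None.
_MISSING = object()


def changed_keys(before: Snapshot, after: Snapshot) -> Set[str]:
    """Return every key that was added, removed, or modified between snapshots."""
    keys = set(before) | set(after)
    return {
        k for k in keys if before.get(k, _MISSING) != after.get(k, _MISSING)
    }


def silent_changes(before: Snapshot, after: Snapshot, announced: Set[str]) -> Set[str]:
    """Changes that happened between turns but were never announced to the agent."""
    return changed_keys(before, after) - announced


# Hypothetical state at the end of Day 1 and at the start of Day 2.
day1 = {"calendar/standup": "09:00", "kb/policy": "v1", "sheet/budget": 5000}
day2 = {"calendar/standup": "10:30", "kb/policy": "v2", "sheet/budget": 5000}

# Only the calendar move was mentioned in the Day 2 message.
announced = {"calendar/standup"}

print(sorted(silent_changes(day1, day2, announced)))  # ['kb/policy']
```

An agent that only acts on what it is told would miss the `kb/policy` change here; catching it requires re-inspecting the environment at the start of each turn.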
Results
Across seven frontier agent systems, the best weighted score is 75.8 (Claude Sonnet 4.6), while the best strict Task Success reaches only 20.0% (Claude Opus 4.6), indicating that partial progress is common but end-to-end workflow completion remains rare. Turn-level analysis on the 73 three-turn tasks shows that six of seven models drop in performance after the first exogenous environment update on Day 2, and failure-mode analysis identifies silent-change detection (56.5% fail rate) and backend writeback (53.6% fail rate) as the dominant failure categories.
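The gap between a high weighted score and a low strict Task Success rate is easier to see with a toy calculation. The sketch below assumes that the weighted score is the weight-normalized share of passed checkers on a 0 to 100 scale and that strict success requires every checker to pass; the summary does not give the exact formulas, so both definitions, along with the checker names and weights, are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CheckResult:
    name: str
    weight: float
    passed: bool


def weighted_score(results: List[CheckResult]) -> float:
    """Assumed definition: share of total checker weight that passed, scaled to 0-100."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in results if r.passed)
    return 100.0 * earned / total


def task_success(results: List[CheckResult]) -> bool:
    """Assumed definition: the task counts as solved only if every checker passes."""
    return all(r.passed for r in results)


# Hypothetical task with four equally weighted checkers, one of which fails.
results = [
    CheckResult("email_reply_sent", 1.0, True),
    CheckResult("calendar_updated", 1.0, True),
    CheckResult("report_file_written", 1.0, True),
    CheckResult("spreadsheet_row_written", 1.0, False),
]

print(weighted_score(results))  # 75.0
print(task_success(results))    # False
```

Under these assumptions a single missed checker leaves most of the weighted credit intact while zeroing out strict success, which is consistent with the reported pattern of common partial progress and rare full completion.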
Key Points
- ClawMark evaluates agents in evolving multi-day office workflows with exogenous between-turn state changes, rather than single static episodes, spanning 100 tasks across 13 professional scenarios.
- The benchmark uses five stateful sandboxed services and rule-based scoring with 1,537 deterministic Python checkers (including 55 red-line constraints), avoiding LLM-as-judge evaluation and requiring bit-identical verdicts across re-runs.
- Empirical results show substantial headroom: no model exceeds a weighted score of 75.8, strict Task Success is at most 20.0%, and the two dominant failure modes, silent-change detection (56.5% fail rate) and backend writeback (53.6% fail rate), highlight adaptation to exogenous state changes as a key open challenge.
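One way to test the "bit-identical verdicts across re-runs" requirement mentioned above is to serialize each run's verdicts canonically and compare digests. This is a sketch of that idea, not the benchmark's actual release tooling; the function names and the verdict format (checker name mapped to a boolean) are assumptions.

```python
import hashlib
import json
from typing import Dict, List


def verdict_digest(verdicts: Dict[str, bool]) -> str:
    """Hash a canonical JSON encoding so key order cannot affect the result."""
    canonical = json.dumps(verdicts, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_rerun_consistent(runs: List[Dict[str, bool]]) -> bool:
    """True when every re-run produced exactly the same set of verdicts."""
    return len({verdict_digest(v) for v in runs}) <= 1


# Hypothetical verdicts from re-running the checkers on the same final state.
run_a = {"reply_sent": True, "row_written": False}
run_b = {"row_written": False, "reply_sent": True}  # same verdicts, different key order
run_c = {"reply_sent": True, "row_written": True}   # one verdict flipped

print(is_rerun_consistent([run_a, run_b]))         # True
print(is_rerun_consistent([run_a, run_b, run_c]))  # False
```

Comparing digests rather than aggregate scores catches cases where two verdicts flip in opposite directions and leave the total unchanged.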