Fugu-MT Paper Translation (Abstract): LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

Paper overview: LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

arxiv url: http://arxiv.org/abs/2604.13072v1
Date: Fri, 20 Mar 2026 16:08:21 GMT
Status: Translation complete
System update date: 2026-04-19 19:09:11.658624
Title: LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
Title (reference translation): LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
Authors: Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang,
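Abstract summary: We introduce LiveClawBench, a benchmark for evaluating LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework. We construct a pilot benchmark with explicit complexity-factor annotations that covers real-world assistant tasks.
Reference score (independently computed attention score): 58.3639630490749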
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.
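Abstract (reference translation): LLM-based agents are increasingly expected to handle real-world assistant tasks, but existing benchmarks usually evaluate them on isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a considerable gap between current evaluation settings and the compositional challenges that arise in actual deployment. To address this gap, we introduce LiveClawBench, a benchmark that evaluates LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw use cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we build a pilot benchmark with explicit complexity-factor annotations that covers real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.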
