Fugu-MT Paper Translation (Abstract): CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

Paper summary: CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use

arxiv url: http://arxiv.org/abs/2602.12268v1
Date: Thu, 12 Feb 2026 18:55:09 GMT
Status: Translation complete
Last updated in system: 2026-02-13 21:07:25.992818
Title: CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
Title (reference translation): CM2: Reinforcement Learning with Checklist Rewards for Multi-Turn and Multi-Step Agentic Tool Use
Authors: Zhen Zhang, Kaiqiang Song, Xun Wang, Yebowen Hu, Weixiang Yan, Chenyang Zhao, Henry Peng Zou, Haoyun Deng, Sathish Reddy Indurthi, Shujian Liu, Simin Ma, Xiaoyang Wang, Xin Eric Wang, Song Wang
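Abstract summary: We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata. CM2 consistently improves over supervised fine-tuning.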
Reference score (independently computed attention score): 46.31709172579914
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and invoking external tools. However, applying reinforcement learning to such settings remains difficult: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; moreover, RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, limiting scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, our method adopts a strategy of sparse reward assignment but dense evaluation criteria. Training is performed in a scalable LLM-simulated tool environment, avoiding heavy engineering for large tool sets. Experiments show that CM2 consistently improves over supervised fine-tuning. Starting from an 8B Base model and training on an 8k-example RL dataset, CM2 improves over the SFT counterpart by 8 points on τ²-Bench, by 10 points on BFCL-V4, and by 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code provided by the open-source community: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent.
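Abstract (reference translation): AI agents are increasingly used to solve real-world tasks by reasoning over multi-turn user interactions and calling external tools. Applying reinforcement learning in such settings remains difficult, however: realistic objectives often lack verifiable rewards and instead emphasize open-ended behaviors; RL for multi-turn, multi-step agentic tool use is still underexplored; and building and maintaining executable tool environments is costly, which limits scale and coverage. We propose CM2, an RL framework that replaces verifiable outcome rewards with checklist rewards. CM2 decomposes each turn's intended behavior into fine-grained binary criteria with explicit evidence grounding and structured metadata, turning open-ended judging into more stable classification-style decisions. To balance stability and informativeness, the method assigns rewards sparsely while keeping the evaluation criteria dense. Training runs in a scalable LLM-simulated tool environment, which avoids heavy engineering for large tool sets. CM2 consistently improves over supervised fine-tuning. Starting from an 8B base model and training on an 8k-example RL dataset, CM2 improves over its SFT counterpart by 8 points on τ²-Bench, 10 points on BFCL-V4, and 12 points on ToolSandbox. The results match or even outperform similarly sized open-source baselines, including the judging model. CM2 thus provides a scalable recipe for optimizing multi-turn, multi-step tool-using agents without relying on verifiable rewards. Code: https://github.com/namezhenzhang/CM2-RLCR-Tool-Agent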
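The idea of "sparse reward assignment but dense evaluation criteria" can be illustrated with a small Python sketch. It is not the authors' implementation: the `Criterion` schema, the example checklist items, and the choice to average per-turn pass rates into one trajectory-level scalar are all assumptions made for illustration. In a real setup the `satisfied` flags would come from a judge model rather than being hard-coded.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Criterion:
    """One binary checklist item for a turn (hypothetical schema)."""
    description: str
    satisfied: bool  # verdict a judge would return; hard-coded in this sketch


def turn_score(criteria: List[Criterion]) -> float:
    """Fraction of satisfied criteria in one turn (assumed aggregation)."""
    if not criteria:
        return 0.0
    return sum(c.satisfied for c in criteria) / len(criteria)


def trajectory_reward(turns: List[List[Criterion]]) -> float:
    """One scalar for the whole trajectory: the mean of per-turn scores.

    Many criteria are evaluated (dense evaluation), but only a single
    reward value is emitted (sparse assignment). Averaging is an assumption.
    """
    if not turns:
        return 0.0
    return sum(turn_score(t) for t in turns) / len(turns)


# Two turns, two illustrative criteria each.
example = [
    [Criterion("calls the search tool", True),
     Criterion("passes the user's city as an argument", True)],
    [Criterion("reports the tool result to the user", True),
     Criterion("does not invent a booking ID", False)],
]
reward = trajectory_reward(example)
```

Here the first turn passes both checks and the second passes one of two, so the policy update would see a single reward for the trajectory even though four separate binary decisions were made.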


Related papers