Fugu-MT Paper Translation (Abstract): MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

Paper abstract: MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

arxiv url: http://arxiv.org/abs/2605.06334v1
Date: Thu, 07 May 2026 14:26:19 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.896283
Title: MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
Title (reference translation): MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
Authors: Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck,
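Abstract summary: MANTRA is a framework that automatically synthesizes machine-checkable compliance benchmarks from natural-language manuals and tool schemas. We build a new benchmark suite of 285 tasks across 6 domains, scaling to 50+ page manuals with minimal human effort.
Reference score (attention score computed by the site): 0.815557531820863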
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity which is utilized to automatically derive challenging tasks accompanying compliance checks. Using MANTRA, we build a new benchmark suite with 285 tasks across 6 domains scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer with stronger constraint enforcement compared to existing benchmarks. Additionally, the granularity of the checks can be used for debugging the agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation methods enables scalable and reliable benchmarking of tool-using agents.
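Abstract (reference translation): Tool-using large language model (LLM) agents are increasingly deployed in settings where reliable behavior is governed by strict procedural manuals. Ensuring that such agents follow the rules in these manuals is difficult, because the manuals are typically written in natural language for humans, whereas agent behavior appears as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we propose MANTRA, a framework that automatically synthesizes machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies and (ii) a set of trace-level compliance checks for a given task, and validates their consistency using SMT solving. A structured repair loop resolves inconsistencies, requiring human intervention only as a fallback. Importantly, MANTRA supports arbitrary domains and long procedural manuals, and provides a tunable notion of task complexity that is used to automatically derive challenging tasks together with their compliance checks. Using MANTRA, we build a new benchmark suite of 285 tasks across 6 domains, scaling to 50+ page manuals with minimal human effort. Empirically, we show that the compliance checks are richer and enforce constraints more strongly than those of existing benchmarks. In addition, the granularity of the checks can be used to debug agents' failure modes. These results demonstrate that combining automated benchmark generation with formally grounded validation enables scalable and reliable benchmarking of tool-using agents.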
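
To make the idea of a trace-level compliance check concrete, here is a minimal Python sketch. It is not MANTRA's format or implementation: the `ToolCall` shape, the `PREREQUISITES` table, the `find_violations` helper, and the tool names are all hypothetical, and the check is done in plain Python without an SMT solver. It only illustrates how procedural dependencies from a manual could be checked against an execution trace of tool calls, and how per-rule results can point at the step where an agent went wrong.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step of an agent's execution trace (hypothetical shape)."""
    tool: str
    args: dict = field(default_factory=dict)


# Toy stand-in for a world model of procedural dependencies:
# each tool maps to the tools that must have been called before it.
# Tool names and rules are invented for illustration.
PREREQUISITES = {
    "issue_refund": {"verify_identity", "lookup_order"},
    "lookup_order": {"verify_identity"},
}


def find_violations(trace, prerequisites):
    """Return one (step index, tool, missing prerequisite) tuple per broken rule."""
    seen = set()
    violations = []
    for index, call in enumerate(trace):
        # Sort so the report order is deterministic.
        for required in sorted(prerequisites.get(call.tool, ())):
            if required not in seen:
                violations.append((index, call.tool, required))
        seen.add(call.tool)
    return violations


# A trace that follows the toy manual, and one that skips identity verification.
compliant = [ToolCall("verify_identity"), ToolCall("lookup_order"), ToolCall("issue_refund")]
skipped = [ToolCall("lookup_order"), ToolCall("issue_refund")]

print(find_violations(compliant, PREREQUISITES))
print(find_violations(skipped, PREREQUISITES))
```

Because each violation names the step, the tool, and the missing prerequisite, a report like this is finer-grained than a single pass/fail verdict, which is the kind of granularity the abstract describes as useful for debugging failure modes.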
