Fugu-MT Paper Translation (Abstract): Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

Paper overview: Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

arxiv url: http://arxiv.org/abs/2605.14537v1
Date: Thu, 14 May 2026 08:20:03 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.709085
Title: Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining
Authors: Robert Müller, Clemens Müller
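Abstract summary: We introduce Cattle Trade, a benchmark for evaluating large language models (LLMs) as agents in strategic reasoning. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games.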
Reference score (independently computed attention score): 0.5691230599672109
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce \textsc{Cattle Trade}, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50--60 turns. Unlike prior agent benchmarks that test these abilities in isolation, \textsc{Cattle Trade} evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.
