Fugu-MT Paper Translation (Abstract): Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

Paper overview: Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

arxiv url: http://arxiv.org/abs/2606.08081v1
Date: Sat, 06 Jun 2026 10:05:49 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:05.765768
Title: Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions
Title (reference translation): How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions
Authors: Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb
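Abstract summary: We compare capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one.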
Reference score (independently computed attention score): 2.3410384770553154
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.
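The pseudo-dyad comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Jaccard overlap measure, the function names, the data layout, and the toy labels are all assumptions made for illustration, and the paper's additional constraints on pseudo-dyad construction are not reproduced here. The idea shown is only that a pseudo dyad pairs one speaker with a partner they never interacted with, on the same targets, so that overlap which survives re-pairing cannot be attributed to shared interaction history.

```python
# Illustrative sketch only: the overlap measure (Jaccard) and the data layout
# are assumptions, not taken from the paper.

def jaccard(a, b):
    """Jaccard overlap between two collections of labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_overlap(labels_a, labels_b):
    """Mean per-target label overlap between two speakers.

    Each argument maps a target id to the set of labels that speaker used.
    Only targets described by both speakers are compared.
    """
    shared = sorted(labels_a.keys() & labels_b.keys())
    if not shared:
        raise ValueError("speakers share no targets")
    return sum(jaccard(labels_a[t], labels_b[t]) for t in shared) / len(shared)


def real_vs_pseudo(dyads):
    """Return (mean real-dyad overlap, mean pseudo-dyad overlap).

    dyads is a list of (speaker_a_labels, speaker_b_labels) pairs.
    A pseudo dyad pairs speaker A of one dyad with speaker B of a
    different dyad, keeping the targets but breaking partner history.
    """
    if len(dyads) < 2:
        raise ValueError("need at least two dyads to build pseudo dyads")
    real = [mean_overlap(a, b) for a, b in dyads]
    pseudo = [
        mean_overlap(dyads[i][0], dyads[j][1])
        for i in range(len(dyads))
        for j in range(len(dyads))
        if i != j
    ]
    return sum(real) / len(real), sum(pseudo) / len(pseudo)


# Toy data (invented): partner-specific labels, as humans might converge on.
human_like = [
    ({"t1": {"dancer"}, "t2": {"rabbit"}}, {"t1": {"dancer"}, "t2": {"rabbit"}}),
    ({"t1": {"skater"}, "t2": {"bunny"}}, {"t1": {"skater"}, "t2": {"bunny"}}),
]

# Toy data (invented): a shared task vocabulary used by every speaker.
agent_like = [
    ({"t1": {"triangle", "square"}}, {"t1": {"triangle", "square"}}),
    ({"t1": {"triangle", "square"}}, {"t1": {"triangle", "square"}}),
]

print(real_vs_pseudo(human_like))
print(real_vs_pseudo(agent_like))
```

In the toy partner-specific case, overlap is high within real dyads and drops once partners are re-paired; in the shared-vocabulary case, real and pseudo dyads score the same, which is the pattern the abstract reports for agents.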


Related papers