Fugu-MT Paper Translation (Abstract): Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

Paper summary: Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue

arxiv url: http://arxiv.org/abs/2604.21144v1
Date: Wed, 22 Apr 2026 23:15:42 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.210083
Title: Using Machine Mental Imagery for Representing Common Ground in Situated Dialogue
Authors: Biswesh Mohapatra, Giovanni Duca, Laurent Romary, Justine Cassell
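Abstract summary: Speakers need to maintain a reliable representation of shared context. Current conversational agents often struggle with this requirement. This paper proposes an active visual scaffolding framework that converts dialogue state into a persistent visual history.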
Reference score (independently computed attention score): 3.1039961644960186
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Situated dialogue requires speakers to maintain a reliable representation of shared context rather than reasoning only over isolated utterances. Current conversational agents often struggle with this requirement, especially when the common ground must be preserved beyond the immediate context window. In such settings, fine-grained distinctions are frequently compressed into purely textual representations, leading to a critical failure mode we call "representational blur", in which similar but distinct entities collapse into interchangeable descriptions. This semantic flattening creates an illusion of grounding, where agents appear locally coherent but fail to track shared context persistently over time. Inspired by the role of mental imagery in human reasoning, and based on the increased availability of multimodal models, we explore whether conversational agents can be given an analogous ability to construct some depictive intermediate representations during dialogue to address these limitations. Thus, we introduce an active visual scaffolding framework that incrementally converts dialogue state into a persistent visual history that can later be retrieved for grounded response generation. Evaluation on the IndiRef benchmark shows that incremental externalization itself improves over full-dialog reasoning, while visual scaffolding provides additional gains by reducing representational blur and enforcing concrete scene commitments. At the same time, textual representations remain advantageous for non-depictable information, and a hybrid multimodal setting yields the best overall performance. Together, these findings suggest that conversational agents benefit from an explicitly multimodal representation of common ground that integrates depictive and propositional information.
