Fugu-MT Paper Translation (Abstract): Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models

Paper abstract: Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models

arxiv url: http://arxiv.org/abs/2601.16378v1
Date: Fri, 23 Jan 2026 00:21:27 GMT
Status: Translation complete
Last updated in system: 2026-01-26 14:27:27.494644
Title: Cognitively-Inspired Tokens Overcome Egocentric Bias in Multimodal Models
Authors: Bridget Leonard, Scott O. Murray
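Abstract summary: Multimodal language models (MLMs) fail at spatial reasoning that requires adopting another agent's visual perspective. Perspective tokens, inspired by human spatial cognition, encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Across synthetic and naturalistic benchmarks, perspective tokens improve accuracy, and rotation-based tokens generalize to non-human reference agents.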
Reference score (independently computed attention measure): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal language models (MLMs) perform well on semantic vision-language tasks but fail at spatial reasoning that requires adopting another agent's visual perspective. These errors reflect a persistent egocentric bias and raise questions about whether current models support allocentric reasoning. Inspired by human spatial cognition, we introduce perspective tokens, specialized embeddings that encode orientation through either (1) embodied body-keypoint cues or (2) abstract representations supporting mental rotation. Integrating these tokens into LLaVA-1.5-13B yields performance on level-2 visual perspective-taking tasks. Across synthetic and naturalistic benchmarks (Isle Bricks V2, COCO, 3DSRBench), perspective tokens improve accuracy, with rotation-based tokens generalizing to non-human reference agents. Representational analyses reveal that fine-tuning enhances latent orientation sensitivity already present in the base model, suggesting that MLMs contain precursors of allocentric reasoning but lack appropriate internal structure. Overall, embedding cognitively grounded spatial structure directly into token space provides a lightweight, model-agnostic mechanism for perspective-taking and more human-like spatial reasoning.

Related papers