Fugu-MT paper translation (summary): Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay

Paper summary: Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay

arxiv url: http://arxiv.org/abs/2605.19352v1
Date: Tue, 19 May 2026 04:40:14 GMT
Status: translation complete
Last updated in system: 2026-05-20 15:03:09.12458
Title: Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay
Title (reference translation): Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay
Authors: Subba Reddy Oota, Anant Khandelwal, Khushbu Pahwa, Satya Sai Srinath Namburi, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta
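Abstract summary: The study examines the brain alignment of representative models from two foundation-model families using fMRI recordings. Both vision-language models (VLMs) and large-action models (LAMs) are found to achieve significantly higher voxel-wise encoding performance than RL baselines.
Reference score (attention score computed independently by this site): 40.44107959566121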
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding how humans and artificial intelligence systems predict and plan by interacting with their environment is a fundamental challenge at the intersection of neuroscience and machine learning. Most brain-encoding studies focus on aligning artificial models with brain activity during language comprehension or passive visual processing, while interactive brain-alignment studies have to date been largely limited to reinforcement-learning (RL) agents and theory-based models. To address this gap, we study brain alignment of representative models from two foundation-model families, namely vision-language models (VLMs) and large-action models (LAMs), using fMRI recordings from participants playing naturalistic Atari-style video games. Specifically, we examine how action-focused and reasoning-focused prompts shape the models' internal representations and their alignment with fMRI brain activity. First, we find that both VLMs and LAMs exhibit significantly higher voxel-wise encoding performance than RL baselines, with the advantage holding even under matched feature dimensionality. Second, prompt-driven gains scale with the cortical processing hierarchy: the largest improvements appear in frontal-parietal and motor-planning regions, while early visual cortex gains roughly half as much. Third, variance partitioning reveals a qualitatively different representational organization: VLM is prompt-symmetric (12.5% unique action vs. 13.6% unique reasoning), whereas LAM is prompt-asymmetric (27% unique action vs. -5% unique reasoning), with the asymmetry strongest in frontal-motor cortex. Together, these results demonstrate that action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations even when whole-brain prediction accuracy is statistically equivalent between VLM and LAM.
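Abstract (reference translation): Understanding how humans and artificial intelligence systems predict and plan by interacting with their environment is a fundamental challenge at the intersection of neuroscience and machine learning. Most brain-encoding studies have focused on aligning artificial models with brain activity during language comprehension or passive visual processing, whereas interactive brain-alignment studies have so far been largely limited to reinforcement-learning (RL) agents and theory-based models. This study examines the brain alignment of representative models from two foundation-model families, vision-language models (VLMs) and large-action models (LAMs), using fMRI recordings from participants playing naturalistic Atari-style video games. Specifically, it examines how action-focused and reasoning-focused prompts shape the models' internal representations and their alignment with fMRI brain activity. First, both VLMs and LAMs show significantly higher voxel-wise encoding performance than RL baselines, and this advantage holds even under matched feature dimensionality. Second, prompt-driven gains scale with the cortical processing hierarchy: the largest improvements appear in frontal-parietal and motor-planning regions, while early visual cortex gains roughly half as much. Third, variance partitioning reveals a qualitatively different representational organization: VLM is prompt-symmetric (12.5% unique action vs. 13.6% unique reasoning), whereas LAM is prompt-asymmetric (27% unique action vs. -5% unique reasoning), with the asymmetry strongest in frontal-motor cortex. These results show that action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations, even when whole-brain prediction accuracy is statistically equivalent between VLM and LAM.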
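The abstract names two analysis steps, voxel-wise encoding and variance partitioning, without giving their details. The following is a minimal sketch of how such an analysis is commonly set up, not the authors' pipeline: it assumes closed-form ridge regression, a single train/test split, and synthetic data. All function names, the regularization strength `alpha`, and the array shapes are hypothetical choices made for illustration.

```python
import numpy as np

def ridge_fit_predict(X_train, Y_train, X_test, alpha=1.0):
    """Closed-form ridge regression from features to all voxels at once."""
    # Standardize features using training statistics only.
    mu, sd = X_train.mean(0), X_train.std(0) + 1e-8
    Xtr, Xte = (X_train - mu) / sd, (X_test - mu) / sd
    y_mu = Y_train.mean(0)
    gram = Xtr.T @ Xtr + alpha * np.eye(Xtr.shape[1])
    W = np.linalg.solve(gram, Xtr.T @ (Y_train - y_mu))
    return Xte @ W + y_mu

def r2_per_voxel(Y_true, Y_pred):
    """Held-out coefficient of determination, one value per voxel."""
    ss_res = ((Y_true - Y_pred) ** 2).sum(0)
    ss_tot = ((Y_true - Y_true.mean(0)) ** 2).sum(0)
    return 1.0 - ss_res / ss_tot

def variance_partition(X_a, X_b, Y, n_train, alpha=1.0):
    """Split joint explained variance into unique and shared parts."""
    tr, te = slice(0, n_train), slice(n_train, None)

    def score(X):
        return r2_per_voxel(Y[te], ridge_fit_predict(X[tr], Y[tr], X[te], alpha))

    r2_a, r2_b = score(X_a), score(X_b)
    r2_ab = score(np.hstack([X_a, X_b]))
    return {
        "unique_a": r2_ab - r2_b,      # explained only by feature set A
        "unique_b": r2_ab - r2_a,      # explained only by feature set B
        "shared": r2_a + r2_b - r2_ab,
        "joint": r2_ab,
    }

# Synthetic stand-ins (assumption): "action-prompt" features drive the
# voxels, "reasoning-prompt" features are unrelated noise.
rng = np.random.default_rng(0)
X_action = rng.standard_normal((400, 5))
X_reasoning = rng.standard_normal((400, 5))
Y = X_action @ rng.standard_normal((5, 10)) + 0.1 * rng.standard_normal((400, 10))

parts = variance_partition(X_action, X_reasoning, Y, n_train=300)
print({name: float(vals.mean()) for name, vals in parts.items()})
```

Because the scores are computed on held-out data, a unique component can come out negative when adding a feature set hurts generalization. This is one possible reading of a negative unique-variance figure such as the -5% reported above, though the abstract does not say how that value was obtained.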

Related papers

Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
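We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual characteristics. In experiments, MLLMs trained with Cognitive Supersensing substantially outperform state-of-the-art baselines on CogSense-Bench. We will open-source CogSense-Bench and the model weights.
Paper reference translation (metadata) (2026-02-02T02:19:50Z)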
Probing the Representational Geometry of Color Qualia: Dissociating Pure Perception from Task Demands in Brains and AI Models [6.165387850279033]
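We conduct a rigorous comparison of the representational geometry of color qualia between state-of-the-art AI models and the human brain. Our work contributes to the field a new benchmark task for color qualia, packaged in a Brain-Score-compatible format.
Paper reference translation (metadata) (2025-10-26T19:13:16Z)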
Uncovering Brain-Like Hierarchical Patterns in Vision-Language Models through fMRI-Based Neural Encoding [34.313883741642066]
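Our current understanding of the parallels between artificial neural networks (ANNs) and human brain processing remains limited. We propose a novel neuron-level analysis framework that examines the multimodal information-processing mechanisms of vision-language models (VLMs) through the lens of human brain activity.
Paper reference translation (metadata) (2025-10-19T15:11:03Z)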
Disentangling the Factors of Convergence between Brains and Computer Vision Models [11.560007214914465]
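We show that the largest DINOv3 models trained on human-centric images reach the highest brain similarity. These findings disentangle the interplay between architecture and experience in how artificial neural networks come to see the world as humans do.
Paper reference translation (metadata) (2025-08-25T17:23:27Z)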
Do Large Language Models Think Like the Brain? Sentence-Level Evidence from fMRI and Hierarchical Embeddings [28.210559128941593]
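This study examines how hierarchical representations in large language models align with dynamic neural responses during human sentence comprehension. The results show that improvements in model performance drive the representational architecture to evolve toward a brain-like hierarchy.
Paper reference translation (metadata) (2025-05-28T16:40:06Z)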
Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
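Embodied-R is a framework that combines large-scale vision-language models for perception with small-scale language models for reasoning. After training on only 5k embodied video samples, Embodied-R with a 3B LM matched state-of-the-art multimodal reasoning models. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
Paper reference translation (metadata) (2025-04-17T06:16:11Z)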
Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
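VisFactor is a benchmark that digitizes 20 vision-centric subtests from well-established cognitive psychology assessments. We evaluate 20 frontier multimodal large language models (MLLMs) from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model scores only 25.19 out of 100 and consistently fails on tasks such as mental rotation, spatial relation inference, and figure discrimination.
Paper reference translation (metadata) (2025-02-23T04:21:32Z)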
Brain-like Functional Organization within Large Language Models [58.93629121400745]
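The human brain has long inspired the pursuit of artificial intelligence (AI). Recent neuroimaging studies provide compelling evidence of alignment between the computational representations of artificial neural networks (ANNs) and the neural responses of the human brain to stimuli. In this study, we bridge this gap by directly coupling subgroups of artificial neurons with functional brain networks (FBNs). The framework links AN subgroups to FBNs, making it possible to describe brain-like functional organization within large language models (LLMs).
Paper reference translation (metadata) (2024-10-25T13:15:17Z)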
Modelling Multimodal Integration in Human Concept Processing with Vision-Language Models [7.511284868070148]
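We investigate whether integrating vision-language information leads to representations that are more closely aligned with human brain activity. The results suggest that multimodal models are useful for predicting human brain activation.
Paper reference translation (metadata) (2024-07-25T10:08:37Z)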
Multimodal foundation models are better simulators of the human brain [65.10501322822881]
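We propose a newly designed multimodal foundation model pre-trained on 15 million image-text pairs. We find that both the visual encoder and the language encoder, when trained multimodally, are more brain-like.
Paper reference translation (metadata) (2022-08-17T12:36:26Z)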

The related-papers list is generated automatically from the titles and abstracts of papers on this site.

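This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.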