Fugu-MT Paper Translation (Summary): VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

Paper summary: VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading

arxiv url: http://arxiv.org/abs/2605.28818v1
Date: Wed, 27 May 2026 17:59:34 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:56.272058
Title: VLMs May Not Globally Enhance Human Alignment over LLMs During Natural Reading
Title (reference translation): VLMs May Not Globally Enhance Human Alignment During Natural Reading
Authors: Jinzhou Wu, Zhengwu Ma, Jixing Li, Baoping Tang, Zitong Lu
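Abstract summary: The paper compares tightly matched pairs of large language models (LLMs) and vision-language models (VLMs) under a strictly text-only setting. Multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading. The findings suggest that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.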
Reference score (independently computed attention score): 4.643551569750331
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have become increasingly useful computational models of human language processing, but it remains unclear whether vision-language learning makes text representations more human-like during natural reading. Here, we address this question by comparing tightly matched LLM and vision-language model (VLM) pairs under a strictly text-only setting, allowing us to isolate the effect of multimodal training history from online visual input or cross-modal fusion. We evaluate model alignment with a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. Our findings demonstrate that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, indicating that language-internal representations remain the key factor for modeling human text processing. However, the VLM advantage could emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignments. Together, our findings provide a controlled in silico framework for testing how visual learning history shapes model-human alignment of language processing, suggesting that multimodal pretraining contributes selectively rather than globally to human-like language representations during natural reading.
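Abstract (reference translation): Large language models (LLMs) are becoming increasingly useful computational models of human language processing, but it is unclear whether vision-language learning makes text representations more human-like during natural reading. This paper compares tightly matched pairs of LLMs and vision-language models (VLMs) under a strictly text-only setting, isolating the effect of multimodal training history from online visual input and cross-modal fusion. Model alignment is evaluated against a human natural-reading dataset that includes whole-cortex fMRI responses and synchronized eye-tracking saccades. The results suggest that multimodal pretraining may not confer a uniform, global advantage in human alignment during natural reading, and that language-internal representations remain the key factor in modeling human text processing. However, a VLM advantage may emerge more selectively when sentences contain stronger visual semantic content, with converging evidence from both fMRI and eye-movement alignment. The study thus provides a controlled in silico framework for testing how visual learning history shapes model-human alignment in language processing.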
