Fugu-MT Paper Translation (Summary): Context Matters: Learning Global Semantics for Visual Reasoning and Comprehension

Paper summary: Context Matters: Learning Global Semantics for Visual Reasoning and Comprehension

arxiv url: http://arxiv.org/abs/2510.05674v1
Date: Tue, 07 Oct 2025 08:33:36 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:08.158303
Title: Context Matters: Learning Global Semantics for Visual Reasoning and Comprehension
Authors: Jike Zhong, Yuxiang Lai, Xiaofeng Yang, Konstantinos Psounis
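Abstract summary: Vision models have yet to show progress in in-context learning comparable to that of language models. The authors argue that this gap may stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes. They propose to model "object" directly as the visual equivalent of "word," pushing the model to learn the global context and semantics among visual elements.
Reference score (independently computed attention measure): 8.195437248815802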
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in language modeling have witnessed the rise of highly desirable emergent capabilities, such as reasoning and in-context learning. However, vision models have yet to exhibit comparable progress in these areas. In this paper, we argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes, and such a gap can be narrowed through the design of a semantic-grounded objective. Specifically, we notice that individual words in natural language are inherently semantic, and modeling directly on word tokens naturally learns a realistic distribution. In contrast, ViTs rely on spatial patchification, which inevitably lacks semantic information. To bridge this gap, we propose to directly model "object" as the visual equivalence of "word," pushing the model to learn the global context and semantics among visual elements. We investigate our hypotheses via masked image modeling (MIM), a framework where our approach can be readily tested by applying masks to visual objects rather than random patches. Considerable evidence from qualitative and quantitative evaluations reveals a key finding: object-level representation alone helps to learn a real-world distribution, whereas pixel-averaging shortcuts are often learned without it. Moreover, further evaluations with multimodal LLMs (MLLM) on visual question answering (VQA, GQA, ScienceQA) tasks demonstrate the strong reasoning and contextual understanding gained with this simple objective. We hope our study highlights the effectiveness of object-level encoding and provides a plausible direction for developing stronger vision encoders and tokenizers. Code and model will be publicly released. Keywords: Semantic Visual Tokenizer, Vision Reasoning, In-context Learning, Multimodal Reasoning
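The abstract contrasts masking random patches with masking whole visual objects in masked image modeling. The following is a minimal sketch of that contrast, not the authors' implementation: it assumes a per-pixel segmentation map of integer object ids is already available, and the function names, the `min_overlap` threshold, and the toy 8x8 image are all hypothetical choices made for illustration.

```python
import random


def random_patch_mask(grid_h, grid_w, mask_ratio, seed=0):
    """Baseline MIM masking: hide a fixed fraction of patches chosen uniformly at random."""
    n = grid_h * grid_w
    n_mask = round(mask_ratio * n)
    chosen = set(random.Random(seed).sample(range(n), n_mask))
    return [[(r * grid_w + c) in chosen for c in range(grid_w)]
            for r in range(grid_h)]


def object_patch_mask(seg, patch, object_ids, min_overlap=0.5):
    """Object-level masking (assumed rule): hide every patch whose area is
    covered by the selected objects by at least `min_overlap`."""
    h, w = len(seg), len(seg[0])
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    ids = set(object_ids)
    mask = []
    for pr in range(h // patch):
        row = []
        for pc in range(w // patch):
            # Count pixels in this patch that belong to a selected object.
            hits = sum(
                seg[y][x] in ids
                for y in range(pr * patch, (pr + 1) * patch)
                for x in range(pc * patch, (pc + 1) * patch)
            )
            row.append(hits / (patch * patch) >= min_overlap)
        mask.append(row)
    return mask


# Toy 8x8 segmentation map: object 1 fills rows 0-3, columns 0-4; the rest is background (0).
seg = [[1 if (y < 4 and x < 5) else 0 for x in range(8)] for y in range(8)]
print(object_patch_mask(seg, patch=4, object_ids=[1]))
# [[True, False], [False, False]]
```

With random masking, the hidden patches are scattered and can often be filled in from neighboring pixels. With the object-level rule, the patches covering an object are hidden together, so a reconstruction has to draw on the surrounding scene instead.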

Related papers