Fugu-MT Paper Translation (Abstract): Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

Paper overview: Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

arxiv url: http://arxiv.org/abs/2510.08470v1
Date: Thu, 09 Oct 2025 17:10:36 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.23263
Title: Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Title (reference translation): Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling
Authors: Bianca-Mihaela Ganescu, Suchir Salhan, Andrew Caines, Paula Buttery
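Abstract summary: Training vision-language models on cognitively plausible amounts of data requires rethinking how models integrate multimodal information. This paper proposes a lightweight decoder architecture that uses token-wise dynamic gating.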
Reference score (independently computed attention score): 3.5408685781175016
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track for the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture with (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the utility of limited visual information and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows competitive or superior performance to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.
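Abstract (reference translation): Training vision-language models on cognitively plausible amounts of data requires rethinking how models integrate multimodal information. Within the constraints of the Vision track of the BabyLM Challenge 2025, we propose a lightweight decoder-based architecture comprising (1) token-wise dynamic gating for adaptive fusion of linguistic and visual cues, (2) feature modulation and channel attention to maximise the usefulness of limited visual information, and (3) auxiliary contrastive objectives for visual grounding. Evaluation on five benchmarks (BLiMP, BLiMP Supplement, EWoK, Winoground and VQA) shows performance competitive with or superior to multimodal baselines. More notably, our dynamic gate discovers interpretable patterns without explicit supervision, favouring visual cues for content words and linguistic cues for function words. While we identify limitations in the Challenge constraints, such as the information bottleneck created by global image embeddings and training instability arising from the dataset split, our findings establish dynamic gating as a powerful tool for efficient multimodal learning, offering both interpretability and performance even under severe constraints.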
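The abstract names token-wise dynamic gating but does not give its formulation. The sketch below shows one common way such a gate can be built, assuming a single scalar gate per token computed from the token's hidden state and one global image embedding. The function name `token_wise_gate`, the parameters `gate_weights` and `gate_bias`, and the sigmoid-over-concatenation form are illustrative assumptions, not the authors' implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function mapping a real number into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def token_wise_gate(text_states, image_embedding, gate_weights, gate_bias):
    """Fuse each token state with one global image embedding.

    Hypothetical formulation (an assumption, not taken from the paper):
        g_t     = sigmoid(w . [h_t ; v] + b)    # one scalar gate per token
        fused_t = g_t * v + (1 - g_t) * h_t     # convex mix of the two cues

    text_states:     list of token vectors h_t, each of length d
    image_embedding: global image vector v of length d
    gate_weights:    weight vector w of length 2 * d
    gate_bias:       scalar bias b
    """
    fused, gates = [], []
    for h_t in text_states:
        concat = h_t + image_embedding  # list concatenation: [h_t ; v]
        score = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        g_t = sigmoid(score)
        fused.append([g_t * v_i + (1.0 - g_t) * h_i
                      for h_i, v_i in zip(h_t, image_embedding)])
        gates.append(g_t)
    return fused, gates


if __name__ == "__main__":
    # Toy inputs: two tokens with 2-dimensional states and a 2-dimensional image vector.
    states = [[1.0, 2.0], [0.0, 0.0]]
    image = [3.0, 4.0]
    fused, gates = token_wise_gate(states, image, [0.0, 0.0, 0.0, 0.0], 0.0)
    for t, (g, f) in enumerate(zip(gates, fused)):
        print(f"token {t}: gate={g:.2f} fused={f}")
```

In this sketch a gate value near 1 means the token leans on the visual vector and a value near 0 means it leans on its linguistic state. Because the gate is a single number per token, it can be read off directly for each word, which is the kind of inspection that would reveal a content-word versus function-word pattern like the one the abstract reports.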

Related papers