Fugu-MT paper translation (abstract): Language-Guided Abstraction for Visual Reasoning

Paper summary: Language-Guided Abstraction for Visual Reasoning

arxiv url: http://arxiv.org/abs/2606.12847v1
Date: Thu, 11 Jun 2026 03:22:00 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.559389
Title: Language-Guided Abstraction for Visual Reasoning
Authors: Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang
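Abstract summary: This paper proposes a novel framework that enhances visual reasoning through language-guided learning with a privileged-information branch. Specifically, a Semantic Compression Module is designed by feeding a unified, task-agnostic prompt into DeepSeek-V3. A Cross-Attention Projector is also designed to align visual features with semantic embeddings, with the aim of guiding the training of the ARC model.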
Reference score (independently calculated attention level): 8.097020439992205
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methods are either purely language-based or vision-only (i.e., VARC). The former depend heavily on LLMs with billions of parameters. The latter often struggle to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module that feeds a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC data (a crowd-sourced language description dataset) is substantially refined and structured to fit the context length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, with the aim of guiding the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.
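The training-only language branch described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the attention direction (visual tokens as queries, semantic embeddings as keys and values), the identity projections, the cosine-distance alignment loss, the loss weight, and all function names below are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Assumption: identity projections stand in for learned weight matrices.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return attended


def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def alignment_loss(visual_tokens, text_tokens):
    """Hypothetical alignment loss for the privileged language branch."""
    attended = cross_attention(visual_tokens, text_tokens, text_tokens)
    return cosine_distance(mean_pool(visual_tokens), mean_pool(attended))


def training_loss(task_loss, visual_tokens, text_tokens=None, weight=0.5):
    """Task loss plus a weighted alignment term.

    Passing text_tokens=None mimics dropping the language branch, so only
    the visual path contributes. The weight of 0.5 is an arbitrary choice.
    """
    if text_tokens is None:
        return task_loss
    return task_loss + weight * alignment_loss(visual_tokens, text_tokens)
```

In this sketch the language description only affects the loss, never the forward pass of the visual model, so removing it at inference leaves the model's parameter count unchanged. That is the general property of Learning Using Privileged Information that the abstract relies on.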
