Fugu-MT Paper Translation (Abstract): Rethinking Visual Intelligence: Insights from Video Pretraining

Paper abstract: Rethinking Visual Intelligence: Insights from Video Pretraining

arxiv url: http://arxiv.org/abs/2510.24448v1
Date: Tue, 28 Oct 2025 14:12:11 GMT
Status: Translation complete
Last updated in system: 2025-10-29 15:35:37.222884
Title: Rethinking Visual Intelligence: Insights from Video Pretraining
Title (reference translation): Rethinking Visual Intelligence: Insights from Video Pretraining
Authors: Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro
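Abstract summary: Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems. This paper investigates Video Diffusion Models (VDMs) as a promising direction for bridging the gap between the language and visual domains.
Reference score (attention measure computed independently by this site): 75.32388528274224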
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
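Abstract (reference translation): Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. However, this success has not carried over as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. In this paper, we investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data gives these models strong inductive biases for structure and dynamics. We therefore design a controlled evaluation in which both an LLM and a VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. On benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs show higher data efficiency than their language counterparts. These results indicate that video pretraining provides inductive biases that support progress toward visual foundation models.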

List of related papers