Fugu-MT Paper Translation (Abstract): From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models

Paper summary: From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models

arxiv url: http://arxiv.org/abs/2506.07280v2
Date: Tue, 10 Jun 2025 11:37:10 GMT
Status: Translation complete
Last updated in system: 2025-06-11 12:52:34.297161
Title: From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models
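Title (reference translation): From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models
Authors: Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro
Abstract summary: Video Diffusion Models (VDMs) have emerged as powerful generative tools capable of synthesizing high-quality content. The authors argue that VDMs naturally internalize structured representations and an implicit understanding of the visual world. The proposed method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences.
Reference score (independently computed attention metric): 65.0487600936788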
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Diffusion Models (VDMs) have emerged as powerful generative tools, capable of synthesizing high-quality spatiotemporal content. Yet, their potential goes far beyond mere video generation. We argue that the training dynamics of VDMs, driven by the need to model coherent sequences, naturally push them to internalize structured representations and an implicit understanding of the visual world. To probe the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that repurposes VDMs for new tasks using only a handful of examples. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without altering the generative interface of a frozen VDM. Despite minimal supervision, the model exhibits strong generalization across diverse tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, on ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners with the potential to serve as the backbone for future foundation models in vision.
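Abstract (reference translation): Video Diffusion Models (VDMs) have emerged as powerful generative tools capable of synthesizing high-quality spatiotemporal content. However, their potential is not limited to mere video generation. We argue that, because of the need to model coherent sequences, the training dynamics of VDMs naturally promote the internalization of structured representations and an implicit understanding of the visual world. To investigate the extent of this internal knowledge, we introduce a few-shot fine-tuning framework that applies VDMs to new tasks using only a small number of examples. The method transforms each task into a visual transition, enabling the training of LoRA weights on short input-output sequences without changing the generative interface of the frozen VDM. Despite minimal supervision, the model shows strong generalization across a variety of tasks, from low-level vision (for example, segmentation and pose estimation) to high-level reasoning (for example, ARC-AGI). These results reframe VDMs as more than generative engines. They are adaptable visual learners that may serve as the backbone of future foundation models in vision.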
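
The two ideas named in the abstract, casting a task as a visual transition and training LoRA weights on top of a frozen model, can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the abstract does not say how the transition frames are built (a linear cross-fade is used below), and `visual_transition` and `LoRALinear` are hypothetical names, not the authors' code. The low-rank update follows the standard LoRA form W + (alpha / r) * B * A, with B initialised to zero so the adapted layer starts out identical to the frozen one.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def visual_transition(src, dst, num_frames):
    """Build a short frame sequence that moves from src to dst.

    ASSUMPTION: a linear cross-fade; the abstract only says that each task
    is turned into a visual transition, not how the frames are built.
    """
    if num_frames < 2:
        raise ValueError("need at least the input frame and the output frame")
    frames = []
    for t in range(num_frames):
        w = t / (num_frames - 1)  # 0.0 at the input frame, 1.0 at the target
        frames.append(
            [[(1 - w) * s + w * d for s, d in zip(row_s, row_d)]
             for row_s, row_d in zip(src, dst)]
        )
    return frames


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (hypothetical sketch)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight            # frozen base weight, never updated
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A gets a small random init, B starts at zero, so the update is zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def effective_weight(self):
        """Return W + scale * (B @ A) without modifying the frozen W."""
        delta = matmul(self.B, self.A)
        return [[w + self.scale * d for w, d in zip(row_w, row_d)]
                for row_w, row_d in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the adapted layer to a batch of row vectors: y = x @ W_eff^T."""
        w_eff_t = [list(col) for col in zip(*self.effective_weight())]
        return matmul(x, w_eff_t)


if __name__ == "__main__":
    clip = visual_transition([[0.0, 0.0]], [[1.0, 2.0]], num_frames=3)
    print(len(clip), "frames; middle frame:", clip[1])
    layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1)
    print("output before any training:", layer.forward([[3.0, 4.0]]))
```

In a real few-shot setup only `A` and `B` would receive gradient updates while `weight` stays fixed, which is what keeps the number of trainable parameters small enough for a handful of examples.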


Related papers