Fugu-MT Paper Translation (Abstract): LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models

Paper overview: LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models

arxiv url: http://arxiv.org/abs/2512.13290v1
Date: Mon, 15 Dec 2025 12:59:59 GMT
Status: Translation complete
Last updated in system: 2025-12-16 17:54:56.667794
Title: LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models
Title (reference translation): LINA: Adaptively Learning Interventions for Physical Alignment and Generalization in Diffusion Models
Authors: Shu Yu, Chaochao Lu
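Abstract summary: Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. The authors argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. The paper introduces LINA, a new framework that learns to predict prompt-specific interventions.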
Reference score (independently computed attention metric): 19.37375277387649
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination. We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions, which employs (1) targeted guidance in the prompt and visual latent spaces, and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. Our project page is at https://opencausalab.github.io/LINA.
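Abstract (reference translation): Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they struggle with (1) physical alignment and (2) following out-of-distribution (OOD) instructions. We argue that these problems arise because the models fail to learn causal directions and to disentangle causal factors for novel recombination. To enable diagnostic interventions, we introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset. The analysis yields three key insights. First, DMs struggle with multi-hop reasoning about elements that the prompt does not explicitly determine. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is established disproportionately in the initial, computationally limited denoising steps. Based on these findings, we introduce LINA (Learning INterventions Adaptively), a new framework that learns to predict prompt-specific interventions using (1) targeted guidance in the prompt and visual latent spaces and (2) a reallocated, causality-aware denoising schedule. The approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset. The project page is at https://opencausalab.github.io/LINA.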
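The abstract's third insight (causal structure forms mostly in the initial denoising steps) motivates a "reallocated" denoising schedule. The abstract gives no details of LINA's actual schedule, so the following is only a minimal illustrative sketch of the general idea: spend a larger share of a fixed step budget on the early, high-noise part of the trajectory. The function name `reallocated_timesteps` and the parameters `early_fraction` and `early_budget` are hypothetical and do not come from the paper.

```python
def _linspace(start, stop, n, endpoint=True):
    """Return n evenly spaced floats from start toward stop."""
    if n <= 0:
        return []
    if n == 1:
        return [float(start)]
    div = n - 1 if endpoint else n
    step = (stop - start) / div
    return [start + i * step for i in range(n)]


def reallocated_timesteps(num_steps, num_train_timesteps=1000,
                          early_fraction=0.3, early_budget=0.6):
    """Hypothetical schedule: descending timesteps, dense in the early phase.

    early_fraction: share of the noise range treated as "early" (high noise).
    early_budget:   share of the step budget spent inside that early range.
    Both knobs are illustrative assumptions, not values from the paper.
    """
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if not (0.0 < early_fraction < 1.0 and 0.0 < early_budget < 1.0):
        raise ValueError("fractions must lie strictly between 0 and 1")

    t_max = num_train_timesteps - 1
    boundary = (1.0 - early_fraction) * t_max  # where the early phase ends

    # Keep at least one step on each side of the boundary.
    n_early = min(max(round(num_steps * early_budget), 1), num_steps - 1)
    n_late = num_steps - n_early

    early = _linspace(t_max, boundary, n_early, endpoint=False)
    late = _linspace(boundary, 0.0, n_late, endpoint=True)
    return [round(t) for t in early + late]


if __name__ == "__main__":
    # A uniform 10-step schedule would place about 3 steps in the top 30%
    # of the noise range; this sketch places 6 of the 10 there.
    print(reallocated_timesteps(10))
```

With a very large step budget, rounding can produce repeated timesteps, so a real sampler integration would need to deduplicate them or work with continuous times.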

Related papers