Fugu-MT Paper Translation (Abstract): DiLA: Disentangled Latent Action World Models

Paper overview: DiLA: Disentangled Latent Action World Models

arxiv url: http://arxiv.org/abs/2605.15725v1
Date: Fri, 15 May 2026 08:22:37 GMT
Status: Translation complete
Last updated in system: 2026-05-18 17:44:16.311511
Title: DiLA: Disentangled Latent Action World Models
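Authors: Tianqiu Zhang, Muyang Lyu, Yufan Zhang, Fang Fang, Si Wu
Abstract summary: Latent Action Models (LAMs) enable the learning of world models from unlabeled video. LAMs face a fundamental trade-off between action abstraction and generation fidelity. The paper introduces DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement.
Reference score (independently computed attention score): 11.259992289079534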
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

Related papers