Fugu-MT Paper Translation (Abstract): InstructX: Towards Unified Visual Editing with MLLM Guidance

Paper overview: InstructX: Towards Unified Visual Editing with MLLM Guidance

arxiv url: http://arxiv.org/abs/2510.08485v1
Date: Thu, 09 Oct 2025 17:26:09 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.245076
Title: InstructX: Towards Unified Visual Editing with MLLM Guidance
Authors: Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, Qian He
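Abstract summary: InstructX is a unified framework for image and video editing. The paper shows that training on image data can lead to emergent video editing capabilities without explicit supervision. By incorporating modality-specific MLLM features, the method effectively unifies image and video editing tasks within a single model.
Reference score (attention score computed independently by the site): 29.397808703869075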
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.

Related papers