Fugu-MT paper translation (abstract): InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

Paper overview: InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation

arxiv url: http://arxiv.org/abs/2604.08646v1
Date: Thu, 09 Apr 2026 17:59:02 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.517908
Title: InsEdit: Towards Instruction-based Visual Editing via Data-Efficient Video Diffusion Models Adaptation
Title (reference translation): InsEdit: Towards Instruction-Based Visual Editing with Data-Efficient Video Diffusion Models
Authors: Zhefan Rao, Bin Zou, Haoxuan Che, Xuanhua He, Chong Hou Choi, Yanheng Li, Rui Liu, Qifeng Chen
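Abstract summary: This paper introduces InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA). With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks.
Reference score (independently computed attention score): 47.1844759979843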
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears data-hungry. At the same time, high-quality video editing data remains scarce. In this paper, we show that a video generation backbone can become a strong video editor without large-scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA), which creates aligned video pairs where edits can begin in the middle of a clip rather than only from the first frame. With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because our training recipe also includes image editing data, the final model supports image editing without any modification.
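Abstract (reference translation): Instruction-based video editing is a natural way to control video content with text, but adapting a video generation model into an editor usually appears data-hungry. At the same time, high-quality video editing data remains scarce. This paper shows that a video generation backbone can become a strong video editor without large-scale video editing data. We present InsEdit, an instruction-based editing model built on HunyuanVideo-1.5. InsEdit combines a visual editing architecture with a video data pipeline based on Mutual Context Attention (MCA). With only O(100)K video editing data, InsEdit achieves state-of-the-art results among open-source methods on our video instruction editing benchmarks. In addition, because the training recipe also includes image editing data, the final model supports image editing without any modification.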
