InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction
- URL: http://arxiv.org/abs/2503.20287v1
- Date: Wed, 26 Mar 2025 07:30:58 GMT
- Title: InsViE-1M: Effective Instruction-based Video Editing with Elaborate Dataset Construction
- Authors: Yuhui Wu, Liyi Chen, Ruibin Li, Shihao Wang, Chenxi Xie, Lei Zhang
- Abstract summary: We present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training.
- Score: 10.855393943204728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-based video editing allows effective and interactive editing of videos using only instructions, without extra inputs such as masks or attributes. However, collecting high-quality training triplets (source video, edited video, instruction) is a challenging task. Existing datasets mostly consist of low-resolution, short-duration source videos in limited quantities, with unsatisfactory editing quality, limiting the performance of trained editing models. In this work, we present a high-quality Instruction-based Video Editing dataset with 1M triplets, namely InsViE-1M. We first curate high-resolution and high-quality source videos and images, then design an effective editing-filtering pipeline to construct high-quality editing triplets for model training. For a source video, we generate multiple edited samples of its first frame with different intensities of classifier-free guidance, which are automatically filtered by GPT-4o with carefully crafted guidelines. The edited first frame is propagated to subsequent frames to produce the edited video, followed by another round of filtering for frame quality and motion evaluation. We also generate and filter a variety of video editing triplets from high-quality images. With the InsViE-1M dataset, we propose a multi-stage learning strategy to train our InsViE model, progressively enhancing its instruction following and editing ability. Extensive experiments demonstrate the advantages of our InsViE-1M dataset and the trained model over state-of-the-art works. Code is available at InsViE.
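The abstract outlines a multi-step editing-filtering procedure: edit the first frame several times at different classifier-free guidance (CFG) intensities, let GPT-4o filter the candidates, propagate the chosen edit through the video, then apply a second frame-quality and motion filter. Below is a minimal control-flow sketch of that procedure; every helper function and the CFG scale values are hypothetical placeholders for illustration, not the authors' released implementation.

```python
# Illustrative sketch of the editing-filtering pipeline described in the abstract.
# The four helpers below are hypothetical stubs marking where an image editor,
# GPT-4o judging, first-frame propagation, and quality/motion scoring would plug in.
from typing import List, Optional, Tuple


def edit_first_frame(frame, instruction: str, cfg_scale: float):
    """Stub: instruction-based editing of the first frame at a given CFG scale."""
    raise NotImplementedError


def gpt4o_select(source_frame, candidates: List, instruction: str) -> Optional[int]:
    """Stub: GPT-4o picks the best candidate index under crafted guidelines, or None."""
    raise NotImplementedError


def propagate_edit(source_video: List, edited_first_frame) -> List:
    """Stub: first-frame-guided propagation of the edit to the remaining frames."""
    raise NotImplementedError


def passes_frame_and_motion_check(source_video: List, edited_video: List) -> bool:
    """Stub: second filtering round on per-frame quality and motion consistency."""
    raise NotImplementedError


def build_triplet(source_video: List, instruction: str,
                  cfg_scales=(3.0, 5.0, 7.5)  # illustrative values only
                  ) -> Optional[Tuple[List, List, str]]:
    first_frame = source_video[0]
    # 1) Edit the first frame several times with different CFG intensities.
    candidates = [edit_first_frame(first_frame, instruction, s) for s in cfg_scales]
    # 2) GPT-4o filtering: keep the best candidate, or drop the sample entirely.
    best = gpt4o_select(first_frame, candidates, instruction)
    if best is None:
        return None
    # 3) Propagate the chosen first-frame edit to the rest of the video.
    edited_video = propagate_edit(source_video, candidates[best])
    # 4) Second filtering round on frame quality and motion before keeping the triplet.
    if not passes_frame_and_motion_check(source_video, edited_video):
        return None
    return source_video, edited_video, instruction
```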
Related papers
- VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists [17.451911831989293]
We introduce Señorita-2M, a high-quality video editing dataset. It is built by crafting four high-quality, specialized video editing models. We propose a filtering pipeline to eliminate poorly edited video pairs.
arXiv Detail & Related papers (2025-02-10T17:58:22Z) - DIVE: Taming DINO for Subject-Driven Video Editing [49.090071984272576]
DINO-guided Video Editing (DIVE) is a framework designed to facilitate subject-driven editing in source videos. DIVE employs DINO features to align with the motion trajectory of the source video. For precise subject editing, DIVE incorporates the DINO features of reference images into a pretrained text-to-image model.
arXiv Detail & Related papers (2024-12-04T14:28:43Z) - AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea [88.79769371584491]
We present AnyEdit, a comprehensive multi-modal instruction editing dataset. We ensure the diversity and quality of the AnyEdit collection through three aspects: initial data diversity, adaptive editing process, and automated selection of editing results. Experiments on three benchmark datasets show that AnyEdit consistently boosts the performance of diffusion-based editing models.
arXiv Detail & Related papers (2024-11-24T07:02:56Z) - Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection [60.47731445033151]
We propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D image text-to-image (T2I) diffusion model.
Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.
arXiv Detail & Related papers (2024-05-27T04:44:36Z) - I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z) - AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks [41.640692114423544]
We introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing.
AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks.
Our evaluation shows that AnyV2V achieves CLIP scores comparable to other baseline methods (a minimal sketch of this metric is given after the list).
arXiv Detail & Related papers (2024-03-21T15:15:00Z) - EffiVED:Efficient Video Editing via Text-instruction Diffusion Models [9.287394166165424]
We introduce EffiVED, an efficient diffusion-based model that supports instruction-guided video editing.
We transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED.
arXiv Detail & Related papers (2024-03-18T08:42:08Z) - VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
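Several entries above report CLIP scores as a text-alignment metric for edited videos (e.g., the AnyV2V evaluation). As a minimal sketch only, assuming the Hugging Face `transformers` CLIP implementation and an illustrative checkpoint, a per-frame CLIP score can be computed as the cosine similarity between an edited frame and the target caption:

```python
# Minimal frame-level CLIP-score sketch, assuming the Hugging Face `transformers`
# CLIP implementation; the checkpoint below is illustrative, not one specified
# by the papers listed above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(frame: Image.Image, caption: str) -> float:
    """Cosine similarity between one edited frame and the target edit caption."""
    inputs = processor(text=[caption], images=[frame],
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```

Averaging `clip_score` over all frames of an edited clip gives a simple video-level text-alignment number of the kind reported above.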