Instruction-based Image Manipulation by Watching How Things Move
- URL: http://arxiv.org/abs/2412.12087v1
- Date: Mon, 16 Dec 2024 18:56:17 GMT
- Title: Instruction-based Image Manipulation by Watching How Things Move
- Authors: Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, Zhihao Xia
- Abstract summary: We create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
- Score: 35.44993722444448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics, such as non-rigid subject motion and complex camera movements, that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
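As a rough illustration of the sampling step described in the abstract, the sketch below pairs two nearby frames from a video and hands them to a placeholder MLLM call that is asked to describe the change as an editing instruction. The `describe_edit` hook, the frame-gap parameter, and the use of OpenCV are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of frame-pair sampling for instruction generation.
# `describe_edit` is a hypothetical placeholder for an MLLM prompt; it is
# not the paper's actual pipeline or any specific API.
import random
import cv2


def sample_frame_pair(video_path: str, max_gap: int = 30):
    """Sample two nearby frames as a (source, target) editing pair."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start = random.randint(0, max(0, total - max_gap - 1))
    gap = random.randint(1, max_gap)

    cap.set(cv2.CAP_PROP_POS_FRAMES, start)
    ok_a, source = cap.read()
    cap.set(cv2.CAP_PROP_POS_FRAMES, start + gap)
    ok_b, target = cap.read()
    cap.release()
    if not (ok_a and ok_b):
        raise RuntimeError(f"could not read frames from {video_path}")
    return source, target


def describe_edit(source, target) -> str:
    """Placeholder: prompt an MLLM with both frames and ask for an
    instruction describing how `source` was changed into `target`
    (e.g. 'turn the person's head toward the camera')."""
    raise NotImplementedError("plug in an MLLM of your choice here")


def build_training_triplet(video_path: str) -> dict:
    source, target = sample_frame_pair(video_path)
    return {"source": source,
            "instruction": describe_edit(source, target),
            "target": target}
```

Each resulting (source image, instruction, target image) triplet can then serve as one supervised example for an instruction-based editing model.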
Related papers
- ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions [48.20176284066248]
We introduce ByteMorph, a framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT). Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories.
arXiv Detail & Related papers (2025-06-03T17:39:47Z) - Subject-driven Video Generation via Disentangled Identity and Motion [52.54835936914813]
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot manner and without additional tuning.
Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings.
arXiv Detail & Related papers (2025-04-23T06:48:31Z) - Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance [2.5941932242768457]
Mask-guided video generation controls the generated video through mask motion sequences.
Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control.
This approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality.
arXiv Detail & Related papers (2025-03-24T06:53:08Z) - VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions.
VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z) - ObjectMover: Generative Object Movement with Video Prior [69.75281888309017]
We present ObjectMover, a generative model that can perform object movement in challenging scenes.
We show that with this approach, our model is able to adjust to complex real-world scenarios.
We propose a multi-task learning strategy that enables training on real-world video data to improve model generalization.
arXiv Detail & Related papers (2025-03-11T04:42:59Z) - Free-Form Motion Control: A Synthetic Video Generation Dataset with Controllable Camera and Object Motions [78.65431951506152]
We introduce a Synthetic dataset for Free-Form Motion Control (SynFMC).
The proposed SynFMC dataset includes diverse objects and environments and covers various motion patterns according to specific rules.
We further propose a method, Free-Form Motion Control (FMC), which enables independent or simultaneous control of object and camera movements.
arXiv Detail & Related papers (2025-01-02T18:59:45Z) - Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel Video Decomposition Prior (VDP) framework that derives inspiration from professional video editing practices. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. We address tasks such as video object segmentation, dehazing, and relighting.
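For intuition on the layered representation described above, the sketch below recomposes a frame from RGB layers and per-pixel opacities using generic back-to-front alpha compositing; the layer ordering, array shapes, and NumPy recomposition are assumptions for illustration, not the VDP implementation.

```python
# Generic "over" compositing of RGB layers with per-pixel opacities;
# illustrates a layered video representation, not the VDP framework's code.
import numpy as np


def composite_frame(layers: np.ndarray, alphas: np.ndarray) -> np.ndarray:
    """layers: (K, H, W, 3) RGB layers in back-to-front order, values in [0, 1].
    alphas: (K, H, W, 1) per-pixel opacities in [0, 1].
    Returns the composited (H, W, 3) frame."""
    frame = np.zeros_like(layers[0])
    for rgb, alpha in zip(layers, alphas):
        frame = alpha * rgb + (1.0 - alpha) * frame  # standard "over" operator
    return frame
```

Editing a single layer (for example, relighting the foreground) and recompositing in this way is the kind of layer-wise manipulation the abstract alludes to.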
arXiv Detail & Related papers (2024-12-06T10:35:45Z) - SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing [50.098005973600024]
We propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models. Experimental results demonstrate that SPAgent effectively coordinates models to generate or edit videos.
arXiv Detail & Related papers (2024-11-28T08:07:32Z) - Transforming Static Images Using Generative Models for Video Salient Object Detection [15.701293552584863]
We show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components.
This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements.
Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.
arXiv Detail & Related papers (2024-11-21T09:41:33Z) - SGEdit: Bridging LLM with Text2Image Generative Model for Scene Graph-based Image Editing [42.23117201457898]
We introduce a new framework that integrates a large language model (LLM) with a text-to-image generative model for scene graph-based image editing.
Our framework significantly outperforms existing image editing methods in terms of editing precision and scene aesthetics.
arXiv Detail & Related papers (2024-10-15T17:40:48Z) - InstructRL4Pix: Training Diffusion for Image Editing by Reinforcement Learning [31.799923647356458]
We propose a Reinforcement Learning Guided Image Editing Method (InstructRL4Pix) that trains a diffusion model to generate images guided by the attention maps of the target object.
Experimental results show that InstructRL4Pix overcomes the limitations of traditional datasets, using unsupervised learning to optimize editing goals and achieve accurate image editing from natural human commands.
arXiv Detail & Related papers (2024-06-14T12:31:48Z) - VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [108.60416277357712]
In this work, we introduce an object-centric framework designed both to control the object's appearance and, notably, to execute precise and explicit structural modifications on the object.
We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control.
We evaluate our method on the image-driven video editing task showing similar performance to the state-of-the-art, and showcasing novel shape-editing capabilities.
arXiv Detail & Related papers (2024-01-04T18:59:24Z) - SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models [91.22477798288003]
This paper introduces SmartEdit, a novel approach to instruction-based image editing.
It exploits Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities.
We show that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions.
arXiv Detail & Related papers (2023-12-11T17:54:11Z) - InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z) - Structure and Content-Guided Video Synthesis with Diffusion Models [13.464501385061032]
We present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output.
Our model is trained jointly on images and videos, which also exposes explicit control of temporal consistency through a novel guidance method.
arXiv Detail & Related papers (2023-02-06T18:50:23Z) - PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data [85.48684148629634]
We propose an approach to leverage synthetic scene data for improving video understanding.
We present a multi-task prompt learning approach for video transformers.
We show strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-12-08T18:55:31Z)