ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions
- URL: http://arxiv.org/abs/2506.03107v2
- Date: Wed, 11 Jun 2025 06:11:36 GMT
- Title: ByteMorph: Benchmarking Instruction-Guided Image Editing with Non-Rigid Motions
- Authors: Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, Peng Wang
- Abstract summary: We introduce ByteMorph, a framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT). Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories.
- Score: 48.20176284066248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Editing images with instructions to reflect non-rigid motions, camera viewpoint shifts, object deformations, human articulations, and complex interactions poses a challenging yet underexplored problem in computer vision. Existing approaches and datasets predominantly focus on static scenes or rigid transformations, limiting their capacity to handle expressive edits involving dynamic motion. To address this gap, we introduce ByteMorph, a comprehensive framework for instruction-based image editing with an emphasis on non-rigid motions. ByteMorph comprises a large-scale dataset, ByteMorph-6M, and a strong baseline model built upon the Diffusion Transformer (DiT), named ByteMorpher. ByteMorph-6M includes over 6 million high-resolution image editing pairs for training, along with a carefully curated evaluation benchmark, ByteMorph-Bench. Both capture a wide variety of non-rigid motion types across diverse environments, human figures, and object categories. The dataset is constructed using motion-guided data generation, layered compositing techniques, and automated captioning to ensure diversity, realism, and semantic coherence. We further conduct a comprehensive evaluation of recent instruction-based image editing methods from both academic and commercial domains.
Related papers
- ObjectMover: Generative Object Movement with Video Prior [69.75281888309017]
We present ObjectMover, a generative model that can perform object movement in challenging scenes. We show that with this approach, our model is able to adjust to complex real-world scenarios. We propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization.
arXiv Detail & Related papers (2025-03-11T04:42:59Z)
- Instance-Level Moving Object Segmentation from a Single Image with Events [84.12761042512452]
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects. Previous methods encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion. Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities. We propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues.
arXiv Detail & Related papers (2025-02-18T15:56:46Z)
- Edit as You See: Image-guided Video Editing via Masked Motion Modeling [18.89936405508778]
We propose a novel Image-guided Video Editing Diffusion model, termed IVEDiff. IVEDiff is built on top of image editing models, and is equipped with learnable motion modules to maintain the temporal consistency of edited video. Our method is able to generate temporally smooth edited videos while robustly dealing with various editing objects with high quality.
arXiv Detail & Related papers (2025-01-08T07:52:12Z)
- Instruction-based Image Manipulation by Watching How Things Move [35.44993722444448]
We create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
arXiv Detail & Related papers (2024-12-16T18:56:17Z)
- BrushEdit: All-In-One Image Inpainting and Editing [76.93556996538398]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm. We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model. Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- ObjectStitch: Generative Object Compositing [43.206123360578665]
We propose a self-supervised framework for object compositing using conditional diffusion models.
Our framework can transform the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling.
Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images.
arXiv Detail & Related papers (2022-12-02T02:15:13Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)