ObjectMover: Generative Object Movement with Video Prior
- URL: http://arxiv.org/abs/2503.08037v1
- Date: Tue, 11 Mar 2025 04:42:59 GMT
- Title: ObjectMover: Generative Object Movement with Video Prior
- Authors: Xin Yu, Tianyu Wang, Soo Ye Kim, Paul Guerrero, Xi Chen, Qing Liu, Zhe Lin, Xiaojuan Qi,
- Abstract summary: We present ObjectMover, a generative model that can perform object movement in challenging scenes.<n>We show that with this approach, our model is able to adjust to complex real-world scenarios.<n>We propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization.
- Score: 69.75281888309017
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Simple as it seems, moving an object to another location within an image is, in fact, a challenging image-editing task that requires re-harmonizing the lighting, adjusting the pose based on perspective, accurately filling occluded regions, and ensuring coherent synchronization of shadows and reflections while maintaining the object identity. In this paper, we present ObjectMover, a generative model that can perform object movement in highly challenging scenes. Our key insight is that we model this task as a sequence-to-sequence problem and fine-tune a video generation model to leverage its knowledge of consistent object generation across video frames. We show that with this approach, our model is able to adjust to complex real-world scenarios, handling extreme lighting harmonization and object effect movement. As large-scale data for object movement are unavailable, we construct a data generation pipeline using a modern game engine to synthesize high-quality data pairs. We further propose a multi-task learning strategy that enables training on real-world video data to improve the model generalization. Through extensive experiments, we demonstrate that ObjectMover achieves outstanding results and adapts well to real-world scenarios.
Related papers
- MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World [28.104534788009865]
Diffusion models often fail to fully adhere to physical laws.
We introduce a three-stage training curriculum to develop the MirrorFusion 2.0 model to improve real-world performance.
arXiv Detail & Related papers (2025-04-21T19:01:02Z) - Instance-Level Moving Object Segmentation from a Single Image with Events [84.12761042512452]
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects.<n>Previous methods encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion.<n>Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities.<n>We propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues.
arXiv Detail & Related papers (2025-02-18T15:56:46Z) - Transforming Static Images Using Generative Models for Video Salient Object Detection [15.701293552584863]
We show that image-to-video diffusion models can generate realistic transformations of static images while understanding the contextual relationships between image components.
This ability allows the model to generate plausible optical flows, preserving semantic integrity while reflecting the independent motion of scene elements.
Our approach achieves state-of-the-art performance across all public benchmark datasets, outperforming existing approaches.
arXiv Detail & Related papers (2024-11-21T09:41:33Z) - Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z) - ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object
Video Generation [26.292052071093945]
We propose an unsupervised method to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen realistic object-to-object interactions.
We show that YODA is on par with or better than state of the art video generation prior work in terms of both controllability and video quality.
arXiv Detail & Related papers (2023-06-06T19:50:02Z) - ObjectStitch: Generative Object Compositing [43.206123360578665]
We propose a self-supervised framework for object compositing using conditional diffusion models.
Our framework can transform the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling.
Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images.
arXiv Detail & Related papers (2022-12-02T02:15:13Z) - Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.