Related papers: HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

URL: http://arxiv.org/abs/2406.07754v2
Date: Fri, 08 Nov 2024 21:35:16 GMT
Title: HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness
Authors: Zihui Xue, Mi Luo, Changan Chen, Kristen Grauman
Abstract summary: We present HOI-Swap, a video editing framework trained in a self-supervised manner. The first stage focuses on object swapping in a single frame with HOI awareness. The second stage extends the single-frame edit across the entire sequence.
Score: 57.18183962641015
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, these models often fall short in handling the intricacies of hand-object interactions (HOI), failing to produce realistic edits -- especially when object swapping results in object shape or functionality changes. To bridge this gap, we present HOI-Swap, a novel diffusion-based video editing framework trained in a self-supervised manner. Designed in two stages, the first stage focuses on object swapping in a single frame with HOI awareness; the model learns to adjust the interaction patterns, such as the hand grasp, based on changes in the object's properties. The second stage extends the single-frame edit across the entire sequence; we achieve controllable motion alignment with the original video by: (1) warping a new sequence from the stage-I edited frame based on sampled motion points and (2) conditioning video generation on the warped sequence. Comprehensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms existing methods, delivering high-quality video edits with realistic HOIs.
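The second stage described in the abstract (warping a new sequence from the stage-I edited frame using sampled motion points) can be illustrated with a deliberately simplified sketch. It is not the authors' implementation. It assumes grayscale frames stored as nested lists and a translation-only warp driven by the mean displacement of the sampled points. The names `mean_shift`, `translate`, and `warp_sequence` are hypothetical, and the diffusion model that would be conditioned on the warped sequence is omitted.

```python
from typing import List, Tuple

Frame = List[List[int]]   # grayscale image as rows of pixel values (assumption)
Point = Tuple[int, int]   # (row, col) position of one sampled motion point


def mean_shift(src: List[Point], dst: List[Point]) -> Tuple[int, int]:
    """Average displacement of the sampled points, rounded to whole pixels."""
    n = len(src)
    dr = round(sum(d[0] - s[0] for s, d in zip(src, dst)) / n)
    dc = round(sum(d[1] - s[1] for s, d in zip(src, dst)) / n)
    return dr, dc


def translate(frame: Frame, dr: int, dc: int, fill: int = 0) -> Frame:
    """Shift a frame by (dr, dc); pixels with no source pixel are set to `fill`."""
    h, w = len(frame), len(frame[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            sr, sc = r - dr, c - dc
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = frame[sr][sc]
    return out


def warp_sequence(edited: Frame, tracks: List[List[Point]]) -> List[Frame]:
    """Warp the edited frame once per time step.

    tracks[t] holds the sampled points' positions in frame t;
    tracks[0] corresponds to the edited frame itself.
    """
    return [translate(edited, *mean_shift(tracks[0], pts)) for pts in tracks]


# Toy usage: one bright pixel follows a single tracked point.
edited = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
tracks = [[(1, 1)], [(1, 2)], [(2, 2)]]
sequence = warp_sequence(edited, tracks)
```

In this toy setting the warped sequence only carries coarse motion. In the framework the abstract describes, the warped sequence serves as a conditioning signal for video generation rather than as the final output.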

Related papers

MotionEdit: Benchmarking and Learning Motion-Centric Image Editing [81.28392925790568]
We introduce MotionEdit, a novel dataset for motion-centric image editing. MotionEdit provides high-fidelity image pairs depicting realistic motion transformations extracted from continuous videos. We propose MotionNFT to compute motion alignment rewards based on how well the motion flow between input and model-edited images matches the ground-truth motion.
arXiv Detail & Related papers (2025-12-11T04:53:58Z)
Generative Video Motion Editing with 3D Point Tracks [66.55707897151909]
We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a model on a source video and paired 3D point tracks representing source and target motions. Our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation.
arXiv Detail & Related papers (2025-12-01T18:59:55Z)
Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model [72.90370736032115]
We present a novel video Reenactment framework focusing on Human-Object Interaction (HOI) via an adaptive layout-instructed Diffusion model (Re-HOLD). Our key insight is to employ specialized layout representation for hands and objects, respectively. To further improve the generation quality of HOI, we design an interactive textural enhancement module for both hands and objects.
arXiv Detail & Related papers (2025-03-21T08:40:35Z)
Edit as You See: Image-guided Video Editing via Masked Motion Modeling [18.89936405508778]
We propose a novel Image-guided Video Editing Diffusion model, termed IVEDiff. IVEDiff is built on top of image editing models, and is equipped with learnable motion modules to maintain the temporal consistency of edited video. Our method is able to generate temporally smooth edited videos while robustly dealing with various editing objects with high quality.
arXiv Detail & Related papers (2025-01-08T07:52:12Z)
Temporally Consistent Object Editing in Videos using Extended Attention [9.605596668263173]
We propose a method to edit videos using a pre-trained inpainting image diffusion model. We ensure that the edited information will be consistent across all the video frames.
arXiv Detail & Related papers (2024-06-01T02:31:16Z)
MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion [94.66090422753126]
MotionFollower is a lightweight score-guided diffusion model for video motion editing. It delivers superior motion editing performance and exclusively supports large camera movements and actions. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory.
arXiv Detail & Related papers (2024-05-30T17:57:30Z)
Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection [60.47731445033151]
We propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D image text-to-image (T2I) diffusion model. Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.
arXiv Detail & Related papers (2024-05-27T04:44:36Z)
Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing [46.56615725175025]
We introduce Edit-Your-Motion, a video motion editing method that tackles unseen challenges through one-shot fine-tuning. To effectively decouple motion and appearance of source video, we design a temporal-two-stage learning strategy. With Edit-Your-Motion, users can edit the motion of humans in the source video, creating more engaging and diverse content.
arXiv Detail & Related papers (2024-05-07T17:06:59Z)
GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models [2.362412515574206]
We propose "GenVideo" for editing videos leveraging target-image aware T2I models. Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit.
arXiv Detail & Related papers (2024-04-18T23:25:27Z)
Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models [52.28245595257831]
We show that despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.
arXiv Detail & Related papers (2024-04-08T13:40:01Z)
VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [108.60416277357712]
In this work, we introduce a framework that is object-centric and is designed to control both the object's appearance and, notably, to execute precise and explicit structural modifications on the object. We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control. We evaluate our method on the image-driven video editing task showing similar performance to the state-of-the-art, and showcasing novel shape-editing capabilities.
arXiv Detail & Related papers (2024-01-04T18:59:24Z)
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence [37.85691662157054]
Video editing approaches that rely on dense correspondences are ineffective when the target edit involves a shape change. We introduce the VideoSwap framework, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape. Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.
arXiv Detail & Related papers (2023-12-04T17:58:06Z)
Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video. The framework consists of two stages: (1) inflating the 2D model into the 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into the noise and editing with target text prompt and attention map injection. We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.