Motion-Aware Concept Alignment for Consistent Video Editing
- URL: http://arxiv.org/abs/2506.01004v1
- Date: Sun, 01 Jun 2025 13:28:04 GMT
- Title: Motion-Aware Concept Alignment for Consistent Video Editing
- Authors: Tong Zhang, Juan C Leon Alcazar, Bernard Ghanem
- Abstract summary: We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA-Video's performance using standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
- Score: 57.08108545219043
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video, while preserving the original motion and visual context. Our approach leverages a diagonal denoising schedule and class-agnostic segmentation to detect and track objects in the latent space and precisely control the spatial location of the blended objects. To ensure temporal coherence, we incorporate momentum-based semantic corrections and gamma residual noise stabilization for smooth frame transitions. We evaluate MoCA-Video's performance using standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames. Using a self-constructed dataset, MoCA-Video outperforms current baselines, achieving superior spatial consistency, coherent motion, and a significantly higher CASS score, despite requiring no training or fine-tuning. MoCA-Video demonstrates that structured manipulation of the diffusion noise trajectory allows for controllable, high-quality video synthesis.
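The momentum-based semantic correction and gamma residual noise stabilization are only described at a high level in the abstract; the sketch below is one plausible reading, not the authors' implementation. The function name, its signature, and the exact update rule (an exponential-moving-average correction applied inside a tracked object mask and damped by a gamma factor) are assumptions made purely for illustration.

```python
import torch

def momentum_semantic_correction(latents, ref_latent, masks, momentum=0.9, gamma=0.7):
    """Illustrative sketch (assumed, not the paper's code): inject a reference
    concept into the masked object region of each frame latent, smoothing the
    per-frame correction with a momentum term and damping the residual with a
    gamma factor so the noise statistics stay close to the source trajectory."""
    velocity = torch.zeros_like(latents[0])   # running momentum state of the correction
    edited = []
    for z, m in zip(latents, masks):          # z: (C, H, W) frame latent, m: (1, H, W) object mask
        correction = m * (ref_latent - z)     # semantic shift toward the reference, inside the mask only
        velocity = momentum * velocity + (1.0 - momentum) * correction
        edited.append(z + gamma * velocity)   # gamma keeps the residual update conservative
    return torch.stack(edited)

# Toy usage on random tensors standing in for video latents:
T, C, H, W = 16, 4, 64, 64
video_latents = torch.randn(T, C, H, W)
reference_latent = torch.randn(C, H, W)
object_masks = (torch.rand(T, 1, H, W) > 0.5).float()
edited_latents = momentum_semantic_correction(video_latents, reference_latent, object_masks)
```

In an actual diffusion pipeline such a correction would be applied at intermediate denoising steps, with the masks produced by the class-agnostic segmentation and tracking stage described in the abstract.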
Related papers
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - MCDS-VSS: Moving Camera Dynamic Scene Video Semantic Segmentation by Filtering with Self-Supervised Geometry and Motion [17.50161162624179]
Self-driving cars rely on reliable semantic environment perception for decision making.
We propose MCDS-VSS, a structured filter model that learns in a self-supervised manner to estimate scene geometry and ego-motion of the camera.
Our model parses automotive scenes into multiple interpretable representations such as scene geometry, ego-motion, and object motion.
arXiv Detail & Related papers (2024-05-30T10:33:14Z) - Video Dynamics Prior: An Internal Learning Approach for Robust Video Enhancements [83.5820690348833]
We present a framework for low-level vision tasks that does not require any external training data corpus.
Our approach learns neural modules by optimizing over a corrupted test sequence, leveraging its spatio-temporal coherence and internal statistics.
arXiv Detail & Related papers (2023-12-13T01:57:11Z) - Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning [16.094271750354835]
Motion information is critical to a robust and generalized video representation.
Recent works have adopted frame difference as the source of motion information in video contrastive learning.
We present a framework capable of introducing well-aligned and significant motion information.
arXiv Detail & Related papers (2023-09-01T07:03:27Z) - Implicit Motion-Compensated Network for Unsupervised Video Object Segmentation [25.41427065435164]
Unsupervised video object segmentation (UVOS) aims at automatically separating the primary foreground object(s) from the background in a video sequence.
Existing UVOS methods either lack robustness when there are visually similar surroundings (appearance-based) or suffer from deterioration in the quality of their predictions because of dynamic backgrounds and inaccurate flow (flow-based).
We propose an implicit motion-compensated network (IMCNet) combining complementary cues (i.e., appearance and motion) with aligned motion information from the adjacent frames to the current frame at the feature level.
arXiv Detail & Related papers (2022-04-06T13:03:59Z) - Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)