Point-to-Point: Sparse Motion Guidance for Controllable Video Editing
- URL: http://arxiv.org/abs/2511.18277v1
- Date: Sun, 23 Nov 2025 03:59:59 GMT
- Title: Point-to-Point: Sparse Motion Guidance for Controllable Video Editing
- Authors: Yeji Song, Jaehyun Lee, Mijin Koo, JunHoo Lee, Nojun Kwak
- Abstract summary: We propose a novel motion representation, anchor tokens, that capture the most essential motion patterns by leveraging the rich prior of a video diffusion model. In experiments, anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in terms of edit and motion fidelity.
- Score: 29.888408281118846
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurately preserving motion while editing a subject remains a core challenge in video editing tasks. Existing methods often face a trade-off between edit and motion fidelity, as they rely on motion representations that are either overfitted to the layout or only implicitly defined. To overcome this limitation, we revisit point-based motion representation. However, identifying meaningful points remains challenging without human input, especially across diverse video scenarios. To address this, we propose a novel motion representation, anchor tokens, that capture the most essential motion patterns by leveraging the rich prior of a video diffusion model. Anchor tokens encode video dynamics compactly through a small number of informative point trajectories and can be flexibly relocated to align with new subjects. This allows our method, Point-to-Point, to generalize across diverse scenarios. Extensive experiments demonstrate that anchor tokens lead to more controllable and semantically aligned video edits, achieving superior performance in terms of edit and motion fidelity.
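As a rough illustration of the anchor-token idea described in the abstract, the hedged sketch below selects a small set of informative point trajectories and relocates them onto a new subject region. The saliency scores, the top-k selection, and the bounding-box remapping (`select_anchor_tokens`, `relocate_anchors`) are illustrative assumptions, not the paper's diffusion-prior-based procedure.

```python
# Minimal sketch: pick a few informative point trajectories ("anchor tokens")
# from a dense track set, then relocate them onto a new subject's region.
# Saliency and box remapping are stand-ins for the paper's actual method.
import numpy as np

def select_anchor_tokens(tracks, saliency, k=8):
    """tracks: (T, N, 2) point trajectories; saliency: (N,) importance scores."""
    idx = np.argsort(saliency)[::-1][:k]          # keep the k most informative points
    return tracks[:, idx, :]                      # (T, k, 2) anchor trajectories

def relocate_anchors(anchors, src_box, dst_box):
    """Map anchor trajectories from the source subject's box to a new subject's box."""
    (sx0, sy0, sx1, sy1), (dx0, dy0, dx1, dy1) = src_box, dst_box
    norm = (anchors - [sx0, sy0]) / [sx1 - sx0, sy1 - sy0]   # normalize to [0, 1]
    return norm * [dx1 - dx0, dy1 - dy0] + [dx0, dy0]        # rescale into the target box

# toy usage: 16 frames, 200 tracked points, random importance scores
tracks = np.random.rand(16, 200, 2) * 256
saliency = np.random.rand(200)
anchors = select_anchor_tokens(tracks, saliency, k=8)
moved = relocate_anchors(anchors, src_box=(64, 64, 192, 192), dst_box=(32, 96, 160, 224))
print(anchors.shape, moved.shape)   # (16, 8, 2) (16, 8, 2)
```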
Related papers
- Generative Video Motion Editing with 3D Point Tracks [66.55707897151909]
We present a track-conditioned V2V framework that enables joint editing of camera and object motion.
We achieve this by conditioning a model on a source video and paired 3D point tracks representing source and target motions.
Our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation.
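A hedged sketch of what track conditioning could look like: rasterized source/target 3D point tracks are encoded and concatenated with the video latents before the denoiser. The shapes, channel counts, and the tiny convolutional encoder are illustrative assumptions, not the paper's architecture.

```python
# Toy track-conditioning module: fuse rasterized (x, y, depth) track maps for
# source and target motion with per-frame video latents.
import torch
import torch.nn as nn

class TrackConditioner(nn.Module):
    def __init__(self, latent_ch=4, track_ch=6, hidden=32):
        super().__init__()
        self.encode = nn.Conv2d(track_ch, hidden, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(latent_ch + hidden, latent_ch, kernel_size=3, padding=1)

    def forward(self, latents, track_maps):
        # latents: (B, T, C, H, W); track_maps: (B, T, 6, H, W)
        b, t = latents.shape[:2]
        lat = latents.flatten(0, 1)                  # (B*T, C, H, W)
        trk = self.encode(track_maps.flatten(0, 1))  # (B*T, hidden, H, W)
        fused = self.fuse(torch.cat([lat, trk], dim=1))
        return fused.view(b, t, *fused.shape[1:])    # conditioned latents for the denoiser

cond = TrackConditioner()
latents = torch.randn(1, 8, 4, 32, 32)
track_maps = torch.randn(1, 8, 6, 32, 32)
print(cond(latents, track_maps).shape)  # torch.Size([1, 8, 4, 32, 32])
```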
arXiv Detail & Related papers (2025-12-01T18:59:55Z)
- MotionV2V: Editing Motion in a Video [53.791975554391534]
We propose modifying video motion by editing sparse trajectories extracted from the input.
We term the deviation between input and output trajectories a "motion edit".
Our approach allows for edits that start at any timestamp and propagate naturally.
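To make the "motion edit" notion concrete, the sketch below treats the edit as the deviation between input and output trajectories, applied from a chosen start timestamp onward. The linear ramp-in is a smoothness assumption, not the paper's exact propagation scheme.

```python
# Apply a motion edit (trajectory deviation) starting at a given frame.
import numpy as np

def apply_motion_edit(src_tracks, target_tracks, start_t, ramp=4):
    """src_tracks, target_tracks: (T, N, 2). Returns edited trajectories."""
    deviation = target_tracks - src_tracks                 # the "motion edit"
    T = src_tracks.shape[0]
    weight = np.zeros(T)
    weight[start_t:] = np.minimum(np.arange(T - start_t) / max(ramp, 1), 1.0)
    return src_tracks + weight[:, None, None] * deviation

src = np.cumsum(np.random.randn(24, 5, 2) * 0.5, axis=0)   # 24 frames, 5 points
tgt = src + np.array([10.0, -5.0])                          # shifted target motion
edited = apply_motion_edit(src, tgt, start_t=8)
print(np.abs(edited[:8] - src[:8]).max())                   # 0.0 before the edit starts
```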
arXiv Detail & Related papers (2025-11-25T18:57:25Z)
- ConMo: Controllable Motion Disentanglement and Recomposition for Zero-Shot Motion Transfer [44.33224798292861]
ConMo is a framework that disentangles and recomposes the motions of subjects and camera movements.
It enables more accurate motion control across diverse subjects and improves performance in multi-subject scenarios.
ConMo unlocks a wide range of applications, including subject size and position editing, subject removal, semantic modifications, and camera motion simulation.
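A toy sketch of disentangling camera and subject motion from point tracks, in the spirit of the description above: treat the median background displacement as camera motion, subtract it from subject tracks, then recompose with an edited subject (here, rescaled). The median-flow camera estimate and the background mask are simplifying assumptions.

```python
# Separate camera motion from subject motion using tracked points, then recompose.
import numpy as np

def disentangle(tracks, is_background):
    """tracks: (T, N, 2); is_background: (N,) bool mask."""
    camera = np.median(tracks[:, is_background] - tracks[:1, is_background],
                       axis=1, keepdims=True)               # (T, 1, 2) camera displacement
    subject = tracks[:, ~is_background] - camera             # camera-free subject motion
    return camera, subject

def recompose(camera, subject, scale=1.5):
    center = subject[:1].mean(axis=1, keepdims=True)
    resized = (subject - center) * scale + center             # e.g., enlarge the subject
    return resized + camera                                    # put camera motion back

tracks = np.cumsum(np.random.randn(16, 40, 2), axis=0)
mask = np.zeros(40, dtype=bool); mask[:25] = True              # first 25 points: background
cam, subj = disentangle(tracks, mask)
print(recompose(cam, subj).shape)  # (16, 15, 2)
```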
arXiv Detail & Related papers (2025-04-03T10:15:52Z)
- Segment Any Motion in Videos [80.72424676419755]
We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features.
Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support.
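The hedged sketch below loosely mirrors that combination: a motion encoder over each trajectory, a separate (decoupled) semantic embedding, and attention across trajectories before a per-trajectory moving/static score. The dimensions and the single attention layer are illustrative choices, not the paper's model.

```python
# Toy head that scores trajectories as moving vs. static from motion + semantics.
import torch
import torch.nn as nn

class TrajectoryMotionHead(nn.Module):
    def __init__(self, t=16, sem_dim=384, dim=128):
        super().__init__()
        self.motion_proj = nn.Linear(t * 2, dim)     # encode a whole (T, 2) trajectory
        self.sem_proj = nn.Linear(sem_dim, dim)      # decoupled semantic embedding
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.score = nn.Linear(dim, 1)               # moving-object logit per trajectory

    def forward(self, tracks, sem_feats):
        # tracks: (B, N, T, 2); sem_feats: (B, N, sem_dim), e.g. DINO features at each point
        motion = self.motion_proj(tracks.flatten(2))          # (B, N, dim)
        tokens = motion + self.sem_proj(sem_feats)            # motion prioritized, semantics as support
        tokens, _ = self.attn(tokens, tokens, tokens)          # attention across trajectories
        return self.score(tokens).squeeze(-1)                  # (B, N) segmentation logits

head = TrajectoryMotionHead()
logits = head(torch.randn(2, 100, 16, 2), torch.randn(2, 100, 384))
print(logits.shape)  # torch.Size([2, 100])
```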
arXiv Detail & Related papers (2025-03-28T09:34:11Z)
- VideoAnydoor: High-fidelity Video Object Insertion with Precise Motion Control [66.66226299852559]
VideoAnydoor is a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control.
To preserve detailed appearance while supporting fine-grained motion control, we design a pixel warper.
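A minimal sketch of the general pixel-warping idea referenced above: warp reference pixels according to a per-frame flow field so that appearance details follow the desired motion. Deriving a dense flow from sparse keypoints is reduced here to a single global displacement per frame, purely for illustration; it is not the paper's warper.

```python
# Warp a reference image by a dense flow field using grid_sample.
import torch
import torch.nn.functional as F

def warp_reference(ref, flow):
    """ref: (1, C, H, W); flow: (1, 2, H, W) pixel displacements. Returns warped image."""
    _, _, h, w = ref.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1          # normalize to [-1, 1] for grid_sample
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)          # (1, H, W, 2), (x, y) order
    return F.grid_sample(ref, grid, align_corners=True)

ref = torch.rand(1, 3, 64, 64)
flow = torch.full((1, 2, 64, 64), 4.0)                    # shift every pixel by (4, 4)
print(warp_reference(ref, flow).shape)                    # torch.Size([1, 3, 64, 64])
```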
arXiv Detail & Related papers (2025-01-02T18:59:54Z)
- Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose using a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment.
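A hedged sketch of the inversion step described above: an "inflated" motion embedding with several tokens per frame is the only trainable parameter and is fitted so that a frozen image-to-video model (stubbed out here) reconstructs the motion reference video. The stub model, the loss, and all shapes are placeholders, not the actual pipeline.

```python
# Optimize only a per-frame, multi-token motion embedding against a reference video.
import torch
import torch.nn as nn

frames, tokens_per_frame, dim = 16, 4, 768
motion_embedding = nn.Parameter(torch.randn(frames, tokens_per_frame, dim) * 0.02)

def stub_i2v_model(image_latent, embedding):
    # stand-in for a frozen image-to-video model conditioned on the embedding
    return image_latent.unsqueeze(0).expand(embedding.shape[0], -1, -1, -1) \
        + embedding.mean(dim=(1, 2)).view(-1, 1, 1, 1)

image_latent = torch.randn(4, 32, 32)                     # appearance comes from the image
reference_video = torch.randn(frames, 4, 32, 32)           # motion comes from this video
optim = torch.optim.Adam([motion_embedding], lr=1e-2)

for step in range(50):                                      # fit only the motion embedding
    pred = stub_i2v_model(image_latent, motion_embedding)
    loss = ((pred - reference_video) ** 2).mean()
    optim.zero_grad(); loss.backward(); optim.step()
print(round(loss.item(), 4))
```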
arXiv Detail & Related papers (2024-08-01T10:55:20Z)
- MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion [94.66090422753126]
MotionFollower is a lightweight score-guided diffusion model for video motion editing.
It delivers superior motion editing performance and exclusively supports large camera movements and actions.
Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory.
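For context, the sketch below shows generic score guidance in diffusion sampling, the family of techniques the approach above builds on: at each denoising step, the model's prediction is nudged by the gradient of a guidance energy (here a stand-in motion-consistency term). The denoiser, the energy, and the guidance weight are placeholders, not MotionFollower's actual losses.

```python
# One score-guided denoising step: add the gradient of a guidance energy.
import torch

def guided_step(denoiser, x_t, t, energy_fn, scale=1.0):
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)                                   # model's noise prediction
    grad = torch.autograd.grad(energy_fn(x_t).sum(), x_t)[0]
    return eps + scale * grad                                # guided prediction for the sampler

# toy placeholders: a fake denoiser and an energy keeping latents near a reference
denoiser = lambda x, t: torch.zeros_like(x)
reference = torch.randn(1, 4, 8, 8)
energy = lambda x: ((x - reference) ** 2).mean()
print(guided_step(denoiser, torch.randn(1, 4, 8, 8), t=10, energy_fn=energy).shape)
```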
arXiv Detail & Related papers (2024-05-30T17:57:30Z)
- Drag-A-Video: Non-rigid Video Editing with Point-based Interaction [63.78538355189017]
We propose a new diffusion-based method for interactive point-based video manipulation, called Drag-A-Video.
Our method allows users to click pairs of handle points and target points as well as masks on the first frame of an input video.
To precisely modify the contents of the video, we employ a new video-level motion supervision to update the features of the video.
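An illustrative sketch of point-based motion supervision in the spirit of the description above: the features sampled at each handle point are pulled a small step toward the target point in every frame. The feature shapes, the step size, and the bilinear sampling details are assumptions, not the paper's exact supervision.

```python
# Motion-supervision loss that drags features at handle points toward target points.
import torch
import torch.nn.functional as F

def sample_feat(feats, pts):
    """feats: (T, C, H, W); pts: (T, P, 2) in [-1, 1]. Returns (T, P, C)."""
    grid = pts.unsqueeze(1)                                   # (T, 1, P, 2)
    out = F.grid_sample(feats, grid, align_corners=True)      # (T, C, 1, P)
    return out.squeeze(2).transpose(1, 2)

def motion_supervision_loss(feats, handles, targets, step=0.1):
    """Encourage features one step toward the targets to match the handle features."""
    toward = handles + step * (targets - handles)             # points slightly closer to targets
    f_handle = sample_feat(feats, handles).detach()           # reference features (no gradient)
    f_toward = sample_feat(feats, toward)                     # features that should move
    return F.l1_loss(f_toward, f_handle)

feats = torch.randn(8, 64, 32, 32, requires_grad=True)        # per-frame video features
handles = torch.rand(8, 4, 2) * 2 - 1                          # 4 handle points per frame
targets = torch.rand(8, 4, 2) * 2 - 1
loss = motion_supervision_loss(feats, handles, targets)
loss.backward()
print(loss.item() >= 0)
```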
arXiv Detail & Related papers (2023-12-05T18:05:59Z)
- SAVE: Protagonist Diversification with Structure Agnostic Video Editing [29.693364686494274]
Previous works usually perform well on trivial and consistent shapes, but easily collapse on a difficult target whose body shape differs substantially from the original one.
We propose motion personalization that isolates the motion from a single source video and then modifies the protagonist accordingly.
We also regulate the motion word to attend to proper motion-related areas by introducing a novel pseudo optical flow.
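A toy sketch of regulating where a motion word attends, loosely following the description above: a pseudo-flow magnitude map is built from frame differences, and the motion token's cross-attention mass falling outside the high-motion region is penalized. Frame differencing as "pseudo optical flow" and the loss form are simplifying assumptions.

```python
# Penalize motion-word attention that falls outside high-motion regions.
import torch

def pseudo_flow_mask(frames, thresh=0.1):
    """frames: (T, C, H, W). Returns (T-1, H, W) mask of high-motion pixels."""
    diff = (frames[1:] - frames[:-1]).abs().mean(dim=1)       # per-pixel change magnitude
    return (diff > thresh).float()

def motion_attention_loss(attn_maps, motion_mask):
    """attn_maps: (T-1, H, W) cross-attention of the motion word."""
    attn = attn_maps / attn_maps.sum(dim=(1, 2), keepdim=True)
    outside = attn * (1 - motion_mask)                         # attention leaking off the moving region
    return outside.sum(dim=(1, 2)).mean()

frames = torch.rand(8, 3, 32, 32)
attn = torch.rand(7, 32, 32)
mask = pseudo_flow_mask(frames)
print(motion_attention_loss(attn, mask).item())
```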
arXiv Detail & Related papers (2023-12-05T05:13:20Z)
- VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence [37.85691662157054]
Video editing approaches that rely on dense correspondences are ineffective when the target edit involves a shape change.
We introduce the VideoSwap framework, inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape.
Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos.
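A toy sketch of the sparse semantic-point idea described above: with only a few corresponding points between the new subject and the source subject's trajectory, a per-frame affine alignment can drive the new subject along the source motion. The least-squares affine fit is an illustrative stand-in, not the framework's actual alignment.

```python
# Fit an affine map from a handful of semantic point correspondences.
import numpy as np

def fit_affine(src_pts, dst_pts):
    """src_pts, dst_pts: (P, 2). Returns a 2x3 affine matrix mapping src -> dst."""
    ones = np.ones((src_pts.shape[0], 1))
    A = np.hstack([src_pts, ones])                        # (P, 3)
    M, *_ = np.linalg.lstsq(A, dst_pts, rcond=None)        # (3, 2)
    return M.T                                             # (2, 3)

# a few semantic points on the new subject (frame 0) and where the source
# subject's corresponding points sit in some later frame
new_subject_pts = np.array([[10., 20.], [40., 22.], [25., 60.], [30., 80.]])
source_pts_t = new_subject_pts @ np.array([[0.9, 0.1], [-0.1, 0.9]]) + [5., -3.]
M = fit_affine(new_subject_pts, source_pts_t)
moved = new_subject_pts @ M[:, :2].T + M[:, 2]
print(np.allclose(moved, source_pts_t, atol=1e-6))         # True: points follow the source motion
```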
arXiv Detail & Related papers (2023-12-04T17:58:06Z)