Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!
- URL: http://arxiv.org/abs/2510.03550v2
- Date: Mon, 20 Oct 2025 01:47:32 GMT
- Title: Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime!
- Authors: Junbao Zhou, Yuan Zhou, Kesen Zhao, Qingshan Xu, Beier Zhu, Richang Hong, Hanwang Zhang
- Abstract summary: We propose \textbf{stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL)}, a new task that enables users to modify generated videos \emph{anytime} on \emph{anything} via fine-grained, interactive drag.
Our method can be seamlessly integrated into existing autoregressive video diffusion models.
- Score: 88.12304235156591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving streaming, fine-grained control over the outputs of autoregressive video diffusion models remains challenging, making it difficult to ensure that they consistently align with user expectations. To bridge this gap, we propose \textbf{stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL)}, a new task that enables users to modify generated videos \emph{anytime} on \emph{anything} via fine-grained, interactive drag. Beyond DragVideo and SG-I2V, REVEL unifies drag-style video manipulation as both editing and animating video frames, each supporting user-specified translation, deformation, and rotation effects, making drag operations versatile. In resolving REVEL, we observe: \emph{i}) drag-induced perturbations accumulate in latent space, causing severe latent distribution drift that halts the drag process; \emph{ii}) streaming drag is easily disturbed by context frames, thereby yielding visually unnatural outcomes. We thus propose a training-free approach, \textbf{DragStream}, comprising: \emph{i}) an adaptive distribution self-rectification strategy that leverages neighboring frames' statistics to effectively constrain the drift of latent embeddings; \emph{ii}) a spatial-frequency selective optimization mechanism that allows the model to fully exploit contextual information while mitigating its interference, by selectively propagating visual cues along the generation process. Our method can be seamlessly integrated into existing autoregressive video diffusion models, and extensive experiments firmly demonstrate the effectiveness of our DragStream.
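The two components of DragStream lend themselves to a compact illustration. The PyTorch sketch below renders both ideas as the abstract describes them: re-standardizing a drag-perturbed latent toward statistics aggregated from neighboring frames, and propagating only the low-frequency band of contextual cues so that context informs generation without overriding the drag edit. The function names, tensor shapes, and hyperparameters (`momentum`, `cutoff`) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of DragStream's two mechanisms as described in the
# abstract; all names, shapes, and hyperparameters are assumptions.
import torch
import torch.fft


def rectify_latent_distribution(z_t: torch.Tensor,
                                z_context: torch.Tensor,
                                momentum: float = 0.9) -> torch.Tensor:
    """Adaptive distribution self-rectification (sketch).

    Re-standardizes the drag-perturbed latent z_t so that its per-channel
    mean/std track statistics aggregated from neighboring context frames,
    constraining latent distribution drift.
      z_t:       (C, H, W) latent of the current frame
      z_context: (N, C, H, W) latents of neighboring frames
    """
    mu_ref = z_context.mean(dim=(0, 2, 3), keepdim=True).squeeze(0)   # (C,1,1)
    std_ref = z_context.std(dim=(0, 2, 3), keepdim=True).squeeze(0)
    mu = z_t.mean(dim=(1, 2), keepdim=True)
    std = z_t.std(dim=(1, 2), keepdim=True)
    # Pull the current statistics toward the reference statistics.
    mu_tgt = momentum * mu_ref + (1.0 - momentum) * mu
    std_tgt = momentum * std_ref + (1.0 - momentum) * std
    return (z_t - mu) / (std + 1e-6) * std_tgt + mu_tgt


def frequency_selective_blend(z_new: torch.Tensor,
                              z_context: torch.Tensor,
                              cutoff: float = 0.25) -> torch.Tensor:
    """Spatial-frequency selective propagation (sketch).

    Keeps low-frequency structure (global layout) from the context latent
    while letting high frequencies follow the newly generated latent, so
    contextual cues propagate without disturbing the drag edit.
    """
    f_new = torch.fft.fftshift(torch.fft.fft2(z_new), dim=(-2, -1))
    f_ctx = torch.fft.fftshift(torch.fft.fft2(z_context), dim=(-2, -1))
    h, w = z_new.shape[-2:]
    yy = torch.linspace(-1.0, 1.0, h).view(-1, 1).expand(h, w)
    xx = torch.linspace(-1.0, 1.0, w).view(1, -1).expand(h, w)
    low_pass = ((yy ** 2 + xx ** 2).sqrt() < cutoff).to(z_new.dtype)
    mixed = f_ctx * low_pass + f_new * (1.0 - low_pass)
    return torch.fft.ifft2(torch.fft.ifftshift(mixed, dim=(-2, -1))).real
```

In a streaming loop, the rectification would be applied to each newly dragged latent before it joins the context window, and the frequency blend would gate what the context contributes back to subsequent frames.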
Related papers
- Tuning-free Visual Effect Transfer across Videos [91.93897438317397]
RefVFX is a framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner.
We introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video.
We show that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
arXiv Detail & Related papers (2026-01-12T18:59:32Z) - Beyond the Visible: Disocclusion-Aware Editing via Proxy Dynamic Graphs [39.496648478488666]
We address image-to-video generation with explicit user control over the final frame's disoccluded regions.
We introduce a lightweight, user-editable Proxy Dynamic Graph (PDG) that drives part motion, while a frozen diffusion prior is used to synthesize plausible appearance that follows that motion.
We then let the user edit appearance in the disoccluded areas of the image, and exploit the visibility information encoded by the PDG to perform a latent-space composite that reconciles motion with user intent in these areas (see the latent-composite sketch after this list).
arXiv Detail & Related papers (2025-12-15T14:45:05Z) - InstantDrag: Improving Interactivity in Drag-based Image Editing [23.004027029130953]
Drag-based image editing has recently gained popularity for its interactivity and precision.
We introduce InstantDrag, an optimization-free pipeline that enhances interactivity and speed.
We demonstrate InstantDrag's capability to perform fast, photo-realistic edits without masks or text prompts.
arXiv Detail & Related papers (2024-09-13T14:19:27Z) - COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing (see the correspondence sketch after this list).
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - DragTraffic: Interactive and Controllable Traffic Scene Generation for Autonomous Driving [10.90477019946728]
DragTraffic is a general, interactive, and controllable traffic scene generation framework based on conditional diffusion.
We employ a regression model to provide a general initial solution, followed by a refinement process based on a conditional diffusion model to ensure diversity (see the pipeline sketch after this list).
Experiments on a real-world driving dataset show that DragTraffic outperforms existing methods in terms of authenticity, diversity, and freedom.
arXiv Detail & Related papers (2024-04-19T04:49:28Z) - DragVideo: Interactive Drag-style Video Editing [58.59845960686982]
DragVideo is a general drag-style video editing framework.
It can edit video in an intuitive manner that is faithful to the user's intention, with nearly unnoticeable distortion and artifacts, while maintaining spatio-temporal consistency.
Whereas traditional prompt-based video editing fails at the former two and directly applying image drag editing fails at the last, DragVideo achieves all three, underscoring its versatility and generality.
arXiv Detail & Related papers (2023-12-03T10:41:06Z) - TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
Video sequences generated by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z) - VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference (a minimal sketch of such an objective appears after this list).
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
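For the Proxy Dynamic Graph entry above, the latent-space composite that reconciles PDG-driven motion with user-edited appearance in disoccluded areas can be pictured as a visibility-masked blend of latents. A minimal sketch; the function name, shapes, and soft mask are assumptions for illustration, not the paper's implementation:

```python
import torch


def composite_disoccluded_latents(z_model: torch.Tensor,
                                  z_user: torch.Tensor,
                                  visibility: torch.Tensor) -> torch.Tensor:
    """Visibility-masked latent composite (sketch).

      z_model:    (C, H, W) latent proposed by the frozen diffusion prior
      z_user:     (C, H, W) latent encoding the user's appearance edit
      visibility: (H, W) in [0, 1]; 0 marks disoccluded regions where the
                  user's edit should win, 1 marks regions kept from the prior
    """
    mask = visibility.unsqueeze(0)  # broadcast the mask over channels
    return mask * z_model + (1.0 - mask) * z_user
```

A soft (rather than binary) mask is one way to avoid visible seams along visibility boundaries.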
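The diffusion feature correspondence that COVE builds on can be illustrated generically: for each token of one frame's diffusion features, select the most cosine-similar token in another frame, then use the matches to keep edits consistent across frames. A minimal sketch of the matching step only (names and shapes assumed):

```python
import torch
import torch.nn.functional as F


def token_correspondence(feat_src: torch.Tensor,
                         feat_tgt: torch.Tensor) -> torch.Tensor:
    """Nearest-neighbor token matching on diffusion features (sketch).

      feat_src, feat_tgt: (N, D) token features extracted from a
                          diffusion model for two different frames
    Returns an (N,) tensor of indices into feat_tgt: for each source
    token, the target token with the highest cosine similarity.
    """
    src = F.normalize(feat_src, dim=-1)
    tgt = F.normalize(feat_tgt, dim=-1)
    sim = src @ tgt.t()          # (N, N) pairwise cosine similarities
    return sim.argmax(dim=-1)
```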
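DragTraffic's two-stage design, a regression model that proposes an initial solution followed by conditional-diffusion refinement, reduces to a short pipeline skeleton. The callables below stand in for models the abstract describes only at a high level; the interface is an assumption, not the paper's code:

```python
def generate_traffic_scene(context, regressor, diffusion_refiner):
    """Two-stage traffic scene generation (sketch).

    context:           encoding of the scene and agent conditions
    regressor:         callable mapping context -> coarse trajectories
    diffusion_refiner: callable mapping (coarse, context) -> refined
                       trajectories via conditional diffusion
    """
    coarse = regressor(context)                 # general initial solution
    return diffusion_refiner(coarse, context)   # diversity-preserving refinement
```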
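Finally, VMC's motion distillation objective, which uses residual vectors between consecutive frames as a motion reference, can be sketched as aligning the frame-to-frame residuals of the generated latents with those of the reference video. Using cosine distance here is an assumption; the paper's exact objective may differ:

```python
import torch
import torch.nn.functional as F


def motion_distillation_loss(pred_latents: torch.Tensor,
                             ref_latents: torch.Tensor) -> torch.Tensor:
    """Motion distillation via consecutive-frame residuals (sketch).

      pred_latents, ref_latents: (T, C, H, W) latent sequences
    The residuals capture motion while discarding static appearance,
    so matching them transfers motion without copying appearance.
    """
    d_pred = pred_latents[1:] - pred_latents[:-1]   # (T-1, C, H, W)
    d_ref = ref_latents[1:] - ref_latents[:-1]
    cos = F.cosine_similarity(d_pred.flatten(1), d_ref.flatten(1), dim=1)
    return (1.0 - cos).mean()                       # average cosine distance
```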
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.