Target-Aware Video Diffusion Models
- URL: http://arxiv.org/abs/2503.18950v2
- Date: Wed, 02 Apr 2025 14:11:15 GMT
- Title: Target-Aware Video Diffusion Models
- Authors: Taeksoo Kim, Hanbyul Joo
- Abstract summary: We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target.
- Score: 9.01174307678548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a target-aware video diffusion model that generates videos from an input image in which an actor interacts with a specified target while performing a desired action. The target is defined by a segmentation mask and the desired action is described via a text prompt. Unlike existing controllable image-to-video diffusion models that often rely on dense structural or motion cues to guide the actor's movements toward the target, our target-aware model requires only a simple mask to indicate the target, leveraging the generalization capabilities of pretrained models to produce plausible actions. This makes our method particularly effective for human-object interaction (HOI) scenarios, where providing precise action guidance is challenging, and further enables the use of video diffusion models for high-level action planning in applications such as robotics. We build our target-aware model by extending a baseline model to incorporate the target mask as an additional input. To enforce target awareness, we introduce a special token that encodes the target's spatial information within the text prompt. We then fine-tune the model with our curated dataset using a novel cross-attention loss that aligns the cross-attention maps associated with this token with the input target mask. To further improve performance, we selectively apply this loss to the most semantically relevant transformer blocks and attention regions. Experimental results show that our target-aware model outperforms existing solutions in generating videos where actors interact accurately with the specified targets. We further demonstrate its efficacy in two downstream applications: video content creation and zero-shot 3D HOI motion synthesis.
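The cross-attention loss described in the abstract can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical rendering under assumed tensor shapes (names such as attn_maps, target_mask, and token_index are illustrative), not the authors' implementation; in the paper the loss is additionally restricted to the most semantically relevant transformer blocks and attention regions.
```python
import torch
import torch.nn.functional as F

def cross_attention_alignment_loss(attn_maps, target_mask, token_index):
    """Align the cross-attention map of the special target token with the
    target mask (illustrative sketch, not the paper's exact formulation).

    attn_maps:   (B, heads, H*W, T) cross-attention weights from one block;
                 queries are spatial latent positions, keys are text tokens.
    target_mask: (B, H0, W0) binary segmentation mask of the target.
    token_index: position of the special target token in the text prompt.
    """
    b, _, hw, _ = attn_maps.shape
    h = w = int(hw ** 0.5)  # assumes square latent feature maps

    # Attention each spatial location pays to the target token, head-averaged.
    token_attn = attn_maps[..., token_index].mean(dim=1).reshape(b, h, w)

    # Resize the mask to the latent resolution and normalize both maps
    # so they can be compared as spatial distributions.
    mask = F.interpolate(target_mask.unsqueeze(1).float(),
                         size=(h, w), mode="nearest").squeeze(1)
    token_attn = token_attn / (token_attn.sum(dim=(1, 2), keepdim=True) + 1e-6)
    mask = mask / (mask.sum(dim=(1, 2), keepdim=True) + 1e-6)

    # Simple L1 discrepancy; the paper's exact loss may differ.
    return (token_attn - mask).abs().sum(dim=(1, 2)).mean()
```
During fine-tuning, a term of this kind would be added to the standard diffusion denoising objective for the blocks where the loss is applied.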
Related papers
- Consistent Human Image and Video Generation with Spatially Conditioned Diffusion [82.4097906779699]
Consistent human-centric image and video synthesis aims to generate images with new poses while preserving appearance consistency with a given reference image.
We frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference.
This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network.
arXiv Detail & Related papers (2024-12-19T05:02:30Z)
- Stanceformer: Target-Aware Transformer for Stance Detection [59.69858080492586]
Stance Detection involves discerning the stance expressed in a text towards a specific subject or target.
Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively.
We introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference.
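The summary does not specify how the target is emphasized; one plausible reading, sketched below under assumed shapes and names, is an additive bias on the attention logits toward target tokens. This is an illustrative guess, not Stanceformer's actual mechanism.
```python
import torch
import torch.nn.functional as F

def target_aware_attention(q, k, v, target_token_mask, bias_weight=1.0):
    """Self-attention with an additive bias toward target tokens
    (hypothetical sketch; see the Stanceformer paper for the real design).

    q, k, v:            (B, heads, L, d) projected query/key/value tensors.
    target_token_mask:  (B, L) with 1.0 where a token belongs to the target.
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, heads, L, L)

    # Boost logits of key positions inside the target span so every query
    # attends to the target more strongly.
    bias = bias_weight * target_token_mask[:, None, None, :]  # (B, 1, 1, L)
    weights = F.softmax(scores + bias, dim=-1)
    return torch.matmul(weights, v)
```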
arXiv Detail & Related papers (2024-10-09T17:24:28Z)
- MotionCom: Automatic and Motion-Aware Image Composition with LLM and Video Diffusion Prior [51.672193627686]
MotionCom is a training-free, motion-aware, diffusion-based image composition method.
It enables seamless integration of target objects into new scenes with dynamically coherent results.
arXiv Detail & Related papers (2024-09-16T08:44:17Z)
- TAFormer: A Unified Target-Aware Transformer for Video and Motion Joint Prediction in Aerial Scenes [14.924741503611749]
We introduce a novel task called Target-Aware Aerial Video Prediction, aiming to simultaneously predict future scenes and motion states of the target.
We introduce Spatiotemporal Attention (STA), which decouples the learning of video dynamics into spatial static attention and temporal dynamic attention, effectively modeling the scene appearance and motion.
To alleviate the difficulty of distinguishing targets in blurry predictions, we introduce a Target-Sensitive Gaussian Loss (TSGL) that enhances the model's sensitivity to both the target's position and its content.
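The summary gives only the name of the loss; a plausible form, sketched here with assumed inputs (pred, target, center are illustrative), is a reconstruction loss up-weighted by a Gaussian centered on the target's position. The actual TSGL may differ in detail.
```python
import torch

def target_sensitive_gaussian_loss(pred, target, center, sigma=0.1):
    """Gaussian-weighted reconstruction loss emphasizing the target region
    (assumed form for illustration only).

    pred, target: (B, C, H, W) predicted and ground-truth frames.
    center:       (B, 2) target center as normalized (x, y) in [0, 1].
    sigma:        spread of the Gaussian emphasis in normalized coordinates.
    """
    b, _, h, w = pred.shape
    ys = torch.linspace(0, 1, h, device=pred.device)
    xs = torch.linspace(0, 1, w, device=pred.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")        # (H, W)

    cx = center[:, 0].view(b, 1, 1)
    cy = center[:, 1].view(b, 1, 1)
    weight = torch.exp(-((grid_x - cx) ** 2 + (grid_y - cy) ** 2)
                       / (2 * sigma ** 2))                        # (B, H, W)

    # Per-pixel squared error, up-weighted near the target center.
    err = ((pred - target) ** 2).mean(dim=1)                      # (B, H, W)
    return ((1.0 + weight) * err).mean()
```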
arXiv Detail & Related papers (2024-03-27T04:03:55Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos [16.93979476655776]
A key challenge with procedure planning in instructional videos is how to handle a large decision space consisting of a multitude of action types.
We introduce a simple yet effective enhancement - a masked diffusion model.
We learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions.
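As a concrete illustration of the "prompting a pre-trained vision-language model" step, the sketch below embeds action phrases with CLIP via Hugging Face transformers; the model choice and prompt template are assumptions for illustration, not the paper's setup.
```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Embed action descriptions with a pretrained vision-language model using
# an action-focused prompt template (hypothetical template and model).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

actions = ["pour water", "whisk eggs", "cut tomato"]
prompts = [f"a person performing the action of {a}" for a in actions]

inputs = processor(text=prompts, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)             # (3, 512)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)    # unit-normalized
```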
arXiv Detail & Related papers (2023-09-14T03:25:37Z)
- Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks [11.61956970623165]
We propose a generic module called "Co-Segmentation Module Activation" (COSAM) to promote co-segmentation-based attention across a sequence of video frame features.
We demonstrate COSAM on three video-based tasks: 1) video-based person re-ID, 2) video captioning, and 3) video action classification.
arXiv Detail & Related papers (2021-11-14T15:35:37Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatiotemporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Knowing What, Where and When to Look: Efficient Video Action Modeling with Attention [84.83632045374155]
Attentive video modeling is essential for action recognition in unconstrained videos.
Our What-Where-When (W3) video attention module models all three facets of video attention jointly.
Experiments show that our attention model brings significant improvements to existing action recognition models.
arXiv Detail & Related papers (2020-04-02T21:48:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.