DreamRelation: Relation-Centric Video Customization
- URL: http://arxiv.org/abs/2503.07602v1
- Date: Mon, 10 Mar 2025 17:58:03 GMT
- Title: DreamRelation: Relation-Centric Video Customization
- Authors: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, Hongming Shan
- Abstract summary: Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization. We propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement.
- Score: 33.65405972817795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Relational video customization refers to the creation of personalized videos that depict user-specified relations between two subjects, a crucial task for comprehending real-world visual content. While existing methods can personalize subject appearances and motions, they still struggle with complex relational video customization, where precise relational modeling and high generalization across subject categories are essential. The primary challenge arises from the intricate spatial arrangements, layout variations, and nuanced temporal dynamics inherent in relations; consequently, current models tend to overemphasize irrelevant visual details rather than capturing meaningful interactions. To address these challenges, we propose DreamRelation, a novel approach that personalizes relations through a small set of exemplar videos, leveraging two key components: Relational Decoupling Learning and Relational Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle relations from subject appearances using a relation LoRA triplet and a hybrid mask training strategy, ensuring better generalization across diverse relationships. Furthermore, we determine the optimal design of the relation LoRA triplet by analyzing the distinct roles of the query, key, and value features within MM-DiT's attention mechanism, making DreamRelation the first relational video generation framework with explainable components. Second, in Relational Dynamics Enhancement, we introduce a space-time relational contrastive loss, which prioritizes relational dynamics while minimizing the reliance on detailed subject appearances. Extensive experiments demonstrate that DreamRelation outperforms state-of-the-art methods in relational video customization. Code and models will be made publicly available.
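The abstract names two mechanisms, a relation LoRA triplet attached to MM-DiT's query/key/value projections and a space-time relational contrastive loss, without giving implementation details. The sketch below is a minimal, hedged illustration of both ideas in PyTorch; the module names (`LoRALinear`, `RelationLoRAAttention`, `relational_contrastive_loss`), the rank and temperature values, and the InfoNCE-style formulation of the loss are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch, assuming standard PyTorch (>=2.0); names and hyperparameters
# are illustrative and do not come from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # keep the pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # the low-rank update starts at zero
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))


class RelationLoRAAttention(nn.Module):
    """Attention block whose query, key, and value projections each carry a
    separate LoRA -- a rough stand-in for the 'relation LoRA triplet', where
    distinct low-rank adapters specialize the q/k/v paths so relational
    patterns can be learned apart from subject appearance."""

    def __init__(self, dim: int, num_heads: int = 8, rank: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.q = LoRALinear(nn.Linear(dim, dim), rank)
        self.k = LoRALinear(nn.Linear(dim, dim), rank)
        self.v = LoRALinear(nn.Linear(dim, dim), rank)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        h = self.num_heads
        q = self.q(x).view(b, n, h, d // h).transpose(1, 2)
        k = self.k(x).view(b, n, h, d // h).transpose(1, 2)
        v = self.v(x).view(b, n, h, d // h).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)  # (b, h, n, d // h)
        return self.out(attn.transpose(1, 2).reshape(b, n, d))


def relational_contrastive_loss(anchor, positive, negatives, tau: float = 0.07):
    """InfoNCE-style stand-in for the space-time relational contrastive loss:
    pull together features of clips that share the same relation and push away
    features dominated by subject appearance.  anchor/positive: (B, D);
    negatives: (B, K, D)."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos = (anchor * positive).sum(dim=-1, keepdim=True) / tau   # (B, 1)
    neg = torch.einsum("bd,bkd->bk", anchor, negatives) / tau   # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, target)


# Toy usage: one attention block and one loss evaluation on random tensors.
block = RelationLoRAAttention(dim=64)
tokens = torch.randn(2, 16, 64)   # (batch, tokens, dim)
out = block(tokens)               # (2, 16, 64)
loss = relational_contrastive_loss(
    torch.randn(4, 64), torch.randn(4, 64), torch.randn(4, 8, 64)
)
print(out.shape, loss.item())
```

In a full training setup, the contrastive loss would presumably operate on relation features pooled over space and time from exemplar clips that share the same relation; the abstract does not specify this, so the pooling step is omitted from the sketch.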
Related papers
- Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition [53.02634128715853]
Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars.
We propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR.
It unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view.
arXiv Detail & Related papers (2025-04-14T10:23:22Z) - KGIF: Optimizing Relation-Aware Recommendations with Knowledge Graph Information Fusion [16.971592142597544]
This study introduces a specialized framework designed to merge entity and relation embeddings explicitly through a tailored self-attention mechanism.
This explicit fusion enhances the interplay between user-item interactions and item-attribute relationships, providing a nuanced balance between user-centric and item-centric representations.
The contributions of this work include an innovative method for explicit information fusion, improved robustness for sparse knowledge graphs, and the ability to generate explainable recommendations through interpretable path visualization.
arXiv Detail & Related papers (2025-01-07T22:19:15Z) - RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations, whether across multiple images or within a video.
Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z) - Open-Vocabulary Video Relation Extraction [37.40717383505057]
We introduce Open-vocabulary Video Relation Extraction (OVRE), a novel task that views action understanding through the lens of action-centric relation triplets.
OVRE focuses on pairwise relations that take part in the action and describes these relation triplets in natural language.
We curate the Moments-OVRE dataset, which comprises 180K videos with action-centric relation triplets, sourced from a multi-label action classification dataset.
arXiv Detail & Related papers (2023-12-25T09:29:34Z) - Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs [16.760066844287046]
We propose a Time-variant Relation-aware TRansformer (TR$^2$) to model the temporal change of relations in dynamic scene graphs.
We show that TR$^2$ significantly outperforms previous state-of-the-art methods under two different settings.
arXiv Detail & Related papers (2023-05-15T10:30:38Z) - Relational Self-Attention: What's Missing in Attention for Video Understanding [52.38780998425556]
We introduce a relational feature transform, dubbed relational self-attention (RSA).
Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts.
arXiv Detail & Related papers (2021-11-02T15:36:11Z) - Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z) - Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely the Multi-Granularity Reference-aided Attentive Feature Aggregation module (MG-RAFA).
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding (a minimal sketch of this cascade appears after this entry).
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
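As referenced above, the cascaded HOI entry describes a coarse-to-fine pipeline in which an instance localization network repeatedly refines proposals before an interaction recognition network classifies each human-object pair. The following is a minimal sketch of that two-stage pattern in PyTorch; the feature dimension, stage count, interaction-class count, and all module names are illustrative assumptions rather than the paper's actual architecture.

```python
# Minimal sketch of a cascaded, coarse-to-fine HOI pipeline (illustrative only:
# dimensions, stage count, and class count are assumptions, not the paper's).
import torch
import torch.nn as nn


class InstanceLocalizationHead(nn.Module):
    """Predicts box refinements for object proposals from their features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.refine = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, proposal_feats, boxes):
        # Coarse-to-fine step: add predicted deltas to the current boxes.
        return boxes + self.refine(proposal_feats)


class InteractionRecognitionHead(nn.Module):
    """Classifies the interaction from a human-centric pair feature."""

    def __init__(self, dim: int = 256, num_interactions: int = 117):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, num_interactions)
        )

    def forward(self, human_feats, object_feats):
        pair = torch.cat([human_feats, object_feats], dim=-1)
        return self.classifier(pair)


# Toy forward pass: 8 human-object pairs, two cascade stages.
num_pairs, dim = 8, 256
human_feats = torch.randn(num_pairs, dim)
object_feats = torch.randn(num_pairs, dim)
object_boxes = torch.rand(num_pairs, 4)

localizer = InstanceLocalizationHead(dim)
recognizer = InteractionRecognitionHead(dim)

for _ in range(2):  # each stage refines localization, then recognizes interactions
    object_boxes = localizer(object_feats, object_boxes)
    interaction_logits = recognizer(human_feats, object_feats)

print(interaction_logits.shape)  # torch.Size([8, 117])
```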