Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval
- URL: http://arxiv.org/abs/2401.13329v2
- Date: Mon, 29 Jan 2024 10:38:36 GMT
- Title: Generative Video Diffusion for Unseen Cross-Domain Video Moment Retrieval
- Authors: Dezhao Luo, Shaogang Gong, Jiabo Huang, Hailin Jin, Yang Liu
- Abstract summary: Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships.
Existing methods resort to joint training on both source and target domain videos for cross-domain applications.
We explore generative video diffusion for fine-grained editing of source videos controlled by the target sentences.
- Score: 58.17315970207874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Moment Retrieval (VMR) requires precise modelling of fine-grained
moment-text associations to capture intricate visual-language relationships.
Due to the lack of a diverse and generalisable VMR dataset to facilitate
learning scalable moment-text associations, existing methods resort to joint
training on both source and target domain videos for cross-domain applications.
Meanwhile, recent developments in vision-language multimodal models pre-trained
on large-scale image-text and/or video-text pairs are only based on coarse
associations (weakly labelled). They are inadequate to provide fine-grained
moment-text correlations required for cross-domain VMR. In this work, we solve
the problem of unseen cross-domain VMR, where certain visual and textual
concepts do not overlap across domains, by only utilising target domain
sentences (text prompts) without accessing their videos. To that end, we
explore generative video diffusion for fine-grained editing of source videos
controlled by the target sentences, enabling us to simulate target domain
videos. We address two problems in video editing for optimising unseen domain
VMR: (1) generation of high-quality simulation videos of different moments with
subtle distinctions, (2) selection of simulation videos that complement
existing source training videos without introducing harmful noise or
unnecessary repetitions. On the first problem, we formulate a two-stage video
diffusion generation controlled simultaneously by (1) the original video
structure of a source video, (2) subject specifics, and (3) a target sentence
prompt. This ensures fine-grained variations between video moments. On the
second problem, we introduce a hybrid selection mechanism that combines two
quantitative metrics for noise filtering with one qualitative metric that
leverages VMR predictions to select simulation videos.
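To make the selection step concrete, below is a minimal sketch of such a hybrid filter. The abstract does not specify the actual metrics or thresholds, so every function name (`text_video_similarity`, `source_redundancy`, `vmr_confidence`), threshold, and the top-k cut-off here is an illustrative assumption, not the paper's implementation.

```python
# Hypothetical sketch of a hybrid selection mechanism for simulation videos.
# Two quantitative metrics filter out noisy generations; one qualitative,
# VMR-prediction-based score ranks what remains. All functions, thresholds,
# and constants are illustrative assumptions.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimulationVideo:
    video_id: str
    target_sentence: str   # target-domain text prompt used for editing
    frames: object         # decoded frames or features (placeholder)


def select_simulation_videos(
    candidates: List[SimulationVideo],
    text_video_similarity: Callable[[SimulationVideo], float],  # quantitative metric 1
    source_redundancy: Callable[[SimulationVideo], float],      # quantitative metric 2
    vmr_confidence: Callable[[SimulationVideo], float],         # qualitative, VMR-based metric
    sim_threshold: float = 0.25,
    redundancy_threshold: float = 0.9,
    keep_top_k: int = 100,
) -> List[SimulationVideo]:
    """Keep generated videos that are faithful to the target sentence,
    are not near-duplicates of existing source videos, and look useful
    to a VMR model."""
    filtered = []
    for video in candidates:
        # Quantitative filter 1: drop videos weakly aligned with their prompt.
        if text_video_similarity(video) < sim_threshold:
            continue
        # Quantitative filter 2: drop videos that merely repeat source content.
        if source_redundancy(video) > redundancy_threshold:
            continue
        filtered.append(video)

    # Qualitative ranking: prefer videos on which a VMR model already makes
    # confident moment predictions, then keep the top-k for training.
    filtered.sort(key=vmr_confidence, reverse=True)
    return filtered[:keep_top_k]
```

In practice the two quantitative scores might come from a pre-trained video-text encoder and a feature-space comparison against the source set, but the abstract does not pin these choices down.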
Related papers
- VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement [63.4357918830628]
VideoRepair is a model-agnostic, training-free video refinement framework.
It identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback.
VideoRepair substantially outperforms recent baselines across various text-video alignment metrics.
arXiv Detail & Related papers (2024-11-22T18:31:47Z)
- Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels [34.88705952395676]
Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video given a text query description (sentence).
We introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer.
Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain.
arXiv Detail & Related papers (2024-06-03T21:14:53Z)
- MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method that generates a video showing multiple events given multiple individual sentences from the user.
Our method does not require a large-scale video dataset since it uses a pre-trained text-to-video generative model without fine-tuning.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Domain Adaptive Video Segmentation via Temporal Consistency Regularization [32.77436219094282]
This paper presents DA-VSN, a domain adaptive video segmentation network that addresses domain gaps in videos by temporal consistency regularization (TCR).
The first is cross-domain TCR that guides the prediction of target frames to have similar temporal consistency as that of source frames (learnt from annotated source data) via adversarial learning.
The second is intra-domain TCR that guides unconfident predictions of target frames to have similar temporal consistency as confident predictions of target frames.
arXiv Detail & Related papers (2021-07-23T02:50:42Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) retrieves a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)