Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video
Grounding
- URL: http://arxiv.org/abs/2312.13633v1
- Date: Thu, 21 Dec 2023 07:49:27 GMT
- Title: Multi-Modal Domain Adaptation Across Video Scenes for Temporal Video
Grounding
- Authors: Haifeng Huang, Yang Zhao, Zehan Wang, Yan Xia, Zhou Zhao
- Abstract summary: Temporal Video Grounding (TVG) aims to localize the temporal boundary of a specific segment in an untrimmed video based on a given language query.
We introduce a novel AMDA method to adaptively adjust the model's scene-related knowledge by incorporating insights from the target data.
- Score: 59.599378814835205
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Video Grounding (TVG) aims to localize the temporal boundary of a
specific segment in an untrimmed video based on a given language query. Since
datasets in this domain are often gathered from limited video scenes, models
tend to overfit to scene-specific factors, which leads to suboptimal
performance when encountering new scenes in real-world applications. In a new
scene, the fine-grained annotations are often insufficient due to the expensive
labor cost, while the coarse-grained video-query pairs are easier to obtain.
Thus, to address this issue and enhance model performance on new scenes, we
explore the TVG task in an unsupervised domain adaptation (UDA) setting across
scenes for the first time, where the video-query pairs in the source scene
(domain) are labeled with temporal boundaries, while those in the target scene
are not. Under the UDA setting, we introduce a novel Adversarial Multi-modal
Domain Adaptation (AMDA) method to adaptively adjust the model's scene-related
knowledge by incorporating insights from the target data. Specifically, we
tackle the domain gap by utilizing domain discriminators, which help identify
valuable scene-related features effective across both domains. Concurrently, we
mitigate the semantic gap between different modalities by aligning video-query
pairs with related semantics. Furthermore, we employ a mask-reconstruction
approach to enhance the understanding of temporal semantics within a scene.
Extensive experiments on Charades-STA, ActivityNet Captions, and YouCook2
demonstrate the effectiveness of our proposed method.
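The abstract only outlines the adversarial component of AMDA, and the authors' code is not reproduced here. As a rough illustration of how a domain discriminator with gradient reversal can push fused video-query features to be scene-invariant, a minimal PyTorch sketch might look as follows; the class names, feature dimension, and loss weighting are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DomainDiscriminator(nn.Module):
    """Predicts whether a fused video-query feature comes from the source or the target scene."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feats: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
        # Gradient reversal makes the feature extractor *maximize* the discriminator
        # loss, which encourages scene-invariant features.
        reversed_feats = GradientReversal.apply(feats, lambd)
        return self.net(reversed_feats).squeeze(-1)


def adversarial_domain_loss(disc, src_feats, tgt_feats, lambd: float = 1.0):
    """Binary domain classification loss over source (label 0) and target (label 1) features."""
    src_logits = disc(src_feats, lambd)
    tgt_logits = disc(tgt_feats, lambd)
    loss_src = F.binary_cross_entropy_with_logits(src_logits, torch.zeros_like(src_logits))
    loss_tgt = F.binary_cross_entropy_with_logits(tgt_logits, torch.ones_like(tgt_logits))
    return 0.5 * (loss_src + loss_tgt)
```

In a full pipeline this adversarial term would be combined with the supervised grounding loss on labeled source pairs, plus the cross-modal alignment and mask-reconstruction objectives mentioned in the abstract.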
Related papers
- ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding [40.60371529725805]
We propose an efficient preliminary in-domain fine-tuning paradigm for feature adaptation.
We introduce Action-Cue-Injected Temporal Prompt Learning (ActPrompt), which injects action cues into the image encoder of a VLM to better discover action-sensitive patterns.
arXiv Detail & Related papers (2024-08-13T04:18:32Z)
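ActPrompt's exact prompt design is not detailed above. As a generic sketch of injecting learnable prompt tokens into a frozen image encoder, the usual prompt-tuning pattern this summary suggests, one could write something like the following; the class name, prompt length, and encoder interface are hypothetical:

```python
import torch
import torch.nn as nn


class PromptedImageEncoder(nn.Module):
    """Prepends learnable prompt tokens to patch embeddings before a frozen encoder.

    This is a generic visual-prompt-tuning sketch, not the ActPrompt implementation.
    `encoder` is assumed to be any module mapping (batch, seq_len, embed_dim)
    tokens to (batch, seq_len, embed_dim) outputs.
    """

    def __init__(self, encoder: nn.Module, embed_dim: int, num_prompts: int = 8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # keep the backbone frozen
            p.requires_grad = False
        # Only these prompt tokens are trained, e.g. to capture action-sensitive cues.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim)
        b = patch_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([prompts, patch_tokens], dim=1)
        return self.encoder(tokens)
```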
- Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels [34.88705952395676]
Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video given a text query description (sentence).
We introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer.
Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain.
arXiv Detail & Related papers (2024-06-03T21:14:53Z)
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
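The dual-contrastive objective is only named above. A minimal sketch of one plausible ingredient, a symmetric InfoNCE loss aligning transcript and frame embeddings of the same segment, is shown below; the temperature and normalization choices are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F


def symmetric_infonce(text_emb: torch.Tensor, frame_emb: torch.Tensor, temperature: float = 0.07):
    """Pulls together transcript/frame embeddings of the same segment, pushes apart other pairs.

    text_emb, frame_emb: (batch, dim) embeddings where row i of each tensor describes
    the same video segment.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    frame_emb = F.normalize(frame_emb, dim=-1)
    logits = text_emb @ frame_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Average the text-to-frame and frame-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```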
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
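The consistency module itself is not described above. The sketch below illustrates the general idea of equivariant consistency for temporal boundaries, using temporal reversal as a stand-in augmentation: boundaries predicted on the transformed video, mapped back through the inverse transform, should agree with boundaries predicted on the original video. The specific augmentation, model interface, and L1 penalty are assumptions:

```python
import torch
import torch.nn.functional as F


def boundary_consistency_loss(model, video_feats, query_feats):
    """Equivariant-consistency sketch for temporal sentence grounding.

    `model(video_feats, query_feats)` is assumed to return normalized (start, end)
    boundaries in [0, 1] with shape (batch, 2); video_feats is (batch, time, dim).
    """
    reversed_feats = torch.flip(video_feats, dims=[1])   # reverse the video in time

    pred = model(video_feats, query_feats)               # boundaries on the original video
    pred_rev = model(reversed_feats, query_feats)        # boundaries on the reversed video

    # Under temporal reversal, a segment (s, e) becomes (1 - e, 1 - s);
    # map the reversed prediction back to the original timeline.
    start_back = 1.0 - pred_rev[:, 1]
    end_back = 1.0 - pred_rev[:, 0]
    pred_back = torch.stack([start_back, end_back], dim=-1)
    return F.l1_loss(pred_back, pred)
```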
- What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
Spatio-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Scene Consistency Representation Learning for Video Scene Segmentation [26.790491577584366]
We propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from long-term videos.
We present an SSL scheme to achieve scene consistency, while exploring considerable data augmentation and shuffling methods to boost the model generalizability.
Our method achieves state-of-the-art performance on the task of Video Scene Segmentation.
arXiv Detail & Related papers (2022-05-11T13:31:15Z)
- Domain Adaptive Video Segmentation via Temporal Consistency Regularization [32.77436219094282]
This paper presents DA-VSN, a domain adaptive video segmentation network that addresses domain gaps in videos via temporal consistency regularization (TCR).
The first is cross-domain TCR that guides the prediction of target frames to have similar temporal consistency as that of source frames (learnt from annotated source data) via adversarial learning.
The second is intra-domain TCR that guides unconfident predictions of target frames to have similar temporal consistency as confident predictions of target frames.
arXiv Detail & Related papers (2021-07-23T02:50:42Z)
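DA-VSN's full method relies on optical-flow warping and adversarial learning. As a simplified illustration of the intra-domain TCR idea, pulling unconfident predictions of a target frame toward the confident, detached prediction of the previous frame, a sketch without flow warping might look like this; the confidence threshold and KL formulation are assumptions:

```python
import torch
import torch.nn.functional as F


def intra_domain_tcr_loss(logits_prev, logits_curr, conf_threshold: float = 0.9):
    """Simplified intra-domain temporal consistency regularization.

    logits_prev, logits_curr: (batch, classes, H, W) segmentation logits for two
    consecutive target-domain frames (flow-based warping is omitted here).
    Unconfident pixels of the current frame are pulled toward the confident,
    detached prediction of the previous frame.
    """
    prob_prev = F.softmax(logits_prev, dim=1).detach()          # pseudo-target, no gradient
    log_prob_curr = F.log_softmax(logits_curr, dim=1)

    conf_prev, _ = prob_prev.max(dim=1)                         # (batch, H, W)
    conf_curr, _ = F.softmax(logits_curr, dim=1).max(dim=1)

    # Regularize only where the previous frame is confident and the current one is not.
    mask = (conf_prev > conf_threshold) & (conf_curr <= conf_threshold)
    if mask.sum() == 0:
        return logits_curr.new_zeros(())

    kl = F.kl_div(log_prob_curr, prob_prev, reduction="none").sum(dim=1)   # (batch, H, W)
    return (kl * mask.float()).sum() / mask.float().sum()
```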
- DRIV100: In-The-Wild Multi-Domain Dataset and Evaluation for Real-World Domain Adaptation of Semantic Segmentation [9.984696742463628]
This work presents a new multi-domain dataset, DRIV100, for benchmarking domain adaptation techniques on in-the-wild road-scene videos collected from the Internet.
The dataset consists of pixel-level annotations for 100 videos selected to cover diverse scenes/domains based on two criteria: human subjective judgment and an anomaly score computed using an existing road-scene dataset.
arXiv Detail & Related papers (2021-01-30T04:43:22Z)
- Boundary-sensitive Pre-training for Temporal Localization in Videos [124.40788524169668]
We investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task.
With the synthesized boundaries, BSP can be carried out simply by classifying the boundary types.
Extensive experiments show that the proposed BSP is superior and complementary to the existing action classification based pre-training counterpart.
arXiv Detail & Related papers (2020-11-21T17:46:24Z)
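The BSP boundary synthesis is only outlined above. A minimal sketch of the idea, stitching two clips to create an artificial boundary and training a classifier on the boundary type, could look like the following; the three-way label scheme and feature-level stitching are assumptions rather than the paper's exact recipe:

```python
import random
import torch
import torch.nn as nn

# Hypothetical boundary-type labels for the pretext task.
NO_BOUNDARY, SAME_CLASS_BOUNDARY, DIFF_CLASS_BOUNDARY = 0, 1, 2


def synthesize_boundary(clip_a, clip_b, label_a: int, label_b: int):
    """Stitch two feature sequences of shape (time, dim) at random points to create a boundary."""
    cut_a = random.randint(1, clip_a.size(0) - 1)
    cut_b = random.randint(1, clip_b.size(0) - 1)
    stitched = torch.cat([clip_a[:cut_a], clip_b[cut_b:]], dim=0)
    boundary_type = SAME_CLASS_BOUNDARY if label_a == label_b else DIFF_CLASS_BOUNDARY
    return stitched, boundary_type


class BoundaryTypeClassifier(nn.Module):
    """Pools a (possibly stitched) feature sequence and predicts its boundary type."""

    def __init__(self, feat_dim: int = 512, num_types: int = 3):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_types)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, feat_dim); mean-pool over time before classification.
        return self.head(seq.mean(dim=1))
```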