Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding
- URL: http://arxiv.org/abs/2311.08835v4
- Date: Wed, 3 Jul 2024 18:05:02 GMT
- Title: Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding
- Authors: WonJun Moon, Sangeek Hyun, SuBeen Lee, Jae-Pil Heo
- Abstract summary: The Correlation-Guided DEtection TRansformer (CG-DETR) provides clues for query-associated video clips within the cross-modal attention.
CG-DETR achieves state-of-the-art results on various temporal grounding benchmarks.
- Score: 22.59291334338824
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Temporal grounding aims to identify specific moments or highlights in a video that correspond to a textual description. Typical approaches to temporal grounding treat all video clips equally during encoding, regardless of their semantic relevance to the text query. We therefore propose the Correlation-Guided DEtection TRansformer (CG-DETR), which provides clues for query-associated video clips within the cross-modal attention. First, we design an adaptive cross-attention with dummy tokens. Conditioned on the text query, the dummy tokens take a portion of the attention weights, preventing irrelevant video clips from being represented by the text query. Yet not all words equally inherit the text query's correlation to video clips, so we further guide the cross-attention map by inferring the fine-grained correlation between video clips and words. We enable this by learning a joint embedding space for high-level concepts, i.e., at the moment and sentence levels, and inferring the clip-word correlation from it. Lastly, we exploit moment-specific characteristics and combine them with the context of each video to form a moment-adaptive saliency detector that precisely measures the highlightness of each clip according to its degree of text engagement. CG-DETR achieves state-of-the-art results on various temporal grounding benchmarks. Code is available at https://github.com/wjun0830/CGDETR.
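Since the abstract describes the dummy-token mechanism only in prose, the following minimal PyTorch sketch shows one plausible reading: learnable tokens, conditioned on the sentence embedding, are appended to the text side of the cross-attention so they can absorb attention mass from text-irrelevant clips. All module and argument names (`DummyTokenCrossAttention`, `num_dummies`, the conditioning layer) are our own assumptions inferred from the abstract, not the authors' released code.

```python
import torch
import torch.nn as nn

class DummyTokenCrossAttention(nn.Module):
    """Cross-attention from video clips (queries) to text words (keys/values),
    with learnable dummy tokens appended to the text side. Conditioned on the
    sentence embedding, the dummy tokens can absorb attention weight so that
    text-irrelevant clips are not forced to be represented by the query words.
    A simplified sketch of the idea in the CG-DETR abstract, not the authors'
    implementation; all names and design choices here are assumptions."""

    def __init__(self, dim: int, num_dummies: int = 3):
        super().__init__()
        self.dummy = nn.Parameter(torch.randn(num_dummies, dim))
        self.cond = nn.Linear(dim, dim)   # condition dummies on the sentence
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, clips, words, sent):
        # clips: (B, Nv, D) video clip features
        # words: (B, Nt, D) word features
        # sent:  (B, D)    sentence-level text embedding
        # Condition the dummy tokens on the text query (sentence embedding).
        dummies = self.dummy.unsqueeze(0) + self.cond(sent).unsqueeze(1)
        kv = torch.cat([words, dummies], dim=1)        # (B, Nt + M, D)
        q = self.q_proj(clips)
        k = self.k_proj(kv)
        v = self.v_proj(kv)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        # Dummy tokens take a share of each clip's attention mass; only the
        # remainder flows from the real words into the clip representation.
        out = attn @ v
        return out, attn[..., :words.size(1)]  # clip features, clip-word map
```

Under these assumptions, `DummyTokenCrossAttention(256)(clips, words, sent)` with `clips` of shape (B, Nv, 256), `words` of shape (B, Nt, 256), and `sent` of shape (B, 256) returns calibrated clip features plus the clip-word attention map that the paper's correlation guidance would further refine.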
Related papers
- Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding [17.110563457914324]
Video Temporal Grounding aims to identify visual frames in a video clip that match text queries.
Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences.
We introduce a visual frame-level gate mechanism that incorporates holistic textual information.
arXiv Detail & Related papers (2024-10-17T14:31:02Z)
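As a rough illustration of the frame-level gate described in this entry, the sketch below scores each visual frame against a holistic sentence embedding and scales it accordingly. The module name and the two-layer gate design are our assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class HolisticTextGate(nn.Module):
    """Gate each visual frame by holistic (sentence-level) textual
    information. A hedged reading of the abstract above; the names and
    architecture here are guesses, not the paper's implementation."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, frames, sent):
        # frames: (B, N, D) frame features; sent: (B, D) sentence embedding
        s = sent.unsqueeze(1).expand(-1, frames.size(1), -1)
        g = self.gate(torch.cat([frames, s], dim=-1))  # (B, N, 1) in [0, 1]
        return frames * g  # suppress frames weakly related to the sentence
```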
- Leveraging Temporal Contextualization for Video Action Recognition [47.8361303269338]
We propose a framework for video understanding called Temporally Contextualized CLIP (TC-CLIP).
We introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos.
The Video-Prompting (VP) module processes context tokens to generate informative prompts in the text modality.
arXiv Detail & Related papers (2024-04-15T06:24:56Z)
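The TC and VP modules are only named in the summary above; the hedged sketch below shows one plausible reading of Temporal Contextualization: a few learnable queries summarize tokens across all frames into video-level context tokens, which every frame token then attends to (and which could feed a prompt generator on the text side). Shapes, names, and the attention-based design are our guesses, not TC-CLIP's code.

```python
import torch
import torch.nn as nn

class TemporalContextualization(nn.Module):
    """Sketch of layer-wise temporal infusion: pool tokens from all frames
    into a small set of video-level context tokens, then let every frame
    token attend to them. Inferred from the abstract; not TC-CLIP code."""

    def __init__(self, dim: int, num_context: int = 4, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_context, dim))
        self.summarize = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.infuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, T, N, D) patch tokens for T frames
        B, T, N, D = tokens.shape
        flat = tokens.reshape(B, T * N, D)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        context, _ = self.summarize(q, flat, flat)    # (B, C, D) video context
        out, _ = self.infuse(flat, context, context)  # frames read the context
        # The context tokens would also feed a Video-Prompting module that
        # generates prompts in the text modality (not sketched here).
        return (flat + out).reshape(B, T, N, D), context
```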
- Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z)
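A minimal sketch of the pipeline this summary describes: a transformer aggregates frame-level features into a video representation (via a CLS token, which is our assumption) and scores it against embeddings from a frozen pre-trained text encoder. The class name and layer counts are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextMatcher(nn.Module):
    """Sketch for the weakly supervised setting above: a transformer
    aggregates frame-level features into a video representation that is
    matched against pre-encoded texts. Details are our assumptions."""

    def __init__(self, dim: int = 512, layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cls = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, frames, text_emb):
        # frames: (B, T, D); text_emb: (B, D) from a frozen text encoder
        x = torch.cat([self.cls.expand(frames.size(0), -1, -1), frames], dim=1)
        video = self.encoder(x)[:, 0]          # CLS token as video repr.
        video = F.normalize(video, dim=-1)
        text = F.normalize(text_emb, dim=-1)
        return video @ text.t()                # (B, B) similarity matrix
```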
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
Correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
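To make the moment-retrieval part of this pretext task concrete, here is a hedged sketch of a boundary head that fuses frame features with the sentence embedding and emits per-frame start/end logits. The fusion and head design are our simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    """Predict start/end boundaries in a video given a text description,
    as in the moment-retrieval pretext task above. Our own sketch."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, frames, sent):
        # frames: (B, T, D); sent: (B, D) text description embedding
        s = sent.unsqueeze(1).expand(-1, frames.size(1), -1)
        h = torch.relu(self.fuse(torch.cat([frames, s], dim=-1)))
        # Softmax over time turns each into a boundary distribution.
        return self.start(h).squeeze(-1), self.end(h).squeeze(-1)  # (B, T)
```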
- TempCLR: Temporal Alignment Representation with Contrastive Learning [35.12182087403215]
We propose TempCLR, a contrastive learning framework that explicitly compares the full video with the paragraph.
Beyond pre-training on videos and paragraphs, our approach also generalizes to matching between video instances.
arXiv Detail & Related papers (2022-12-28T08:10:31Z)
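The explicit full-video/paragraph comparison can be illustrated with a standard symmetric InfoNCE loss over matched pairs, sketched below. This is a generic baseline form under our assumptions; TempCLR's actual objective also models temporal structure across the sequence, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def video_paragraph_nce(video_emb, para_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (video, paragraph) pairs: each video
    should score highest against its own paragraph and vice versa.
    A generic contrastive sketch, not TempCLR's full loss."""
    v = F.normalize(video_emb, dim=-1)  # (B, D), one per full video
    p = F.normalize(para_emb, dim=-1)   # (B, D), one per paragraph
    logits = v @ p.t() / temperature    # (B, B): diagonal = positives
    labels = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```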
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
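The frame-distillation idea in FineCo's summary, keeping the frames semantically closest to the text, can be sketched as a simple top-k similarity selection. The selection rule, the value of k, and the function name are our assumptions, not the paper's sampling strategy.

```python
import torch
import torch.nn.functional as F

def select_relevant_frames(frame_emb, text_emb, k=8):
    """Pick the k frames most similar to the text, illustrating the
    distillation idea described above; our simplification of FineCo."""
    f = F.normalize(frame_emb, dim=-1)            # (T, D) frame embeddings
    t = F.normalize(text_emb, dim=-1)             # (D,) sentence embedding
    scores = f @ t                                # (T,) frame-text similarity
    idx = scores.topk(min(k, f.size(0))).indices  # best-matching frames
    return frame_emb[idx], idx
```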
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in a sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) that explicitly enhances dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- HunYuan_tvr for Text-Video Retrieval [23.650824732136158]
HunYuan_tvr explores hierarchical cross-modal interactions by simultaneously modeling video-sentence, clip-phrase, and frame-word relationships.
HunYuan_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet respectively.
arXiv Detail & Related papers (2022-04-07T11:59:36Z)
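The three interaction levels named in this summary (video-sentence, clip-phrase, frame-word) suggest a hierarchical similarity of the following shape. Max-over-alignment pooling and equal weighting of the levels are our assumptions, not HunYuan_tvr's actual fusion.

```python
import torch
import torch.nn.functional as F

def hierarchical_similarity(frame_emb, word_emb, clip_emb, phrase_emb,
                            video_emb, sent_emb):
    """Combine frame-word, clip-phrase, and video-sentence similarities,
    the three levels named in the HunYuan_tvr summary. The pooling and
    equal weighting here are illustrative assumptions."""
    def max_mean_sim(a, b):  # a: (Na, D), b: (Nb, D)
        sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()
        return sim.max(dim=1).values.mean()  # best match per row, averaged

    s_fw = max_mean_sim(frame_emb, word_emb)    # fine: frames vs. words
    s_cp = max_mean_sim(clip_emb, phrase_emb)   # mid: clips vs. phrases
    s_vs = F.cosine_similarity(video_emb, sent_emb, dim=0)  # global level
    return (s_fw + s_cp + s_vs) / 3.0
```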
- Video and Text Matching with Conditioned Embeddings [81.81028089100727]
We present a method for matching a text sentence from a given corpus to a given video clip, and vice versa.
In this work, we encode the dataset in a way that takes the query's relevant information into account.
We show that our conditioned representation can be transferred to video-guided machine translation, where we improved the current results on VATEX.
arXiv Detail & Related papers (2021-10-21T17:31:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.