Leveraging Temporal Contextualization for Video Action Recognition
- URL: http://arxiv.org/abs/2404.09490v2
- Date: Wed, 24 Jul 2024 05:08:08 GMT
- Title: Leveraging Temporal Contextualization for Video Action Recognition
- Authors: Minji Kim, Dongyoon Han, Taekyung Kim, Bohyung Han
- Abstract summary: We propose a framework for video understanding called Temporally Contextualized CLIP (TC-CLIP).
We introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos.
The Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality.
- Score: 47.8361303269338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel framework for video understanding, called Temporally Contextualized CLIP (TC-CLIP), which leverages essential temporal information through global interactions in a spatio-temporal domain within a video. To be specific, we introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos, which 1) extracts core information from each frame, 2) connects relevant information across frames for the summarization into context tokens, and 3) leverages the context tokens for feature encoding. Furthermore, the Video-conditional Prompting (VP) module processes context tokens to generate informative prompts in the text modality. Extensive experiments in zero-shot, few-shot, base-to-novel, and fully-supervised action recognition validate the effectiveness of our model. Ablation studies for TC and VP support our design choices. Our project page with the source code is available at https://github.com/naver-ai/tc-clip
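To make the three TC steps and the VP module above concrete, the following PyTorch-style sketch is a minimal, hypothetical rendition: the token-selection rule (top-k by norm), the number of context tokens, and the module names are illustrative assumptions rather than the authors' released code (see the linked repository for the actual implementation).

```python
# Minimal, hypothetical sketch of Temporal Contextualization (TC) and
# Video-conditional Prompting (VP). Shapes, the saliency rule, and the
# number of context tokens are illustrative assumptions, not TC-CLIP's
# released implementation.
import torch
import torch.nn as nn


class TemporalContextualization(nn.Module):
    def __init__(self, dim: int, tokens_per_frame: int = 4, num_context: int = 16):
        super().__init__()
        self.tokens_per_frame = tokens_per_frame   # "core" tokens kept per frame
        self.queries = nn.Parameter(torch.randn(num_context, dim))
        self.summarize = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T, N, D) patch tokens from a CLIP-like image encoder
        B, T, N, D = frame_tokens.shape
        # 1) extract core information from each frame (here: top-k tokens by L2 norm)
        scores = frame_tokens.norm(dim=-1)                          # (B, T, N)
        idx = scores.topk(self.tokens_per_frame, dim=-1).indices    # (B, T, k)
        core = torch.gather(
            frame_tokens, 2, idx.unsqueeze(-1).expand(-1, -1, -1, D)
        )                                                           # (B, T, k, D)
        # 2) connect core tokens across frames and summarize into context tokens
        core = core.reshape(B, T * self.tokens_per_frame, D)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)             # (B, C, D)
        context, _ = self.summarize(q, core, core)                  # (B, C, D)
        # 3) the caller concatenates `context` with per-frame tokens before the
        #    next encoder layer so that feature encoding sees video-level cues
        return context


class VideoConditionalPrompting(nn.Module):
    """Generates text-side prompt vectors conditioned on the context tokens."""

    def __init__(self, dim: int, num_prompts: int = 8):
        super().__init__()
        self.prompt_queries = nn.Parameter(torch.randn(num_prompts, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        q = self.prompt_queries.unsqueeze(0).expand(context.size(0), -1, -1)
        prompts, _ = self.cross_attn(q, context, context)           # (B, P, D)
        return prompts  # prepended to the text encoder's token sequence
```

In TC-CLIP itself the context tokens are infused layer-wise inside the vision encoder; the standalone modules above only convey the overall data flow.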
Related papers
- FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance [47.88160253507823]
We introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism.
The Cross-frame Textual Guidance Module (CTGM) incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively.
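As a purely illustrative reading of that placement (TII at the beginning, TAR in the middle, TFB at the end of cross-attention), here is a hedged PyTorch sketch; every sub-module body is a generic stand-in and does not reflect FancyVideo's actual architecture.

```python
# Illustrative-only sketch of placing TII / TAR / TFB around a cross-attention
# block, as the summary describes; the module internals are placeholders.
import torch
import torch.nn as nn


class CrossFrameTextualGuidance(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.tii = nn.GRU(dim, dim, batch_first=True)       # Temporal Information Injector (stand-in)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.tar = nn.Linear(dim, dim)                       # Temporal Affinity Refiner (stand-in)
        self.tfb = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))  # Temporal Feature Booster (stand-in)

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame latents; text_feats: (B, L, D) text tokens
        injected, _ = self.tii(frame_feats)                  # beginning: inject temporal cues into queries
        attended, _ = self.attn(injected, text_feats, text_feats)
        refined = attended + self.tar(attended)              # middle: refine cross-frame affinity
        return frame_feats + self.tfb(refined)               # end: boost temporal features, residual
```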
arXiv Detail & Related papers (2024-08-15T14:47:44Z) - MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
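A hedged sketch of that narration idea follows; `narrate` and `embed_text` are hypothetical callables standing in for the MLLM and the text encoder, and the matching scheme is a simplification rather than the paper's pipeline.

```python
# Hypothetical sketch of using an MLLM as a "video narrator" to enrich the
# text side for moment retrieval. `narrate` and `embed_text` are placeholder
# callables; the real paper's prompting and fusion are not reproduced here.
from typing import Callable, List, Sequence

import torch
import torch.nn.functional as F


def retrieve_moment(
    query: str,
    segment_frames: Sequence,                          # raw frames per candidate segment
    narrate: Callable[[object], str],                  # MLLM: frames -> textual narration (assumed API)
    embed_text: Callable[[List[str]], torch.Tensor],   # text encoder: list[str] -> (N, D)
) -> int:
    # Generate a plausible narration for every candidate segment.
    narrations = [narrate(frames) for frames in segment_frames]
    # Embed the query and the narrations into the same text space.
    query_emb = embed_text([query])                    # (1, D)
    narration_emb = embed_text(narrations)             # (S, D)
    # Match the query against the generated narrations; the best-matching
    # segment is returned as the localized moment.
    sims = F.cosine_similarity(query_emb, narration_emb)   # (S,)
    return int(sims.argmax().item())
```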
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Correlation-Guided Query-Dependency Calibration for Video Temporal Grounding [22.59291334338824]
Correlation-Guided DEtection TRansformer provides clues for query-associated video clips.
CG-DETR achieves state-of-the-art results on various benchmarks for temporal grounding.
arXiv Detail & Related papers (2023-11-15T10:22:35Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
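The sketch below is only a loose, assumed illustration of projecting visual and textual features (plus a commonsense prior) into one common space for fast alignment-based grounding; it does not reproduce the paper's design.

```python
# Minimal, assumed sketch of projecting visual and textual features into a
# shared space for alignment-based grounding; the commonsense branch is
# represented only by an extra embedding input and does not mirror the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CommonSpaceAlignment(nn.Module):
    def __init__(self, vis_dim: int, txt_dim: int, common_dim: int = 256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, common_dim)
        self.txt_proj = nn.Linear(txt_dim, common_dim)

    def forward(self, clip_feats, query_feat, commonsense_feat):
        # clip_feats: (B, T, Dv) per-clip visual features
        # query_feat: (B, Dt) sentence feature; commonsense_feat: (B, Dt) prior
        v = F.normalize(self.vis_proj(clip_feats), dim=-1)                      # (B, T, C)
        t = F.normalize(self.txt_proj(query_feat + commonsense_feat), dim=-1)   # (B, C)
        # Alignment score per clip; grounding picks the highest-scoring span.
        return torch.einsum("btc,bc->bt", v, t)
```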
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z) - Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
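As a rough illustration of scoring all (start, end) index pairs with a biaffine mechanism, the following sketch uses an assumed parameterization (a bilinear plus a linear term); it is not the paper's implementation.

```python
# Hedged sketch of biaffine scoring over every (start, end) pair of clip
# indices; dimensions and the exact parameterization are assumptions.
import torch
import torch.nn as nn


class BiaffineSpanScorer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.start_mlp = nn.Linear(dim, dim)
        self.end_mlp = nn.Linear(dim, dim)
        self.U = nn.Parameter(torch.randn(dim, dim) * 0.02)   # bilinear term
        self.w = nn.Linear(2 * dim, 1)                        # linear term + bias

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (B, T, D) query-conditioned clip features
        s = torch.relu(self.start_mlp(clip_feats))            # (B, T, D)
        e = torch.relu(self.end_mlp(clip_feats))              # (B, T, D)
        bilinear = torch.einsum("bid,de,bje->bij", s, self.U, e)   # (B, T, T)
        pair = torch.cat(
            [s.unsqueeze(2).expand(-1, -1, e.size(1), -1),
             e.unsqueeze(1).expand(-1, s.size(1), -1, -1)], dim=-1)
        linear = self.w(pair).squeeze(-1)                     # (B, T, T)
        # Entry (i, j) scores the segment starting at clip i and ending at j;
        # the upper triangle (j >= i) holds the valid candidate spans.
        return bilinear + linear
```

Training would supervise the (start, end) entry of the annotated segment; that part is omitted from the sketch.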