Video Question Answering Using CLIP-Guided Visual-Text Attention
- URL: http://arxiv.org/abs/2303.03131v2
- Date: Wed, 8 Mar 2023 11:35:51 GMT
- Title: Video Question Answering Using CLIP-Guided Visual-Text Attention
- Authors: Shuhong Ye, Weikai Kong, Chenglin Yao, Jianfeng Ren, Xudong Jiang
- Abstract summary: Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA).
We propose a visual-text attention mechanism that exploits Contrastive Language-Image Pre-training (CLIP), trained on a large corpus of general-domain language-image pairs.
The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and outperforms state-of-the-art methods.
- Score: 17.43377106246301
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal learning of video and text plays a key role in Video Question
Answering (VideoQA). In this paper, we propose a visual-text attention mechanism that
utilizes Contrastive Language-Image Pre-training (CLIP), trained on a large corpus of
general-domain language-image pairs, to guide cross-modal learning for VideoQA.
Specifically, we first extract video features using a TimeSformer and text features
using a BERT model from the target application domain, and utilize CLIP to extract a
pair of visual-text features from the general-knowledge domain through domain-specific
learning. We then propose a Cross-domain Learning module to extract the attention
information between visual and linguistic features across the target domain and the
general domain. The CLIP-guided visual-text features are integrated to predict the
answer. The proposed method is evaluated on the MSVD-QA and MSRVTT-QA datasets and
outperforms state-of-the-art methods.
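As a rough illustration of the pipeline described above, the PyTorch-style sketch below shows one plausible way to wire the components together: target-domain video features (TimeSformer) and text features (BERT) attend to the corresponding CLIP visual and text features, and the guided features are fused to classify the answer. This is a minimal sketch under assumed dimensions and interfaces, not the authors' implementation; the multi-head attention, mean pooling, and classifier head are stand-ins for the paper's Cross-domain Learning and answer-prediction modules.

```python
# Hypothetical sketch of a CLIP-guided visual-text attention model for VideoQA.
# Feature extractors (TimeSformer, BERT, CLIP) are assumed to run upstream and
# provide token-level features; all dimensions below are illustrative.
import torch
import torch.nn as nn

class CLIPGuidedVideoQA(nn.Module):
    def __init__(self, d_video=768, d_text=768, d_clip=512, d_model=512,
                 n_heads=8, n_answers=1000):
        super().__init__()
        # Project target-domain (TimeSformer/BERT) and general-domain (CLIP)
        # features into a shared space.
        self.proj_video = nn.Linear(d_video, d_model)
        self.proj_text = nn.Linear(d_text, d_model)
        self.proj_clip_v = nn.Linear(d_clip, d_model)
        self.proj_clip_t = nn.Linear(d_clip, d_model)
        # Cross-domain attention: target-domain tokens attend to CLIP tokens
        # (standard multi-head attention used here as a stand-in).
        self.attn_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Integrate the CLIP-guided visual and text features to predict the answer.
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_answers))

    def forward(self, video_feats, text_feats, clip_visual, clip_text):
        # video_feats: (B, Tv, d_video) TimeSformer tokens
        # text_feats:  (B, Tt, d_text)  BERT tokens for the question
        # clip_visual, clip_text: (B, Nc, d_clip) CLIP features
        v = self.proj_video(video_feats)
        t = self.proj_text(text_feats)
        cv = self.proj_clip_v(clip_visual)
        ct = self.proj_clip_t(clip_text)
        v_guided, _ = self.attn_video(v, cv, cv)  # video attends to CLIP visual
        t_guided, _ = self.attn_text(t, ct, ct)   # text attends to CLIP text
        fused = torch.cat([v_guided.mean(dim=1), t_guided.mean(dim=1)], dim=-1)
        return self.classifier(fused)             # logits over the answer vocabulary
```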
Related papers
- 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation [13.622700558266658]
We propose using frozen pre-trained vision-language models (VLM) as backbones, with a specific emphasis on enhancing cross-modal feature interaction.
Firstly, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the domain-gap issue.
Secondly, we add more cross-modal feature fusion in the pipeline to enhance the utilization of multi-modal information.
arXiv Detail & Related papers (2024-06-07T11:15:03Z)
- Hybrid-Learning Video Moment Retrieval across Multi-Domain Labels [34.88705952395676]
Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video given a text query description (sentence).
We introduce a new approach called hybrid-learning video moment retrieval to solve the problem by knowledge transfer.
Our aim is to explore shared universal knowledge between the two domains in order to improve model learning in the weakly-labelled target domain.
arXiv Detail & Related papers (2024-06-03T21:14:53Z)
- Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition [84.31749632725929]
In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method.
Our key idea is to distinguish video representations from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains.
arXiv Detail & Related papers (2024-03-03T16:48:16Z)
- A Review of Deep Learning for Video Captioning [111.1557921247882]
Video captioning (VC) is a fast-moving, cross-disciplinary area of research.
This survey covers deep learning-based VC, including, but not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, and dense video captioning (DVC).
arXiv Detail & Related papers (2023-04-22T15:30:54Z)
- Learning video embedding space with Natural Language Supervision [1.6822770693792823]
We propose a novel approach to map video embedding space to natural language.
We propose a two-stage approach that first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode the visual features for the video domain.
arXiv Detail & Related papers (2023-03-25T23:24:57Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN), for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z)
- Learning to Locate Visual Answer in Video Corpus Using Question [21.88924465126168]
We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate the visual answer in instructional videos.
We propose a cross-modal contrastive global-span (CCGS) method for the VCVAL, jointly training the video corpus retrieval and visual answer localization subtasks.
Experimental results show that the proposed method outperforms other competitive methods both in the video corpus retrieval and visual answer localization subtasks.
arXiv Detail & Related papers (2022-10-11T13:04:59Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video [21.88924465126168]
Temporal answering grounding in video (TAGV) is a new task derived from temporal sentence grounding in video (TSGV).
Existing methods tend to formulate the TAGV task with a visual span-based question answering (QA) approach by matching the visual frame span queried by the text question.
We propose a visual-prompt text span localizing (VPTSL) method, which enhances the text span localization in the pre-trained language model (PLM) with the visual highlight features.
arXiv Detail & Related papers (2022-03-13T14:42:53Z)
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.