Visual Commonsense-aware Representation Network for Video Captioning
- URL: http://arxiv.org/abs/2211.09469v1
- Date: Thu, 17 Nov 2022 11:27:15 GMT
- Title: Visual Commonsense-aware Representation Network for Video Captioning
- Authors: Pengpeng Zeng, Haonan Zhang, Lianli Gao, Xiangpeng Li, Jin Qian, Heng Tao Shen
- Abstract summary: We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating its effectiveness.
- Score: 84.67432867555044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating consecutive descriptions for videos, i.e., Video Captioning, requires taking full advantage of visual representations along with the generation process. Existing video captioning methods focus on exploring spatial-temporal representations and their relationships to produce inferences. However, such methods only exploit the superficial associations contained in the video itself, without considering the intrinsic visual commonsense knowledge that exists across a video dataset, which may hinder their ability to reason about accurate descriptions. To address this problem, we propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN), for video captioning. Specifically, we construct a Video Dictionary, a plug-and-play component, obtained by clustering all video features from the entire dataset into multiple cluster centers without additional annotation. Each center implicitly represents a visual commonsense concept in the video domain, and is utilized in our proposed Visual Concept Selection (VCS) to obtain a video-related concept feature. Next, a Conceptual Integration Generation (CIG) module is proposed to enhance caption generation. Extensive experiments on three public video captioning benchmarks, MSVD, MSR-VTT, and VATEX, demonstrate that our method reaches state-of-the-art performance, indicating its effectiveness. In addition, our approach is integrated into an existing video question answering method and improves its performance, further demonstrating the generalization ability of our method. Source code has been released at https://github.com/zchoi/VCRN.
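As a concrete illustration of the pipeline described above, the sketch below assumes k-means for building the Video Dictionary and dot-product attention for Visual Concept Selection; the abstract does not fix these choices, and the function and variable names are illustrative rather than taken from the released code.

```python
# Sketch of a Video Dictionary plus Visual Concept Selection, assuming
# k-means clustering and dot-product attention; names are illustrative.
import numpy as np
from sklearn.cluster import KMeans


def build_video_dictionary(video_features: np.ndarray, num_concepts: int = 512) -> np.ndarray:
    """Cluster dataset-level video features into concept centers.

    video_features: (num_videos, dim) pooled features of every training video.
    Returns the dictionary of cluster centers, shape (num_concepts, dim).
    """
    kmeans = KMeans(n_clusters=num_concepts, n_init=10, random_state=0)
    kmeans.fit(video_features)
    return kmeans.cluster_centers_


def select_concepts(query: np.ndarray, dictionary: np.ndarray) -> np.ndarray:
    """Attend over the dictionary to get a video-related concept feature.

    query: (dim,) feature of the current video; dictionary: (K, dim).
    Returns a (dim,) concept feature as an attention-weighted sum of centers.
    """
    scores = dictionary @ query / np.sqrt(query.shape[-1])   # (K,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax over centers
    return weights @ dictionary                               # (dim,)


# Usage with stand-in features:
# features = np.random.randn(10000, 2048)
# dictionary = build_video_dictionary(features, num_concepts=512)
# concept = select_concepts(features[0], dictionary)
```

Because the dictionary is built offline from unannotated video features, it can be attached to an existing captioning or question answering model, which matches the plug-and-play framing in the abstract.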
Related papers
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using language as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
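One common way to realize the time and box tokens mentioned above is to discretize continuous timestamps and box coordinates into a small vocabulary of special tokens. The sketch below shows that generic recipe under assumed bin counts and token formats; it is not OmniVid's actual vocabulary.

```python
# Sketch of turning continuous time spans and boxes into discrete tokens,
# one common way to realize "time and box tokens"; bin counts and token
# formats are assumptions, not OmniVid's actual design.
def quantize(value: float, low: float, high: float, num_bins: int = 1000) -> int:
    """Map a continuous value in [low, high] to a discrete bin index."""
    value = min(max(value, low), high)
    return int((value - low) / (high - low) * (num_bins - 1))


def time_tokens(start_s: float, end_s: float, duration_s: float, num_bins: int = 1000):
    """Encode a temporal span as two special tokens, e.g. '<time_79>'."""
    return [f"<time_{quantize(t, 0.0, duration_s, num_bins)}>" for t in (start_s, end_s)]


def box_tokens(x1, y1, x2, y2, width, height, num_bins: int = 1000):
    """Encode a box as four tokens over normalized image coordinates."""
    coords = [(x1, width), (y1, height), (x2, width), (y2, height)]
    return [f"<box_{quantize(c, 0.0, m, num_bins)}>" for c, m in coords]


# Example: time_tokens(2.4, 7.9, duration_s=30.0) -> ['<time_79>', '<time_263>']
```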
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
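The CLIP-score-guided frame sampling mentioned in the entry above can be approximated by ranking frames with an off-the-shelf CLIP model and keeping the highest-scoring ones. The sketch below is one possible reading, not VaQuitA's released implementation; the checkpoint name and the assumption that frames arrive as a list of PIL images are placeholders.

```python
# Sketch of CLIP-score-ranked frame selection; one possible reading of the
# idea, not VaQuitA's actual code. Assumes `frames` is a list of PIL images.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def top_k_frames(frames, query: str, k: int = 8):
    """Rank frames by CLIP image-text similarity and keep the top k in temporal order."""
    inputs = processor(text=[query], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image: (num_frames, 1) similarity of every frame to the query
    scores = outputs.logits_per_image.squeeze(-1)
    keep = torch.topk(scores, k=min(k, len(frames))).indices.sort().values
    return [frames[i] for i in keep.tolist()]
```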
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performance on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z)
- Learning video embedding space with Natural Language Supervision [1.6822770693792823]
We propose a novel approach to map the video embedding space to natural language.
We propose a two-stage approach that first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode the visual features for the video domain.
arXiv Detail & Related papers (2023-03-25T23:24:57Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a pair of original and processed videos rather than on a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as a source of supervision for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)
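The video-text pairing objective in the last entry is commonly implemented as a symmetric contrastive loss over in-batch pairs. The sketch below shows that generic formulation with placeholder embedding shapes; it is not the paper's exact training recipe.

```python
# Generic video-text pairing loss (InfoNCE-style), a sketch rather than the
# paper's actual objective; embeddings are assumed to be precomputed.
import torch
import torch.nn.functional as F


def pairing_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric cross-entropy over in-batch video-text similarities.

    video_emb, text_emb: (batch, dim) embeddings of matched video/text pairs.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature        # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


# Usage with random stand-in embeddings:
# loss = pairing_loss(torch.randn(32, 512), torch.randn(32, 512))
```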