GL-RG: Global-Local Representation Granularity for Video Captioning
- URL: http://arxiv.org/abs/2205.10706v1
- Date: Sun, 22 May 2022 02:00:09 GMT
- Title: GL-RG: Global-Local Representation Granularity for Video Captioning
- Authors: Liqi Yan, Qifan Wang, Yiming Cui, Fuli Feng, Xiaojun Quan, Xiangyu
Zhang, Dongfang Liu
- Abstract summary: We propose a GL-RG framework for video captioning, namely a Global-Local Representation Granularity.
Our GL-RG demonstrates three advantages over the prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce rich semantic vocabulary to obtain a descriptive granularity of video contents across frames; and 3) we develop an incremental training strategy which organizes model learning in an incremental fashion to incur an optimal captioning behavior.
- Score: 52.56883051799501
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video captioning is a challenging task as it needs to accurately transform
visual understanding into natural language description. To date,
state-of-the-art methods inadequately model global-local representation across
video frames for caption generation, leaving plenty of room for improvement. In
this work, we approach the video captioning task from a new perspective and
propose a GL-RG framework for video captioning, namely a
\textbf{G}lobal-\textbf{L}ocal \textbf{R}epresentation \textbf{G}ranularity.
Our GL-RG demonstrates three advantages over the prior efforts: 1) we
explicitly exploit extensive visual representations from different video ranges
to improve linguistic expression; 2) we devise a novel global-local encoder to
produce rich semantic vocabulary to obtain a descriptive granularity of video
contents across frames; 3) we develop an incremental training strategy which
organizes model learning in an incremental fashion to incur an optimal
captioning behavior. Experimental results on the challenging MSR-VTT and MSVD
datasets show that our GL-RG outperforms recent state-of-the-art methods by a
significant margin. Code is available at \url{https://github.com/ylqi/GL-RG}.
Related papers
- DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval [73.82017200889906]
Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query.
We propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention.
In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts.
arXiv Detail & Related papers (2024-01-19T09:58:06Z) - Videoprompter: an ensemble of foundational models for zero-shot video
understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
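The first sentence above describes the generic zero-shot scoring step that Videoprompter builds on. A minimal CLIP-based sketch of that step (not the Videoprompter ensemble itself) follows; the "a video of ..." prompt template, the ViT-B/32 backbone, and mean-pooling over frames are assumptions.

```python
# Hedged sketch of zero-shot video classification via visual-text similarity
# (not Videoprompter's full pipeline).
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def classify_video(frames: list[Image.Image], class_names: list[str]) -> str:
    images = torch.stack([preprocess(f) for f in frames]).to(device)  # (T, 3, H, W)
    texts = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)
    with torch.no_grad():
        frame_feats = model.encode_image(images)            # (T, D)
        video_feat = frame_feats.mean(dim=0, keepdim=True)  # temporal pooling
        text_feats = model.encode_text(texts)               # (C, D)
    video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = (video_feat @ text_feats.T).squeeze(0)           # cosine similarities
    return class_names[int(sims.argmax())]
```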
arXiv Detail & Related papers (2023-10-23T19:45:46Z) - Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - LGDN: Language-Guided Denoising Network for Video-Language Modeling [30.99646752913056]
We propose an efficient and effective model, termed Language-Guided Denoising Network (LGDN) for video-language modeling.
Our LGDN dynamically filters out the misaligned or redundant frames under the language supervision and obtains only 2--4 salient frames per video for cross-modal token-level alignment.
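A minimal sketch of this kind of language-guided frame filtering (not the authors' LGDN implementation) is to score every frame against the sentence feature and keep only the top-k salient frames; the dot-product scoring and k=4 below are assumptions.

```python
# Illustrative language-guided frame selection: keep the k frames most
# relevant to the sentence feature (not the LGDN architecture itself).
import torch
import torch.nn.functional as F

def select_salient_frames(frame_feats: torch.Tensor,   # (T, D) per-frame features
                          sent_feat: torch.Tensor,     # (D,) sentence feature
                          k: int = 4) -> torch.Tensor:
    frames = F.normalize(frame_feats, dim=-1)
    sent = F.normalize(sent_feat, dim=-1)
    scores = frames @ sent                              # (T,) frame-text relevance
    top_idx = scores.topk(min(k, frames.size(0))).indices.sort().values  # keep temporal order
    return frame_feats[top_idx]                         # (k, D) salient frames only

frames = torch.randn(32, 512)   # e.g. 32 candidate frames
sentence = torch.randn(512)
print(select_salient_frames(frames, sentence).shape)    # torch.Size([4, 512])
```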
arXiv Detail & Related papers (2022-09-23T03:35:59Z) - CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language, forcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z) - Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z) - ActBERT: Learning Global-Local Video-Text Representations [74.29748531654474]
We introduce ActBERT for self-supervised learning of joint video-text representations from unlabeled data.
We leverage global action information to catalyze the mutual interactions between linguistic texts and local regional objects.
ActBERT significantly outperforms state-of-the-art methods, demonstrating its superiority in video-text representation learning.
arXiv Detail & Related papers (2020-11-14T07:14:08Z) - Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
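As a toy illustration of decomposing video-text matching into global and local levels (not the HGR model itself, which reasons over a hierarchical semantic graph), one can combine a coarse video-sentence similarity with a finer frame-word similarity; the equal weighting and max-over-frames pooling below are assumptions.

```python
# Toy global-to-local video-text similarity (illustrative only).
import torch
import torch.nn.functional as F

def global_local_similarity(frame_feats: torch.Tensor,  # (T, D) frame features
                            word_feats: torch.Tensor    # (N, D) word features
                            ) -> torch.Tensor:
    frames = F.normalize(frame_feats, dim=-1)
    words = F.normalize(word_feats, dim=-1)
    # global level: whole-video vs. whole-sentence cosine similarity
    global_sim = F.normalize(frames.mean(0), dim=-1) @ F.normalize(words.mean(0), dim=-1)
    # local level: each word matched to its best frame, averaged over words
    local_sim = (words @ frames.T).max(dim=-1).values.mean()
    return 0.5 * global_sim + 0.5 * local_sim

video = torch.randn(20, 512)
caption = torch.randn(7, 512)
print(global_local_similarity(video, caption))  # scalar matching score
```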