Global2Local: A Joint-Hierarchical Attention for Video Captioning
- URL: http://arxiv.org/abs/2203.06663v1
- Date: Sun, 13 Mar 2022 14:31:54 GMT
- Title: Global2Local: A Joint-Hierarchical Attention for Video Captioning
- Authors: Chengpeng Dai, Fuhai Chen, Xiaoshuai Sun, Rongrong Ji, Qixiang Ye,
Yongjian Wu
- Abstract summary: We propose a novel joint-hierarchical attention model for video captioning, which embeds the key clips, the key frames and the key regions jointly into the captioning model.
Such a joint-hierarchical attention model first conducts a global selection to identify key frames, followed by a Gumbel sampling operation to further identify key regions based on the key frames.
- Score: 123.12188554567079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, automatic video captioning has attracted increasing attention,
where the core challenge lies in capturing the key semantic items, like objects
and actions as well as their spatial-temporal correlations from the redundant
frames and semantic content. To this end, existing works select either the key
video clips at a global level (across multiple frames) or key regions within each
frame, which, however, neglect the hierarchical order, i.e., key frames first
and key regions later. In this paper, we propose a novel joint-hierarchical
attention model for video captioning, which embeds the key clips, the key
frames and the key regions jointly into the captioning model in a hierarchical
manner. Such a joint-hierarchical attention model first conducts a global
selection to identify key frames, followed by a Gumbel sampling operation to
further identify key regions based on the key frames, achieving an accurate
global-to-local feature representation to guide the captioning. Extensive
quantitative evaluations on two public benchmark datasets, MSVD and MSR-VTT,
demonstrate the superiority of the proposed method over state-of-the-art
methods.
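To make the global-to-local idea concrete, here is a minimal PyTorch sketch of a frame-then-region selection step. It is illustrative only, not the authors' code: the function name, shapes, and the exact way frame attention gates the region pick are all hypothetical, but the Gumbel-Softmax region selection mirrors the Gumbel sampling operation described in the abstract.

```python
import torch
import torch.nn.functional as F

def global_to_local_select(frame_feats, region_feats, query, tau=1.0):
    """Hypothetical sketch of a global-to-local selection step.

    frame_feats:  (T, D)    per-frame features
    region_feats: (T, R, D) per-frame region features
    query:        (D,)      current decoder/query state
    """
    # Global step: soft attention over frames to locate key frames.
    frame_attn = torch.softmax(frame_feats @ query, dim=0)        # (T,)

    # Local step: Gumbel-Softmax over the regions of each frame,
    # giving a near-discrete pick of one key region per frame.
    region_scores = region_feats @ query                          # (T, R)
    region_attn = F.gumbel_softmax(region_scores, tau=tau, hard=True)
    key_regions = (region_attn.unsqueeze(-1) * region_feats).sum(dim=1)  # (T, D)

    # Global guides local: weight the selected regions by frame importance.
    return (frame_attn.unsqueeze(-1) * key_regions).sum(dim=0)    # (D,)
```

With `hard=True`, the forward pass picks a single region per frame while gradients still flow through the soft relaxation, which is the usual reason to prefer Gumbel-Softmax over a plain argmax.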
Related papers
- A Challenging Multimodal Video Summary: Simultaneously Extracting and
Generating Keyframe-Caption Pairs from Video [20.579167394855197]
This paper proposes a practical multimodal video summarization task setting and dataset to train and evaluate the task.
The target task involves summarizing a given video into a number of keyframe-caption pairs and displaying them in a listable format to grasp the video content quickly.
This task is useful as a practical application and presents a highly challenging problem worthy of study.
arXiv Detail & Related papers (2023-12-04T02:17:14Z) - Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [77.97246496316515]
Open-Vocabulary Keypoint Detection (OVKD) task is innovatively designed to use text prompts for identifying arbitrary keypoints across any species.
We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM)
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
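As a rough picture of such vision-language interplay, the sketch below (hypothetical; not the KDSM implementation) matches a text-prompt embedding against dense local visual features by cosine similarity to produce a keypoint heatmap:

```python
import torch
import torch.nn.functional as F

def keypoint_heatmap(visual_feats, text_embed):
    """Hypothetical semantic-feature matching for one keypoint prompt.

    visual_feats: (H, W, D) dense local visual features
    text_embed:   (D,)      embedding of a prompt such as "left eye"
    """
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_embed, dim=0)
    heatmap = v @ t  # (H, W) cosine similarity at every location
    return heatmap

# The predicted keypoint is the highest-scoring location:
# y, x = divmod(int(heatmap.argmax()), heatmap.shape[1])
```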
arXiv Detail & Related papers (2023-10-08T07:42:41Z) - Local-Global Associative Frame Assemble in Video Re-ID [57.7470971197962]
Noisy and unrepresentative frames in automatically generated object bounding boxes from video sequences cause challenges in learning discriminative representations in video re-identification (Re-ID).
Most existing methods tackle this problem by assessing the importance of video frames according to either their local part alignments or global appearance correlations separately.
In this work, we jointly explore both local alignments and global correlations, with further consideration of their mutual promotion/reinforcement.
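A minimal sketch of how the two cues could be fused into per-frame weights is given below; the feature shapes, the prototype construction, and the multiplicative fusion are assumptions for illustration, not the paper's actual formulation:

```python
import torch
import torch.nn.functional as F

def joint_frame_weights(part_feats, global_feats):
    """Hypothetical fusion of local-alignment and global-correlation cues.

    part_feats:   (T, P, D) per-frame body-part features
    global_feats: (T, D)    per-frame global appearance features
    """
    # Local cue: how well each frame's parts align with per-part prototypes.
    proto = part_feats.mean(dim=0)                                        # (P, D)
    local = F.cosine_similarity(part_feats, proto.unsqueeze(0), dim=-1)   # (T, P)
    local = local.mean(dim=1)                                             # (T,)

    # Global cue: correlation of each frame with the mean appearance.
    center = global_feats.mean(dim=0)                                     # (D,)
    glob = F.cosine_similarity(global_feats, center.unsqueeze(0), dim=-1) # (T,)

    # Mutual reinforcement: frames scoring high on both cues get high weight.
    return torch.softmax(local * glob, dim=0)                             # (T,)
```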
arXiv Detail & Related papers (2021-10-22T19:07:39Z) - Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process.
Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level, sequentially.
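The three-step scheme can be written down compactly. The sketch below is a hypothetical reading of the alignment order (fragment-level similarity first, then the two context-level terms), not SHAN's actual scoring function:

```python
import torch
import torch.nn.functional as F

def hierarchical_match_score(region_feats, word_feats, img_global, txt_global):
    """Hypothetical step-wise hierarchical alignment score.

    region_feats: (R, D) image fragments; word_feats: (M, D) text fragments
    img_global, txt_global: (D,) context-level representations
    """
    r = F.normalize(region_feats, dim=-1)
    w = F.normalize(word_feats, dim=-1)

    # Step 1: local-to-local alignment at the fragment level.
    local = (r @ w.t()).max(dim=1).values.mean()

    # Step 2: global-to-local alignment at the context level.
    g2l = (r @ F.normalize(txt_global, dim=0)).mean()

    # Step 3: global-to-global alignment.
    g2g = F.cosine_similarity(img_global, txt_global, dim=0)
    return local + g2l + g2g
```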
arXiv Detail & Related papers (2021-06-11T17:05:56Z) - Context-aware Biaffine Localizing Network for Temporal Sentence
Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
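A biaffine score map over all (start, end) index pairs can be sketched as follows; the module name, projections, and masking are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class BiaffineBoundaryScorer(nn.Module):
    """Hypothetical biaffine scoring of every (start, end) pair."""

    def __init__(self, dim):
        super().__init__()
        self.start_proj = nn.Linear(dim, dim)
        self.end_proj = nn.Linear(dim, dim)
        self.bilinear = nn.Parameter(torch.randn(dim, dim) * 0.02)

    def forward(self, clip_feats):
        # clip_feats: (T, D) query-conditioned clip features
        s = self.start_proj(clip_feats)        # (T, D) start representations
        e = self.end_proj(clip_feats)          # (T, D) end representations
        # score[i, j] rates the segment that starts at i and ends at j.
        scores = s @ self.bilinear @ e.t()     # (T, T)
        # Keep only valid segments with start <= end.
        mask = torch.triu(torch.ones_like(scores)).bool()
        return scores.masked_fill(~mask, float("-inf"))
```

Taking the argmax over the resulting (T, T) map yields the predicted start and end indices in a single pass, which is the point of scoring all pairs simultaneously.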
arXiv Detail & Related papers (2021-03-22T03:13:05Z) - Watching You: Global-guided Reciprocal Learning for Video-based Person
Re-identification [82.6971648465279]
We propose a novel Global-guided Reciprocal Learning framework for video-based person Re-ID.
Our approach can achieve better performance than other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-07T12:27:42Z) - Semantic Grouping Network for Video Captioning [11.777063873936598]
The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption.
The continuous feedback from decoded words enables the SGN to dynamically update the video representation that adapts to the partially decoded caption.
The SGN achieves state-of-the-art performance, outperforming the runner-up methods by margins of 2.1 and 2.4 percentage points in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively.
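The feedback loop can be pictured with a small sketch. The pooling below is a hypothetical stand-in for SGN's learned grouping, meant only to show how decoded words can re-weight the video representation:

```python
import torch

def caption_adapted_context(video_feats, word_feats):
    """Hypothetical update of the video context from decoded words.

    video_feats: (N, D) frame/clip features
    word_feats:  (M, D) embeddings of the words decoded so far
    """
    # Score each video feature against every decoded word, keep the best match.
    sim = video_feats @ word_feats.t()                       # (N, M)
    weights = torch.softmax(sim.max(dim=1).values, dim=0)    # (N,)
    # Video representation adapted to the partially decoded caption.
    return (weights.unsqueeze(-1) * video_feats).sum(dim=0)  # (D,)
```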
arXiv Detail & Related papers (2021-02-01T13:40:56Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
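A minimal sketch of bilateral iterative attention is given below; the two-step update and the number of iterations are assumptions for illustration, not FIAN's exact module:

```python
import torch

def iterative_bilateral_attention(query_feats, video_feats, steps=2):
    """Hypothetical bilateral query-video information extraction.

    query_feats: (M, D) sentence-word features
    video_feats: (N, D) video-clip features
    """
    q, v = query_feats, video_feats
    for _ in range(steps):
        # Video attends to the query, then the query attends back.
        a_vq = torch.softmax(v @ q.t(), dim=-1)   # (N, M)
        v = v + a_vq @ q                          # query-aware video features
        a_qv = torch.softmax(q @ v.t(), dim=-1)   # (M, N)
        q = q + a_qv @ v                          # video-aware query features
    return q, v
```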
arXiv Detail & Related papers (2020-08-06T04:09:03Z)