Global2Local: A Joint-Hierarchical Attention for Video Captioning
- URL: http://arxiv.org/abs/2203.06663v1
- Date: Sun, 13 Mar 2022 14:31:54 GMT
- Title: Global2Local: A Joint-Hierarchical Attention for Video Captioning
- Authors: Chengpeng Dai, Fuhai Chen, Xiaoshuai Sun, Rongrong Ji, Qixiang Ye,
Yongjian Wu
- Abstract summary: We propose a novel joint-hierarchical attention model for video captioning, which embeds the key clips, the key frames and the key regions jointly into the captioning model.
Such a joint-hierarchical attention model first conducts a global selection to identify key frames, followed by a Gumbel sampling operation to further identify key regions based on the key frames.
- Score: 123.12188554567079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, automatic video captioning has attracted increasing attention,
where the core challenge lies in capturing the key semantic items, like objects
and actions as well as their spatial-temporal correlations from the redundant
frames and semantic content. To this end, existing works select either key
video clips at the global level (across multiple frames) or key regions within
each frame, which, however, neglect the hierarchical order, i.e., key frames
first and key regions later. In this paper, we propose a novel joint-hierarchical
attention model for video captioning, which embeds the key clips, the key
frames and the key regions jointly into the captioning model in a hierarchical
manner. Such a joint-hierarchical attention model first conducts a global
selection to identify key frames, followed by a Gumbel sampling operation to
further identify key regions based on the key frames, achieving an accurate
global-to-local feature representation to guide the captioning. Extensive
quantitative evaluations on two public benchmark datasets, MSVD and MSR-VTT,
demonstrate the superiority of the proposed method over the state-of-the-art
methods.
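To make the global-to-local pipeline concrete, below is a minimal sketch (not the authors' released code) of how frame-level scores could pick key frames and a Gumbel-Softmax draw could then pick key regions inside them; all module names, tensor shapes, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of a global-to-local selection step: global frame scoring
# followed by differentiable (Gumbel-Softmax) region selection per key frame.
# Shapes, module names, and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Global2LocalSketch(nn.Module):
    def __init__(self, dim=512, num_key_frames=8, tau=1.0):
        super().__init__()
        self.frame_scorer = nn.Linear(dim, 1)    # global level: score each frame
        self.region_scorer = nn.Linear(dim, 1)   # local level: score each region
        self.k = num_key_frames
        self.tau = tau                           # Gumbel-Softmax temperature

    def forward(self, frame_feats, region_feats):
        # frame_feats:  (B, T, D)     per-frame features
        # region_feats: (B, T, R, D)  per-frame region features
        frame_scores = self.frame_scorer(frame_feats).squeeze(-1)           # (B, T)
        key_idx = frame_scores.topk(self.k, dim=1).indices                  # (B, k) key frames
        gather_idx = key_idx.unsqueeze(-1).unsqueeze(-1).expand(
            -1, -1, region_feats.size(2), region_feats.size(3))
        key_regions = region_feats.gather(1, gather_idx)                    # (B, k, R, D)

        # Gumbel-Softmax gives a (nearly) hard region pick that stays differentiable
        region_logits = self.region_scorer(key_regions).squeeze(-1)         # (B, k, R)
        one_hot = F.gumbel_softmax(region_logits, tau=self.tau, hard=True)  # (B, k, R)
        key_region_feats = (one_hot.unsqueeze(-1) * key_regions).sum(dim=2) # (B, k, D)
        return key_region_feats  # global-to-local features for the caption decoder
```

The Gumbel-Softmax trick is what keeps the hard region choice trainable end-to-end, which matches the role the abstract assigns to the Gumbel sampling operation.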
Related papers
- The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for Video Reasoning rely heavily on a single special token to represent the object in the video.
We propose VRS-HQ, an end-to-end video reasoning segmentation approach.
Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z) - Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
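A hedged sketch of how such classifier guidance could steer decoding is given below; `lm`, `classifier`, `tokenizer`, the top-k shortlist, and the mixing weight `alpha` are hypothetical stand-ins rather than the paper's actual interfaces.

```python
# Hedged sketch: a frozen captioning LM proposes candidate next tokens and a
# text classifier re-scores them. `lm`, `classifier`, and `tokenizer` are
# hypothetical callables, not the paper's actual components.
import torch
import torch.nn.functional as F

@torch.no_grad()
def guided_decoding_step(lm, classifier, tokenizer, prefix_ids, alpha=1.0, top_k=20):
    # lm(prefix_ids) is assumed to return next-token logits of shape (1, L, V)
    logits = lm(prefix_ids)[:, -1, :]                      # (1, V)
    cand = logits.topk(top_k, dim=-1).indices[0]           # (top_k,) candidate token ids

    guide = []
    for tok in cand:
        text = tokenizer.decode(torch.cat([prefix_ids[0], tok.view(1)]))
        guide.append(classifier(text))                     # e.g. an audibility score in (0, 1)
    guide = torch.log(torch.tensor(guide) + 1e-8)

    mixed = F.log_softmax(logits[0, cand], dim=-1) + alpha * guide
    return cand[mixed.argmax()]                            # guided greedy token choice
```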
arXiv Detail & Related papers (2025-01-03T18:09:26Z) - DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
Our approach, in particular, allows for a more efficient end-to-end process as a single-stage method.
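As a loose illustration only (DALNet's actual architecture is more involved), dense image-text alignment can be pictured as a similarity map between a class text embedding and dense visual features:

```python
# Illustrative only: a coarse localization map from dense image-text alignment.
# The encoders producing `dense_feats` and `text_emb` are assumed, not DALNet's.
import torch
import torch.nn.functional as F

def dense_alignment_map(dense_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # dense_feats: (D, H, W) pixel/patch features; text_emb: (D,) class text embedding
    d, h, w = dense_feats.shape
    feats = F.normalize(dense_feats.view(d, -1), dim=0)   # (D, H*W), unit-norm per location
    text = F.normalize(text_emb, dim=0)                   # (D,)
    sim = (text @ feats).view(h, w)                       # (H, W) cosine-similarity map
    return sim.clamp(min=0)                               # keep positive responses as seeds
```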
arXiv Detail & Related papers (2024-09-24T06:51:49Z) - A Challenging Multimodal Video Summary: Simultaneously Extracting and
Generating Keyframe-Caption Pairs from Video [20.579167394855197]
This paper proposes a practical multimodal video summarization task setting and dataset to train and evaluate the task.
The target task involves summarizing a given video into a number of keyframe-caption pairs and displaying them in a listable format to grasp the video content quickly.
This task is useful as a practical application and presents a highly challenging problem worthy of study.
arXiv Detail & Related papers (2023-12-04T02:17:14Z) - Local-Global Associative Frame Assemble in Video Re-ID [57.7470971197962]
Noisy and unrepresentative frames in automatically generated object bounding boxes from video sequences cause challenges in learning discriminative representations in video re-identification (Re-ID).
Most existing methods tackle this problem by assessing the importance of video frames according to either their local part alignments or global appearance correlations separately.
In this work, we explore jointly both local alignments and global correlations with further consideration of their mutual promotion/reinforcement.
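A toy version of such joint weighting is sketched below: each frame's weight depends on both its local part alignment and its global appearance correlation with the sequence; the prototypes and the fusion rule here are assumptions, not the paper's method.

```python
# Toy sketch: weight frames by local part alignment AND global appearance
# correlation jointly. The prototypes and the product fusion are assumptions.
import torch
import torch.nn.functional as F

def joint_frame_weights(part_feats: torch.Tensor, global_feats: torch.Tensor) -> torch.Tensor:
    # part_feats:   (T, P, D) per-frame part features
    # global_feats: (T, D)    per-frame global features
    part_proto = part_feats.mean(dim=0, keepdim=True)                        # (1, P, D) sequence-level parts
    local = F.cosine_similarity(part_feats, part_proto, dim=-1).mean(-1)     # (T,) local alignment
    global_proto = global_feats.mean(dim=0, keepdim=True)                    # (1, D)
    glob = F.cosine_similarity(global_feats, global_proto, dim=-1)           # (T,) global correlation
    # a frame is kept only if it agrees both locally and globally
    return torch.softmax(local * glob, dim=0)                                # (T,) frame weights
```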
arXiv Detail & Related papers (2021-10-22T19:07:39Z) - Context-aware Biaffine Localizing Network for Temporal Sentence
Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
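For intuition, a biaffine span scorer can be sketched as below: start and end representations of every clip are combined through a bilinear map so that all (start, end) pairs are scored at once, with invalid spans masked; dimensions and the MLP heads are assumptions, not the paper's exact design.

```python
# Hedged sketch of a biaffine span scorer: one bilinear operation scores every
# (start, end) index pair; spans with start > end are masked out.
import torch
import torch.nn as nn

class BiaffineSpanScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.start_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.end_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, clip_feats):
        # clip_feats: (B, T, D) features of T video clips
        s = self.start_mlp(clip_feats)                           # (B, T, D) start representations
        e = self.end_mlp(clip_feats)                             # (B, T, D) end representations
        scores = torch.einsum("bid,de,bje->bij", s, self.W, e)   # (B, T, T) pairwise span scores
        valid = torch.triu(torch.ones_like(scores[0], dtype=torch.bool))  # keep start <= end
        return scores.masked_fill(~valid, float("-inf"))
```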
arXiv Detail & Related papers (2021-03-22T03:13:05Z) - Fine-grained Iterative Attention Network for Temporal Language
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
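Below is a hedged sketch of iterative bilateral attention between sentence and video features, in the spirit of the module summarized above; head counts, dimensions, and the residual form are assumptions.

```python
# Hedged sketch of iterative bilateral query-video attention: the two streams
# repeatedly attend to each other. Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class BilateralIterativeAttention(nn.Module):
    def __init__(self, dim=256, heads=4, iters=3):
        super().__init__()
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)  # video attends to query
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # query attends to video
        self.iters = iters

    def forward(self, video, query):
        # video: (B, T, D) clip features; query: (B, L, D) word features
        for _ in range(self.iters):
            video = video + self.v2q(video, query, query)[0]   # video enriched by the sentence
            query = query + self.q2v(query, video, video)[0]   # sentence enriched by the video
        return video, query
```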
arXiv Detail & Related papers (2020-08-06T04:09:03Z)