Hierarchical Modular Network for Video Captioning
- URL: http://arxiv.org/abs/2111.12476v2
- Date: Thu, 25 Nov 2021 01:41:35 GMT
- Title: Hierarchical Modular Network for Video Captioning
- Authors: Hanhua Ye, Guorong Li, Yuankai Qi, Shuhui Wang, Qingming Huang,
Ming-Hsuan Yang
- Abstract summary: We propose a hierarchical modular network to bridge video representations and linguistic semantics from three levels before generating captions.
The proposed method performs favorably against state-of-the-art models on two widely used benchmarks, achieving CIDEr scores of 104.0% on MSVD and 51.5% on MSR-VTT.
- Score: 162.70349114104107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning aims to generate natural language descriptions according to
the content, where representation learning plays a crucial role. Existing
methods are mainly developed within the supervised learning framework via
word-by-word comparison of the generated caption against the ground-truth text
without fully exploiting linguistic semantics. In this work, we propose a
hierarchical modular network to bridge video representations and linguistic
semantics from three levels before generating captions. In particular, the
hierarchy is composed of: (I) Entity level, which highlights objects that are
most likely to be mentioned in captions. (II) Predicate level, which learns the
actions conditioned on highlighted objects and is supervised by the predicate
in captions. (III) Sentence level, which learns the global semantic
representation and is supervised by the whole caption. Each level is
implemented by one module. Extensive experimental results show that the
proposed method performs favorably against state-of-the-art models on two
widely used benchmarks, achieving CIDEr scores of 104.0% on MSVD and 51.5% on MSR-VTT.
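A minimal sketch of how such a three-level hierarchy could be wired together is given below; the module names, feature dimensions, pooling choices, and the LSTM decoder interface are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn


class HierarchicalModularNetwork(nn.Module):
    """Illustrative sketch: entity-, predicate-, and sentence-level modules
    produce representations that jointly condition a caption decoder."""

    def __init__(self, vid_dim=2048, obj_dim=2048, hid_dim=512, vocab_size=10000):
        super().__init__()
        # (I) Entity level: scores detected objects so those most likely to be
        #     mentioned in the caption are highlighted.
        self.entity_scorer = nn.Sequential(
            nn.Linear(obj_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1))
        self.entity_proj = nn.Linear(obj_dim, hid_dim)
        # (II) Predicate level: action representation conditioned on the
        #      highlighted objects (supervised by the caption's predicate).
        self.predicate_module = nn.Linear(vid_dim + hid_dim, hid_dim)
        # (III) Sentence level: global semantics (supervised by the whole caption).
        self.sentence_module = nn.Linear(vid_dim, hid_dim)
        # Caption decoder conditioned on all three levels at every step.
        self.decoder = nn.LSTM(hid_dim * 3, hid_dim, batch_first=True)
        self.word_head = nn.Linear(hid_dim, vocab_size)

    def forward(self, video_feats, object_feats, max_len=20):
        # video_feats: (B, T, vid_dim) frame features; object_feats: (B, N, obj_dim)
        weights = self.entity_scorer(object_feats).softmax(dim=1)        # (B, N, 1)
        entity = self.entity_proj((weights * object_feats).sum(dim=1))   # (B, hid_dim)
        vid_global = video_feats.mean(dim=1)                             # (B, vid_dim)
        predicate = self.predicate_module(
            torch.cat([vid_global, entity], dim=-1))                     # (B, hid_dim)
        sentence = self.sentence_module(vid_global)                      # (B, hid_dim)
        # Feed the fused three-level context at every decoding step.
        ctx = torch.cat([entity, predicate, sentence], dim=-1)           # (B, 3*hid_dim)
        steps = ctx.unsqueeze(1).repeat(1, max_len, 1)                   # (B, L, 3*hid_dim)
        hidden, _ = self.decoder(steps)
        return self.word_head(hidden)                                    # (B, L, vocab)
```

In the paper each level also carries its own linguistic supervision (entity, predicate, and sentence objectives), which this sketch omits; only the forward wiring of the three modules into the decoder is shown.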
Related papers
- Towards the Next Frontier in Speech Representation Learning Using Disentanglement [34.21745744502759]
We propose a framework for learning disentangled self-supervised representations of speech (termed Learn2Diss), which consists of a frame-level and an utterance-level encoder module.
We show that the proposed Learn2Diss achieves state-of-the-art results on a variety of tasks, with the frame-level encoder representations improving semantic tasks, while the utterance-level representations improve non-semantic tasks.
arXiv Detail & Related papers (2024-07-02T07:13:35Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Unifying Latent and Lexicon Representations for Effective Video-Text Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
arXiv Detail & Related papers (2024-02-26T17:36:50Z)
- Boosting Video-Text Retrieval with Explicit High-Level Semantics [115.66219386097295]
We propose a novel visual-linguistic aligning model named HiSE for VTR.
It improves the cross-modal representation by incorporating explicit high-level semantics.
Our method achieves superior performance over state-of-the-art methods on three benchmark datasets.
arXiv Detail & Related papers (2022-08-08T15:39:54Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z)
- O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning [41.14313691818424]
We propose an Object-Oriented Non-Autoregressive approach (O2NA) for video captioning.
O2NA performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption to a fluent final caption.
Experiments on two benchmark datasets, MSR-VTT and MSVD, demonstrate the effectiveness of O2NA.
arXiv Detail & Related papers (2021-08-05T04:17:20Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [24.657722909094662]
We present the first model that directly generates fluent, natural-sounding spoken audio captions for images.
We connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units.
We conduct experiments on the Flickr8k spoken caption dataset and a novel corpus of spoken audio captions collected for the popular MSCOCO dataset.
arXiv Detail & Related papers (2020-12-31T05:28:38Z)