Controllable Video Captioning with an Exemplar Sentence
- URL: http://arxiv.org/abs/2112.01073v1
- Date: Thu, 2 Dec 2021 09:24:45 GMT
- Title: Controllable Video Captioning with an Exemplar Sentence
- Authors: Yitian Yuan, Lin Ma, Jingwen Wang, Wenwu Zhu
- Abstract summary: We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes video semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
- Score: 89.78812365216983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate a novel and challenging task, namely
controllable video captioning with an exemplar sentence. Formally, given a
video and a syntactically valid exemplar sentence, the task aims to generate
one caption which not only describes the semantic contents of the video, but
also follows the syntactic form of the given exemplar sentence. In order to
tackle such an exemplar-based video captioning task, we propose a novel Syntax
Modulated Caption Generator (SMCG) incorporated in an
encoder-decoder-reconstructor architecture. The proposed SMCG takes video
semantic representation as an input and conditionally modulates the gates and
cells of a long short-term memory network with respect to the encoded syntactic
information of the given exemplar sentence. Therefore, SMCG is able to control
the states for word prediction and achieve syntax-customized caption
generation. We conduct experiments by collecting auxiliary exemplar sentences
for two public video captioning datasets. Extensive experimental results
demonstrate the effectiveness of our approach in generating syntax-controllable
and semantics-preserving video captions. By providing different exemplar
sentences, our approach is capable of producing different captions with various
syntactic structures, thus indicating a promising way to strengthen the
diversity of video captioning.
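To make the gate and cell modulation concrete, the sketch below shows one plausible way such conditioning could be wired up. It is a minimal illustration, not the authors' released SMCG: it assumes a PyTorch-style decoder step in which an encoding of the exemplar sentence's syntax (e.g. its POS or parse sequence) produces per-gate scale and shift parameters, and the class and parameter names (SyntaxModulatedLSTMCell, syntax_code, the dimensions) are hypothetical.

```python
# Minimal sketch (not the authors' exact SMCG): an LSTM step whose gates and
# cell update are modulated by an encoding of the exemplar sentence's syntax.
import torch
import torch.nn as nn


class SyntaxModulatedLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim, syntax_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        # Standard LSTM affine maps producing the 4 pre-activations (i, f, o, g).
        self.x2h = nn.Linear(input_dim, 4 * hidden_dim)
        self.h2h = nn.Linear(hidden_dim, 4 * hidden_dim)
        # Hypothetical modulation network: maps the syntax encoding to a
        # per-gate scale (gamma) and shift (beta), FiLM-style conditioning.
        self.to_gamma = nn.Linear(syntax_dim, 4 * hidden_dim)
        self.to_beta = nn.Linear(syntax_dim, 4 * hidden_dim)

    def forward(self, x, state, syntax_code):
        h, c = state
        pre = self.x2h(x) + self.h2h(h)                       # (B, 4H) pre-activations
        gamma = 1.0 + torch.tanh(self.to_gamma(syntax_code))  # scale kept around 1
        beta = torch.tanh(self.to_beta(syntax_code))          # shift in (-1, 1)
        pre = gamma * pre + beta                              # syntax-conditioned pre-activations
        i, f, o, g = pre.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g                                     # modulated cell update
        h = o * torch.tanh(c)                                 # hidden state drives word prediction
        return h, (h, c)


# Usage: video semantic feature as input, syntax encoding of the exemplar as condition.
cell = SyntaxModulatedLSTMCell(input_dim=512, hidden_dim=512, syntax_dim=256)
x = torch.randn(8, 512)                 # e.g. fused video/word feature at one decoding step
state = (torch.zeros(8, 512), torch.zeros(8, 512))
syntax_code = torch.randn(8, 256)       # e.g. encoding of the exemplar sentence's POS/parse sequence
h, state = cell(x, state, syntax_code)
```

The FiLM-style scale-and-shift on the LSTM pre-activations is only one reading of "modulating the gates and cells"; the paper's exact parameterization may differ. The key point the sketch conveys is that the same video input yields different decoder state trajectories, and hence different word predictions, for different exemplar syntaxes.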
Related papers
- Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning [42.0725330677271]
We propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module.
Experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios.
arXiv Detail & Related papers (2024-11-06T17:11:44Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations: the dataset fails to provide contextual information between potential events and query sentences.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions, and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and ...
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
- Controllable Image Captioning [0.0]
We introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics.
We propose a method to generate captions through a Transformer network, which predicts words based on the input Part-Of-Speech tag sequences.
arXiv Detail & Related papers (2022-04-28T07:47:49Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate content of the video.
arXiv Detail & Related papers (2021-03-09T08:17:17Z)
- Guidance Module Network for Video Captioning [19.84617164810336]
We find that the normalization of extracted video features can improve the final performance of video captioning.
In this paper, we present a novel architecture which introduces a guidance module to encourage the encoder-decoder model to generate words related to the past and future words in a caption.
arXiv Detail & Related papers (2020-12-20T14:02:28Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)