Guidance Module Network for Video Captioning
- URL: http://arxiv.org/abs/2012.10930v1
- Date: Sun, 20 Dec 2020 14:02:28 GMT
- Title: Guidance Module Network for Video Captioning
- Authors: Xiao Zhang, Chunsheng Liu, Faliang Chang
- Abstract summary: We find that the normalization of extracted video features can improve the final performance of video captioning.
In this paper, we present a novel architecture which introduces a guidance module to encourage the encoder-decoder model to generate words related to the past and future words in a caption.
- Score: 19.84617164810336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning has been a challenging and significant task that describes
the content of a video clip in a single sentence. The model of video captioning
is usually an encoder-decoder. We find that the normalization of extracted
video features can improve the final performance of video captioning.
The encoder-decoder model is usually trained with a teacher-forcing strategy, which
pushes the prediction probability of each word toward a 0-1 distribution and
ignores the other words. In this paper, we present a novel architecture that
introduces a guidance module to encourage the encoder-decoder model to generate
words related to the past and future words in a caption. Based on the
normalization and the guidance module, the Guidance Module Network (GMNet) is built.
Experimental results on the commonly used MSVD dataset show that the proposed GMNet
improves the performance of the encoder-decoder model on video captioning tasks.
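Below is a minimal PyTorch sketch of the two ideas highlighted in the abstract: L2-normalizing the extracted video features, and replacing the hard 0-1 training target with a guidance-style soft target that keeps some probability mass on the other (past and future) words of the caption. The function names, the value of `alpha`, and the exact form of the soft target are assumptions for illustration; the paper's actual guidance module may be formulated differently.
```python
# Illustrative sketch (not the authors' code): feature normalization plus a
# guidance-style soft target for an encoder-decoder captioner.
import torch
import torch.nn.functional as F

def normalize_features(video_feats: torch.Tensor) -> torch.Tensor:
    """L2-normalize per-frame features, shape (batch, frames, dim)."""
    return F.normalize(video_feats, p=2, dim=-1)

def guidance_targets(caption_ids: torch.Tensor, vocab_size: int,
                     alpha: float = 0.8) -> torch.Tensor:
    """Soft targets: keep `alpha` on the ground-truth word and spread the
    rest over the other (past and future) words of the same caption.
    This particular construction is an assumption for illustration."""
    batch, length = caption_ids.shape
    targets = torch.zeros(batch, length, vocab_size)
    for b in range(batch):
        for t in range(length):
            gt = caption_ids[b, t]
            targets[b, t, gt] = alpha
            others = torch.unique(caption_ids[b][caption_ids[b] != gt])
            if len(others) > 0:
                targets[b, t, others] = (1.0 - alpha) / len(others)
    return targets

def guided_loss(logits: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the decoder logits against the soft guidance targets."""
    soft = guidance_targets(caption_ids, vocab_size=logits.size(-1))
    return -(soft * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```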
Related papers
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and predict rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z)
- Attention Based Encoder Decoder Model for Video Captioning in Nepali (2023) [0.0]
This work develops a novel encoder-decoder paradigm for Nepali video captioning to tackle this difficulty.
LSTM and GRU sequence-to-sequence models are used to produce related textual descriptions from features extracted from video frames with CNNs.
The effectiveness of the model for Devanagari-scripted video captioning is assessed with the BLEU, METEOR, and ROUGE measures.
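As a small illustration of how such captions are typically scored, the sketch below computes corpus-level BLEU with NLTK; METEOR and ROUGE require additional packages and are omitted. The example captions are made up for illustration.
```python
# Corpus-level BLEU scoring of generated captions against references
# (illustrative data only).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    [["a", "man", "is", "playing", "a", "guitar"]],   # reference captions per video
    [["a", "dog", "runs", "on", "the", "grass"]],
]
hypotheses = [
    ["a", "man", "plays", "a", "guitar"],             # generated captions
    ["a", "dog", "is", "running", "on", "grass"],
]

smooth = SmoothingFunction().method1
bleu4 = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU-4: {bleu4:.3f}")
```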
arXiv Detail & Related papers (2023-12-12T16:39:12Z)
- Encoder-Decoder Based Long Short-Term Memory (LSTM) Model for Video Captioning [0.0]
This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions.
The many-to-many mapping occurs via an input temporal sequence of video frames to an output sequence of words to form a caption sentence.
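A compact PyTorch sketch of this many-to-many mapping is given below: an LSTM encoder consumes a sequence of frame features and an LSTM decoder unrolls over word embeddings to emit a caption. Layer sizes and names are placeholders, not the paper's configuration.
```python
# Minimal LSTM encoder-decoder for video captioning (illustrative sizes).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000, emb=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, emb)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, caption_in):
        # frame_feats: (batch, num_frames, feat_dim); caption_in: (batch, seq_len)
        _, state = self.encoder(frame_feats)          # summarize the frame sequence
        dec_out, _ = self.decoder(self.embed(caption_in), state)
        return self.out(dec_out)                      # (batch, seq_len, vocab_size)

# Usage with teacher forcing: feed the ground-truth previous words as caption_in.
model = VideoCaptioner()
logits = model(torch.randn(2, 20, 2048), torch.randint(0, 10000, (2, 12)))
```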
arXiv Detail & Related papers (2023-10-02T02:32:26Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
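The sketch below illustrates the prompt-composition step described above: frame captions and object/attribute/event phrases are formatted, together with a few in-context examples, into a single text prompt for a language model. The field names and template are assumptions; VidIL's actual templates differ in detail, and the call to the language model is left abstract.
```python
# Compose frame-level visual tokens into a few-shot prompt for a language model.
# Field names and the template are illustrative, not VidIL's exact format.
from typing import Dict, List

def build_prompt(in_context_examples: List[Dict], frame_captions: List[str],
                 objects: List[str], attributes: List[str], events: List[str]) -> str:
    parts = []
    for ex in in_context_examples:                        # few-shot demonstrations
        parts.append("Frames: " + " | ".join(ex["frame_captions"]))
        parts.append("Objects: " + ", ".join(ex["objects"]))
        parts.append("Video caption: " + ex["caption"])
        parts.append("")
    parts.append("Frames: " + " | ".join(frame_captions))  # the query video
    parts.append("Objects: " + ", ".join(objects))
    parts.append("Attributes: " + ", ".join(attributes))
    parts.append("Events: " + ", ".join(events))
    parts.append("Video caption:")                        # the model completes this line
    return "\n".join(parts)
```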
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes the video semantic representation as input and conditionally modulates the gates and cells of the long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
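The sketch below shows one way such conditional gate modulation could look: a vector derived from the exemplar sentence scales the input, forget, and output gates of a single LSTM step. The summary above does not specify the exact modulation, so this formulation is an assumption for illustration only.
```python
# One LSTM step whose gates are modulated by an exemplar-derived vector.
# The modulation (sigmoid scaling of i/f/o gates) is an illustrative assumption.
import torch
import torch.nn as nn

class ModulatedLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim, exemplar_dim):
        super().__init__()
        self.gates = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)
        self.mod = nn.Linear(exemplar_dim, 3 * hidden_dim)   # scales for i, f, o

    def forward(self, x, h, c, exemplar):
        i, f, g, o = self.gates(torch.cat([x, h], dim=-1)).chunk(4, dim=-1)
        mi, mf, mo = torch.sigmoid(self.mod(exemplar)).chunk(3, dim=-1)
        i = torch.sigmoid(i) * mi        # exemplar scales the input gate
        f = torch.sigmoid(f) * mf        # ... the forget gate
        o = torch.sigmoid(o) * mo        # ... the output gate
        c = f * c + i * torch.tanh(g)    # standard LSTM cell update
        h = o * torch.tanh(c)
        return h, c
```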
arXiv Detail & Related papers (2021-12-02T09:24:45Z)
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework combines conventional retrieval-based methods with standard encoder-decoder methods, so it can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate descriptions of the video.
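The sketch below illustrates the retrieval step described above: a video embedding is compared against precomputed sentence embeddings from the training corpus, and the top-k most similar sentences are returned as hints for the generator. The embedding inputs are placeholders assumed to come from pretrained encoders; the copy-and-generate decoder is not shown.
```python
# Retrieve top-k corpus sentences as hints for caption generation.
# `video_emb` and `sentence_embs` are assumed outputs of pretrained encoders.
import torch
import torch.nn.functional as F

def retrieve_hints(video_emb: torch.Tensor, sentence_embs: torch.Tensor,
                   corpus: list, k: int = 5) -> list:
    """video_emb: (dim,); sentence_embs: (num_sentences, dim); corpus: list of str."""
    sims = F.cosine_similarity(video_emb.unsqueeze(0), sentence_embs, dim=-1)
    topk = torch.topk(sims, k=min(k, len(corpus))).indices
    return [corpus[i] for i in topk.tolist()]
```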
arXiv Detail & Related papers (2021-03-09T08:17:17Z)
- Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training [112.91603911837436]
The Auto-captions on GIF dataset is a new large-scale pre-training dataset for generic video understanding.
All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages.
arXiv Detail & Related papers (2020-07-05T16:11:57Z)