Open-book Video Captioning with Retrieve-Copy-Generate Network
- URL: http://arxiv.org/abs/2103.05284v1
- Date: Tue, 9 Mar 2021 08:17:17 GMT
- Title: Open-book Video Captioning with Retrieve-Copy-Generate Network
- Authors: Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Ying Deng,
Weiming Hu
- Abstract summary: In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to effectively retrieve sentences from the training corpus as hints.
Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate descriptions of the video content.
- Score: 42.374461018847114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the rapid emergence of short videos and the requirement for content
understanding and creation, the video captioning task has received increasing
attention in recent years. In this paper, we convert the traditional video
captioning task into a new paradigm, i.e., Open-book Video Captioning, which
generates natural language under the prompts of video-content-relevant
sentences, not limited to the video itself. To address the open-book video
captioning problem, we propose a novel Retrieve-Copy-Generate network, where a
pluggable video-to-text retriever is constructed to effectively retrieve
sentences from the training corpus as hints, and a copy-mechanism generator is
introduced to dynamically extract expressions from the multiple retrieved sentences.
The two modules can be trained end-to-end or separately, making the approach
flexible and extensible. Our framework coordinates the conventional retrieval-based methods
with orthodox encoder-decoder methods, which can not only draw on the diverse
expressions in the retrieved sentences but also generate natural and accurate
descriptions of the video content. Extensive experiments on several benchmark datasets show
that our proposed approach surpasses the state-of-the-art performance,
indicating the effectiveness and promise of the proposed paradigm in the task
of video captioning.
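For illustration only, below is a minimal PyTorch sketch of the two components described in the abstract: a dual-encoder video-to-text retriever that scores training-corpus sentences against pooled video features, and a pointer-generator-style decoding step that mixes copying tokens from the retrieved sentences with generating from the vocabulary. All class names (VideoTextRetriever, CopyMechanismDecoder), layer sizes, and feature dimensions are hypothetical assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextRetriever(nn.Module):
    """Sketch of a pluggable video-to-text retriever (dual encoder, assumed dims)."""
    def __init__(self, video_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, video_feats, corpus_feats):
        # video_feats: (B, video_dim) pooled clip features
        # corpus_feats: (N, text_dim) pooled sentence features of the training corpus
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(corpus_feats), dim=-1)
        return v @ t.t()  # (B, N) cosine similarities used to rank hint sentences

class CopyMechanismDecoder(nn.Module):
    """One decoding step that mixes generating from the vocabulary with copying
    words out of the retrieved sentences (pointer-generator-style sketch)."""
    def __init__(self, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)
        self.copy_gate = nn.Linear(hidden_dim, 1)

    def forward(self, dec_state, retrieved_hidden, retrieved_ids):
        # dec_state: (B, H) decoder hidden state at the current step
        # retrieved_hidden: (B, L, H) encoded tokens of the multi-retrieved sentences
        # retrieved_ids: (B, L) vocabulary ids of those tokens
        p_vocab = F.softmax(self.vocab_head(dec_state), dim=-1)              # (B, V)
        attn = F.softmax(
            torch.bmm(retrieved_hidden, dec_state.unsqueeze(-1)).squeeze(-1), dim=-1
        )                                                                    # (B, L)
        p_copy = torch.zeros_like(p_vocab).scatter_add_(1, retrieved_ids, attn)  # (B, V)
        g = torch.sigmoid(self.copy_gate(dec_state))                         # (B, 1)
        return g * p_copy + (1 - g) * p_vocab                                # word distribution

# Toy usage with random tensors
retriever = VideoTextRetriever()
decoder = CopyMechanismDecoder()
sims = retriever(torch.randn(2, 2048), torch.randn(1000, 768))
hint_ids = sims.topk(k=5, dim=-1).indices            # indices of retrieved hint sentences
p_word = decoder(torch.randn(2, 512), torch.randn(2, 40, 512),
                 torch.randint(0, 10000, (2, 40)))
```

In this reading, the retriever's top-k sentences serve as the "hints", and the copy gate decides at each step whether to reuse an expression from those hints or to generate a fresh word; the paper's actual encoders, attention, and gating details may differ.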
Related papers
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes video semantic representation as an input, and conditionally modulates the gates and cells of long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
arXiv Detail & Related papers (2021-12-02T09:24:45Z)
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
- Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)