Syntax Customized Video Captioning by Imitating Exemplar Sentences
- URL: http://arxiv.org/abs/2112.01062v1
- Date: Thu, 2 Dec 2021 09:08:09 GMT
- Title: Syntax Customized Video Captioning by Imitating Exemplar Sentences
- Authors: Yitian Yuan, Lin Ma, Wenwu Zhu
- Abstract summary: We introduce a new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model's capability to generate syntax-varied and semantics-coherent video captions.
- Score: 90.98221715705435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enhancing the diversity of sentences to describe video contents is an
important problem arising in recent video captioning research. In this paper,
we explore this problem from a novel perspective of customizing video captions
by imitating exemplar sentence syntaxes. Specifically, given a video and any
syntax-valid exemplar sentence, we introduce a new task of Syntax Customized
Video Captioning (SCVC) aiming to generate one caption which not only
semantically describes the video contents but also syntactically imitates the
given exemplar sentence. To tackle the SCVC task, we propose a novel video
captioning model, in which a hierarchical sentence syntax encoder is first
designed to extract the syntactic structure of the exemplar sentence, and a
syntax-conditioned caption decoder is then devised to generate a syntactically
structured caption expressing the video semantics. As no syntax-customized
ground-truth video captions are available, we tackle this challenge with a new
training strategy that leverages traditional pairwise video captioning data and
our collected exemplar sentences to accomplish the model learning. Extensive
experiments, covering semantic, syntactic, fluency, and diversity evaluations,
clearly demonstrate our model's capability to generate syntax-varied and
semantics-coherent video captions that closely imitate different exemplar
sentences with enriched diversity.
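For intuition, here is a minimal PyTorch-style sketch of the two-component design the abstract describes: an encoder that summarizes the exemplar sentence and a decoder conditioned on both the video semantics and that syntax code. All class names, dimensions, and the flat GRU encoding are illustrative assumptions; the paper's hierarchical syntax encoder and its training strategy are not reproduced here.

```python
# Illustrative sketch only: simplified stand-ins for a syntax encoder and a
# syntax-conditioned caption decoder. Names and sizes are hypothetical.
import torch
import torch.nn as nn

class SyntaxEncoder(nn.Module):
    """Encodes an exemplar sentence (token ids) into a syntax summary vector."""
    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, exemplar_ids):
        emb = self.embed(exemplar_ids)            # (B, T, E)
        outputs, last = self.gru(emb)             # per-token states + summary
        return outputs, last.squeeze(0)           # (B, T, H), (B, H)

class SyntaxConditionedDecoder(nn.Module):
    """Produces caption logits conditioned on video features and the syntax code."""
    def __init__(self, vocab_size=10000, video_dim=2048, hidden_dim=512, embed_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(video_dim + hidden_dim, hidden_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feat, syntax_code, caption_ids):
        # Initialize the decoder state from video semantics + exemplar syntax.
        h0 = torch.tanh(self.init_h(torch.cat([video_feat, syntax_code], dim=-1)))
        emb = self.embed(caption_ids)
        outputs, _ = self.gru(emb, h0.unsqueeze(0))
        return self.out(outputs)                  # (B, T, vocab)

# Toy forward pass with random tensors in place of real features.
enc, dec = SyntaxEncoder(), SyntaxConditionedDecoder()
exemplar = torch.randint(0, 10000, (2, 12))       # exemplar sentence token ids
video = torch.randn(2, 2048)                      # pooled video feature
caption_in = torch.randint(0, 10000, (2, 15))     # teacher-forced caption tokens
_, syntax_code = enc(exemplar)
logits = dec(video, syntax_code, caption_in)
print(logits.shape)                               # torch.Size([2, 15, 10000])
```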
Related papers
- Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning [42.0725330677271]
We propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module.
Experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios.
arXiv Detail & Related papers (2024-11-06T17:11:44Z) - Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [63.01630478059315]
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance.
However, how synthetic captions interact with the original web-crawled AltTexts during pre-training is still not well understood.
We propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models.
arXiv Detail & Related papers (2024-10-03T17:54:52Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replacing entities or actions and flipping event order.
Our model sets a new state-of-the-art zero-shot performance on temporally-extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z) - Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z) - End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z) - Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes the video semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
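As a rough illustration of the gate-modulation idea mentioned above, the sketch below rescales the pre-activation gates of a hand-rolled LSTM cell with factors computed from an exemplar-derived condition vector. The module name, dimensions, and conditioning scheme are hypothetical simplifications, not the SMCG implementation.

```python
# Illustrative sketch: an LSTM cell whose gate pre-activations are rescaled
# by a condition vector (e.g., an exemplar-syntax feature). Not SMCG itself.
import torch
import torch.nn as nn

class SyntaxModulatedLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim, cond_dim):
        super().__init__()
        self.x2gates = nn.Linear(input_dim, 4 * hidden_dim)
        self.h2gates = nn.Linear(hidden_dim, 4 * hidden_dim)
        self.cond2mod = nn.Linear(cond_dim, 4 * hidden_dim)  # one factor per gate unit

    def forward(self, x, state, cond):
        h, c = state
        gates = self.x2gates(x) + self.h2gates(h)
        gates = gates * torch.sigmoid(self.cond2mod(cond))   # condition-dependent rescaling
        i, f, g, o = gates.chunk(4, dim=-1)
        c_new = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h_new = torch.sigmoid(o) * torch.tanh(c_new)
        return h_new, c_new

# Toy step: word embedding input, previous state, exemplar-syntax condition.
cell = SyntaxModulatedLSTMCell(input_dim=300, hidden_dim=512, cond_dim=512)
x, cond = torch.randn(2, 300), torch.randn(2, 512)
state = (torch.zeros(2, 512), torch.zeros(2, 512))
h, c = cell(x, state, cond)
print(h.shape, c.shape)   # torch.Size([2, 512]) torch.Size([2, 512])
```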
arXiv Detail & Related papers (2021-12-02T09:24:45Z) - Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate descriptions of the video content.
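To make the retrieval step concrete, here is a small illustrative snippet that fetches nearest-neighbour captions from a corpus by cosine similarity; the random embeddings stand in for learned video and text encoders, and nothing here reflects the paper's actual retriever or copy mechanism.

```python
# Illustrative sketch of retrieving caption "hints" for a video by similarity.
import torch
import torch.nn.functional as F

def retrieve_hints(video_emb, corpus_embs, corpus_texts, k=3):
    """Return the k corpus captions whose embeddings are closest to the video."""
    sims = F.cosine_similarity(video_emb.unsqueeze(0), corpus_embs, dim=-1)
    topk = sims.topk(k).indices.tolist()
    return [corpus_texts[i] for i in topk]

# Toy usage with random embeddings standing in for trained encoders.
corpus_texts = ["a man rides a horse", "a dog runs in a park", "two people cook dinner"]
corpus_embs = torch.randn(len(corpus_texts), 512)
video_emb = torch.randn(512)
print(retrieve_hints(video_emb, corpus_embs, corpus_texts, k=2))
# The retrieved sentences would then be passed as hints to the copy/generate decoder.
```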
arXiv Detail & Related papers (2021-03-09T08:17:17Z) - Enriching Video Captions With Contextual Text [9.994985014558383]
We propose an end-to-end sequence-to-sequence model which generates video captions based on visual input.
We do not preprocess the text further, and let the model directly learn to attend over it.
arXiv Detail & Related papers (2020-07-29T08:58:52Z)