Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
- URL: http://arxiv.org/abs/2312.15720v1
- Date: Mon, 25 Dec 2023 13:13:04 GMT
- Title: Set Prediction Guided by Semantic Concepts for Diverse Video Captioning
- Authors: Yifan Lu, Ziqi Zhang, Chunfeng Yuan, Peng Li, Yan Wang, Bing Li,
Weiming Hu
- Abstract summary: We formulate diverse captioning into a semantic-concept-guided set prediction problem.
We apply a diversity regularization term on concepts to encourage the model to generate semantically diverse captions.
The proposed model achieves state-of-the-art (SOTA) performance under both relevance and diversity metrics.
- Score: 47.89731738027379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diverse video captioning aims to generate a set of sentences to describe the
given video in various aspects. Mainstream methods are trained with independent
pairs of a video and a caption from its ground-truth set without exploiting the
intra-set relationship, resulting in low diversity of generated captions.
Different from them, we formulate diverse captioning into a
semantic-concept-guided set prediction (SCG-SP) problem by fitting the
predicted caption set to the ground-truth set, where the set-level relationship
is fully captured. Specifically, our set prediction consists of two synergistic
tasks, i.e., caption generation and an auxiliary task of concept combination
prediction providing extra semantic supervision. Each caption in the set is
attached to a concept combination indicating the primary semantic content of
the caption and facilitating element alignment in set prediction. Furthermore,
we apply a diversity regularization term on concepts to encourage the model to
generate semantically diverse captions with various concept combinations. These
two tasks share multiple semantics-specific encodings as input, which are
obtained by iterative interaction between visual features and conceptual
queries. The correspondence between the generated captions and specific concept
combinations further guarantees the interpretability of our model. Extensive
experiments on benchmark datasets show that the proposed SCG-SP achieves
state-of-the-art (SOTA) performance under both relevance and diversity metrics.
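To make the set-prediction formulation concrete, the sketch below shows how a predicted caption set could be aligned to a ground-truth set through concept combinations via Hungarian matching, together with a pairwise diversity penalty over predicted concept vectors. This is a minimal illustration under assumed details: the function names, the binary cross-entropy matching cost, and the cosine-similarity penalty are my own stand-ins, not the paper's actual losses.

```python
# Minimal sketch of concept-guided set matching plus a concept diversity
# penalty. Illustrative assumptions throughout; not the paper's losses.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_caption_sets(pred_concepts, gt_concepts):
    """Align predicted captions to ground-truth captions via concept cost.

    pred_concepts: (N, C) predicted concept probabilities in (0, 1).
    gt_concepts:   (M, C) multi-hot ground-truth concept combinations.
    Returns (pred_idx, gt_idx) arrays for the minimum-cost assignment.
    """
    eps = 1e-8
    p = pred_concepts[:, None, :]  # (N, 1, C)
    g = gt_concepts[None, :, :]    # (1, M, C)
    # Binary cross-entropy summed over the concept vocabulary gives an
    # (N, M) matrix of matching costs between predictions and targets.
    cost = -(g * np.log(p + eps) + (1.0 - g) * np.log(1.0 - p + eps)).sum(-1)
    return linear_sum_assignment(cost)  # Hungarian algorithm

def concept_diversity_penalty(pred_concepts):
    """Mean pairwise cosine similarity among predicted concept vectors;
    minimizing it pushes captions toward distinct concept combinations."""
    x = pred_concepts / (np.linalg.norm(pred_concepts, axis=1, keepdims=True) + 1e-8)
    sim = x @ x.T
    n = sim.shape[0]
    return sim[~np.eye(n, dtype=bool)].mean()

# Toy example: 3 predicted captions vs. 3 references over 5 concepts.
rng = np.random.default_rng(0)
pred = rng.uniform(0.05, 0.95, size=(3, 5))
gt = (rng.uniform(size=(3, 5)) > 0.5).astype(float)
pred_idx, gt_idx = match_caption_sets(pred, gt)
print("assignment:", list(zip(pred_idx.tolist(), gt_idx.tolist())))
print("diversity penalty:", round(float(concept_diversity_penalty(pred)), 3))
```

On this toy input, the Hungarian step pairs each predicted caption with the ground-truth concept combination it explains best, and the penalty shrinks as the predicted concept vectors move apart.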
Related papers
- Collaboratively Self-supervised Video Representation Learning for Action Recognition [58.195372471117615]
We design a collaboratively self-supervised video representation learning framework specific to action recognition.
Our method achieves state-of-the-art performance on the UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2024-01-15T10:42:04Z)
- ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation [17.019848796027485]
Self-supervised visual pre-training models have shown great promise in representing pixel-level semantic relationships.
In this work, we investigate the pixel-level semantic aggregation in self-supervised pre-trained models as image encodings and design the concepts accordingly.
We propose the Adaptive Concept Generator (ACG) which adaptively maps these prototypes to informative concepts for each image.
arXiv Detail & Related papers (2022-10-12T06:16:34Z)
- Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample; a toy sketch of this extraction step appears after this list.
arXiv Detail & Related papers (2022-08-19T11:21:59Z)
- Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes the video semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
arXiv Detail & Related papers (2021-12-02T09:24:45Z)
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task, Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model's capability to generate syntax-varied and semantically coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task that aims to generate captions with respect to the relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z)
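As referenced in the Adaptive Spatio-temporal Attention entry above, several of these methods (and SCG-SP itself) rely on semantic concepts distilled from the pooled ground-truth captions of each video. The toy sketch below illustrates one naive way to do this by keeping the most frequent content words; the tokenizer, stop-word list, and `top_k` cutoff are arbitrary choices for illustration, not taken from any of the cited papers.

```python
# Toy sketch: derive candidate semantic concepts for a video by pooling
# its ground-truth captions and keeping frequent content words.
# (Assumed procedure for illustration; the cited papers typically rely
# on POS tagging and corpus-level statistics instead.)
from collections import Counter
import re

STOP_WORDS = {"a", "an", "the", "is", "are", "and", "of", "in", "on", "to", "with"}

def extract_concepts(captions, top_k=5):
    """Return the top_k most frequent non-stop-word tokens across captions."""
    tokens = []
    for caption in captions:
        tokens += [t for t in re.findall(r"[a-z]+", caption.lower())
                   if t not in STOP_WORDS]
    return [word for word, _ in Counter(tokens).most_common(top_k)]

captions = [
    "A man is playing a guitar on the street",
    "A street musician plays guitar",
    "Someone plays a guitar outdoors",
]
print(extract_concepts(captions))  # e.g. ['guitar', 'street', 'plays', ...]
```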