Relational Graph Learning for Grounded Video Description Generation
- URL: http://arxiv.org/abs/2112.00967v1
- Date: Thu, 2 Dec 2021 03:48:45 GMT
- Title: Relational Graph Learning for Grounded Video Description Generation
- Authors: Wenqiao Zhang, Xin Eric Wang, Siliang Tang, Haizhou Shi, Haocheng Shi,
Jun Xiao, Yueting Zhuang, William Yang Wang
- Abstract summary: Grounded video description (GVD) encourages captioning models to attend to appropriate video regions dynamically and generate a description.
Such a setting can help explain the decisions of captioning models and prevent the model from hallucinating object words in its description.
We design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts.
- Score: 85.27028390401136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounded video description (GVD) encourages captioning models to attend to
appropriate video regions (e.g., objects) dynamically and generate a
description. Such a setting can help explain the decisions of captioning models
and prevent the model from hallucinating object words in its description.
However, such a design mainly focuses on object word generation and thus may
ignore fine-grained information and suffer from missing visual concepts.
Moreover, relational words (e.g., "jump left or right") are usually the result
of spatio-temporal inference, i.e., such words cannot be grounded on particular
spatial regions. To tackle these limitations, we design a novel
relational graph learning framework for GVD, in which a language-refined scene
graph representation is designed to explore fine-grained visual concepts.
Furthermore, the refined graph can be regarded as relational inductive
knowledge to assist captioning models in selecting the relevant information it
needs to generate correct words. We validate the effectiveness of our model
through automatic metrics and human evaluation, and the results indicate that
our approach can generate more fine-grained and accurate descriptions and
alleviates the problem of object hallucination to some extent.
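The abstract gives no implementation details; purely as an illustration (all module and variable names below are hypothetical, not the authors' code), the following sketch shows one way a refined scene-graph representation could condition a captioning decoder: node features from the graph are attended over with the current decoder state, and the attention weights double as a grounding signal over graph nodes.

```python
# Minimal sketch (not the authors' implementation): attending over scene-graph
# node features to condition each decoding step of a captioning model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneGraphAttention(nn.Module):
    def __init__(self, node_dim, hidden_dim):
        super().__init__()
        self.proj_node = nn.Linear(node_dim, hidden_dim)
        self.proj_state = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, node_feats, decoder_state):
        # node_feats: (num_nodes, node_dim) -- object/relation nodes of the graph
        # decoder_state: (hidden_dim,)      -- current decoder hidden state
        keys = torch.tanh(self.proj_node(node_feats) + self.proj_state(decoder_state))
        attn = F.softmax(self.score(keys).squeeze(-1), dim=0)  # (num_nodes,)
        context = attn @ node_feats                             # (node_dim,)
        return context, attn  # attn can be read as grounding over graph nodes

# Toy usage: 5 graph nodes with 512-d features, 256-d decoder state.
attend = SceneGraphAttention(node_dim=512, hidden_dim=256)
ctx, weights = attend(torch.randn(5, 512), torch.randn(256))
print(ctx.shape, weights.shape)  # torch.Size([512]) torch.Size([5])
```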
Related papers
- Consensus Graph Representation Learning for Better Grounded Image
Captioning [48.208119537050166]
We propose the Consensus Graph Representation Learning framework (CGRL) for grounded image captioning.
We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset.
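CHAIRi is the instance-level object-hallucination metric referenced above. As a simplified sketch (exact-match object-word lookup, which glosses over the synonym handling used in practice), it is the fraction of object mentions in generated captions that are not among the ground-truth objects:

```python
# Hedged sketch of the instance-level CHAIR metric (CHAIRi): the fraction of
# object mentions in generated captions that do not correspond to objects
# actually present in the image/video. Object-word extraction is simplified.
def chair_i(captions, gt_objects, object_vocab):
    """captions: list of generated captions (str)
    gt_objects: list of sets of ground-truth object labels per sample
    object_vocab: set of object words the metric considers"""
    mentioned, hallucinated = 0, 0
    for caption, gt in zip(captions, gt_objects):
        for word in caption.lower().split():
            if word in object_vocab:
                mentioned += 1
                if word not in gt:
                    hallucinated += 1
    return hallucinated / max(mentioned, 1)

# Toy example: "dog" is mentioned in the second caption but not present.
print(chair_i(["a man holds a dog", "a dog runs"],
              [{"man", "dog"}, {"cat"}],
              {"man", "dog", "cat"}))  # 1 hallucinated of 3 mentions ~= 0.33
```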
arXiv Detail & Related papers (2021-12-02T04:17:01Z)
- Cross-Modal Graph with Meta Concepts for Video Captioning [101.97397967958722]
We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the visual regions that correspond to the text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
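The abstract does not spell out the weak grounding step; purely as an illustration (not the paper's method), a common weak-alignment scheme scores each text-concept embedding against all region features and keeps the best-matching region:

```python
# Illustrative only (not the paper's method): weakly associate text concepts
# with visual regions by cosine similarity, keeping the best region per concept.
import torch
import torch.nn.functional as F

def weak_region_alignment(concept_embs, region_feats):
    # concept_embs: (num_concepts, d), region_feats: (num_regions, d)
    sims = F.normalize(concept_embs, dim=-1) @ F.normalize(region_feats, dim=-1).T
    best_region = sims.argmax(dim=-1)  # index of the best region per concept
    return best_region, sims

idx, sims = weak_region_alignment(torch.randn(4, 256), torch.randn(10, 256))
print(idx.shape, sims.shape)  # torch.Size([4]) torch.Size([4, 10])
```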
arXiv Detail & Related papers (2021-08-14T04:00:42Z)
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
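As a generic illustration of the teacher-to-student transfer described above, the sketch below shows a standard soft-label distillation loss (VidLanKD's actual objectives may differ): the student is trained to match the frozen teacher's softened output distribution.

```python
# Generic knowledge-distillation loss sketch (standard soft-label KD, shown
# only to illustrate teacher-to-student transfer, not VidLanKD's exact loss).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # logits: (batch, vocab); the teacher is frozen, so its logits are detached.
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits.detach() / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

loss = distillation_loss(torch.randn(8, 30522), torch.randn(8, 30522))
print(float(loss))
```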
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid unstable performance caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
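As a simplified illustration of a space-time object graph (not the paper's exact construction), objects within the same frame can be connected by spatial edges and objects in consecutive frames by temporal edges:

```python
# Simplified illustration (not the paper's exact construction) of a
# spatio-temporal object graph: spatial edges connect objects within a frame,
# temporal edges connect objects in consecutive frames.
def build_st_graph(objects_per_frame):
    """objects_per_frame: list of lists of object ids, one list per frame."""
    spatial_edges, temporal_edges = [], []
    for frame in objects_per_frame:
        spatial_edges += [(a, b) for a in frame for b in frame if a != b]
    for prev, curr in zip(objects_per_frame, objects_per_frame[1:]):
        temporal_edges += [(a, b) for a in prev for b in curr]
    return spatial_edges, temporal_edges

# Two frames with object ids 0,1 and 2,3.
print(build_st_graph([[0, 1], [2, 3]]))
# ([(0, 1), (1, 0), (2, 3), (3, 2)], [(0, 2), (0, 3), (1, 2), (1, 3)])
```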
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
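As a rough sketch with hypothetical names (not the authors' implementation), an ORG-style encoder can enrich detected-object features through a learned, softmax-normalized relation matrix followed by a graph-convolution-like update; the teacher-recommended learning part, which transfers soft targets from an external language model, is conceptually similar to the distillation loss sketched under VidLanKD above.

```python
# Minimal sketch (not the authors' code) of an object relational graph (ORG)
# style encoder: a learned, normalized adjacency over detected-object features,
# followed by a graph-convolution-like update with a residual connection.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectRelationEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, obj_feats):
        # obj_feats: (num_objects, dim) from an off-the-shelf detector
        adj = F.softmax(self.q(obj_feats) @ self.k(obj_feats).T
                        / obj_feats.size(-1) ** 0.5, dim=-1)  # learned relations
        return F.relu(self.update(adj @ obj_feats)) + obj_feats  # enriched features

enc = ObjectRelationEncoder(dim=512)
print(enc(torch.randn(6, 512)).shape)  # torch.Size([6, 512])
```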
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.