Cross-Modal Graph with Meta Concepts for Video Captioning
- URL: http://arxiv.org/abs/2108.06458v1
- Date: Sat, 14 Aug 2021 04:00:42 GMT
- Title: Cross-Modal Graph with Meta Concepts for Video Captioning
- Authors: Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
- Abstract summary: We propose Cross-Modal Graph (CMG) with meta concepts for video captioning.
To cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions.
We construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures.
- Score: 101.97397967958722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning aims to interpret complex visual content as text descriptions, which requires the model to fully understand video scenes, including objects and their interactions. Prevailing methods adopt off-the-shelf object detection networks to generate object proposals and use the attention mechanism to model the relations between objects. They often miss semantic concepts that the pretrained detector does not cover and fail to identify the exact predicate relationships between objects. In this paper, we investigate the open research task of generating text descriptions for given videos, and propose Cross-Modal Graph (CMG) with meta concepts for video captioning. Specifically, to cover the useful semantic concepts in video captions, we weakly learn the corresponding visual regions for text descriptions, where the associated visual regions and textual words are named cross-modal meta concepts. We further build meta concept graphs dynamically with the learned cross-modal meta concepts. We also construct holistic video-level and local frame-level video graphs with the predicted predicates to model video sequence structures. We validate the efficacy of our proposed techniques with extensive experiments and achieve state-of-the-art results on two public datasets.
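As a rough illustration of the abstract's core idea, the sketch below weakly grounds caption words in detected regions via soft attention, fuses each word with its attended region feature into a "meta concept" node, and builds a dynamic top-k similarity graph over those nodes with one step of message passing. This is only a minimal sketch under assumed shapes and a plain mean-aggregation update, not the authors' CMG implementation; the function name, dimensions, and top-k value are illustrative.

```python
import torch
import torch.nn.functional as F

def meta_concept_graph(region_feats, word_feats, top_k=3):
    """Illustrative sketch (not the paper's implementation): weakly ground
    caption words in visual regions via soft attention, form "meta concept"
    nodes from the paired word/region features, and run one round of
    mean-aggregation message passing over a dynamic top-k similarity graph.

    region_feats: (R, D) pooled features of R detected regions
    word_feats:   (W, D) embeddings of W caption words (same dim D assumed)
    """
    # Soft grounding: each word attends over all regions.
    attn = F.softmax(word_feats @ region_feats.t(), dim=-1)   # (W, R)
    grounded = attn @ region_feats                            # (W, D)

    # A meta-concept node pairs a word with its grounded visual feature.
    nodes = torch.cat([word_feats, grounded], dim=-1)         # (W, 2D)

    # Dynamic graph: connect each node to its top-k most similar nodes.
    normed = F.normalize(nodes, dim=-1)
    sim = normed @ normed.t()                                 # (W, W)
    k = min(top_k + 1, nodes.size(0))                         # +1 keeps the self-loop
    idx = sim.topk(k, dim=-1).indices
    adj = torch.zeros_like(sim).scatter_(1, idx, 1.0)
    adj = adj / adj.sum(dim=-1, keepdim=True)                 # row-normalized adjacency

    # One message-passing step (plain mean aggregation, no learned weights).
    return adj @ nodes                                        # (W, 2D)

# Toy usage with random features standing in for detector and word embeddings.
regions = torch.randn(10, 256)
words = torch.randn(6, 256)
meta_nodes = meta_concept_graph(regions, words)               # (6, 512)
```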
Related papers
- SceneGATE: Scene-Graph based co-Attention networks for TExt visual question answering [2.8974040580489198]
The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA.
It reveals the semantic relations among objects, Optical Character Recognition (OCR) tokens, and question words.
This is achieved with a TextVQA-based scene graph that discovers the underlying semantics of an image. A generic sketch of co-attention between question words and scene entities is given below.
arXiv Detail & Related papers (2022-12-16T05:10:09Z)
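The SceneGATE entry above relates question words to detected objects and OCR tokens. Below is a generic co-attention sketch over those three inputs; the function name, feature dimensions, and the simple dot-product affinity are illustrative assumptions, not the SceneGATE architecture.

```python
import torch
import torch.nn.functional as F

def co_attention(question, objects, ocr_tokens):
    """Generic co-attention sketch: question words attend over scene entities
    (detected objects plus OCR tokens), and entities attend back over words.

    question:   (Q, D) word embeddings
    objects:    (O, D) object region features (projected to dim D)
    ocr_tokens: (T, D) OCR token features (projected to dim D)
    """
    entities = torch.cat([objects, ocr_tokens], dim=0)      # (O+T, D)
    affinity = question @ entities.t()                      # (Q, O+T)
    q2e = F.softmax(affinity, dim=-1) @ entities            # words enriched with entities
    e2q = F.softmax(affinity.t(), dim=-1) @ question        # entities enriched with words
    return q2e, e2q

q, o, t = torch.randn(7, 128), torch.randn(5, 128), torch.randn(3, 128)
word_ctx, entity_ctx = co_attention(q, o, t)
```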
- Modeling Semantic Composition with Syntactic Hypergraph for Video Question Answering [14.033438649614219]
A key challenge in video question answering is realizing cross-modal semantic alignment between textual concepts and the corresponding visual objects.
We propose to first build a syntactic dependency tree for each question with an off-the-shelf tool and extract semantic compositions from it.
Based on the extracted compositions, a hypergraph is further built by viewing the words as nodes and the compositions as hyperedges. A minimal sketch of this hypergraph construction is given below.
arXiv Detail & Related papers (2022-05-13T09:28:13Z)
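The syntactic-hypergraph entry above treats question words as nodes and extracted compositions as hyperedges. The snippet below is a generic, self-contained illustration of that construction: it builds an incidence matrix from given compositions (assumed to come from any dependency parser) and performs one node-to-hyperedge-to-node averaging pass. It is a sketch under assumed inputs, not the paper's model.

```python
import torch

def hypergraph_propagate(node_feats, compositions):
    """Generic sketch of hypergraph message passing.

    node_feats:   (N, D) one feature per word/node
    compositions: list of index tuples, each tuple is one hyperedge
                  (e.g. the words forming one syntactic composition)
    """
    n, _ = node_feats.shape
    e = len(compositions)
    # Incidence matrix H: H[i, j] = 1 if node i belongs to hyperedge j.
    H = torch.zeros(n, e)
    for j, edge in enumerate(compositions):
        H[list(edge), j] = 1.0

    # Node -> hyperedge: average the member nodes of each hyperedge.
    edge_feats = (H.t() @ node_feats) / H.sum(dim=0, keepdim=True).t().clamp(min=1)
    # Hyperedge -> node: average the hyperedges each node participates in.
    node_out = (H @ edge_feats) / H.sum(dim=1, keepdim=True).clamp(min=1)
    return node_out

# Example: a 5-word question with two compositions, e.g. "red car" and "car on road".
feats = torch.randn(5, 64)
out = hypergraph_propagate(feats, [(1, 2), (2, 3, 4)])
```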
- Discourse Analysis for Evaluating Coherence in Video Paragraph Captions [99.37090317971312]
We explore a novel discourse-based framework to evaluate the coherence of video paragraphs.
Central to our approach is the discourse representation of videos, which helps model the coherence of paragraphs conditioned on the coherence of videos.
Our experimental results show that the proposed framework evaluates the coherence of video paragraphs significantly better than all baseline methods.
arXiv Detail & Related papers (2022-01-17T04:23:08Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that although a video is presented as a frame sequence, its visual elements are not sequential but hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy that weaves together visual facts of different granularity in a level-wise manner. A rough sketch of such level-wise aggregation is given below.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
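The conditional-graph-hierarchy entry above aggregates visual facts level by level (e.g., objects into frames, frames into the video), conditioned on the question. The sketch below shows one plausible reading of such level-wise, query-conditioned pooling; the module, shapes, and two-level toy hierarchy are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class LevelAggregator(nn.Module):
    """One level of a (hypothetical) conditional hierarchy: pool each group of
    lower-level node features into one higher-level node, with attention
    weights conditioned on a query (e.g. the question embedding)."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, nodes, query):
        # nodes: (G, M, D) G groups of M lower-level nodes; query: (D,)
        q = query.expand(nodes.size(0), nodes.size(1), -1)
        w = torch.softmax(self.score(torch.cat([nodes, q], dim=-1)), dim=1)  # (G, M, 1)
        return (w * nodes).sum(dim=1)                                        # (G, D)

# Toy hierarchy: 8 frames x 5 objects -> 8 frame nodes -> 1 video node.
dim = 32
objects = torch.randn(8, 5, dim)
question = torch.randn(dim)
frame_level = LevelAggregator(dim)(objects, question)                   # (8, dim)
video_level = LevelAggregator(dim)(frame_level.unsqueeze(0), question)  # (1, dim)
```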
- Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
- Relational Graph Learning for Grounded Video Description Generation [85.27028390401136]
Grounded video description (GVD) encourages captioning models to attend dynamically to appropriate video regions while generating a description.
Such a setting helps explain the decisions of captioning models and prevents them from hallucinating object words in their descriptions.
We design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation explores fine-grained visual concepts. A generic sketch of relational message passing over such a graph is given below.
arXiv Detail & Related papers (2021-12-02T03:48:45Z)
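The GVD entry above builds a scene graph over detected regions with typed relations. Below is a small, generic sketch of relation-conditioned message passing over such a graph; the layer name, shapes, and toy edges are illustrative assumptions, not the paper's framework.

```python
import torch
import torch.nn as nn

class RelationalGraphLayer(nn.Module):
    """Generic relation-aware message passing over a scene graph: each edge
    carries a relation embedding, and a message computed from the source
    node and the relation is summed at the destination node."""

    def __init__(self, dim, num_relations):
        super().__init__()
        self.rel_emb = nn.Embedding(num_relations, dim)
        self.msg = nn.Linear(2 * dim, dim)

    def forward(self, node_feats, edges, rel_ids):
        # node_feats: (N, D); edges: (E, 2) [src, dst] indices; rel_ids: (E,)
        src, dst = edges[:, 0], edges[:, 1]
        messages = self.msg(torch.cat([node_feats[src], self.rel_emb(rel_ids)], dim=-1))
        updated = node_feats.index_add(0, dst, messages)  # aggregate at destinations
        return torch.relu(updated)

# Toy scene graph over 4 region nodes with two typed edges,
# e.g. (region0 -rel3-> region1) and (region1 -rel5-> region2).
layer = RelationalGraphLayer(dim=64, num_relations=10)
feats = torch.randn(4, 64)
edges = torch.tensor([[0, 1], [1, 2]])
rels = torch.tensor([3, 5])
refined = layer(feats, edges, rels)
```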
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting: generating a realistic image from given objects and captions.
Under this setting, objects explicitly define the critical roles in the target images, while captions implicitly describe their rich attributes and connections.
MOC-GAN is proposed to mix the inputs of the two modalities and generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method to learn relations between videos and their paired text descriptions. A generic sparse-coding sketch of this idea is given below.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
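The neuro-symbolic entry above mentions a dictionary-learning-based method for relating videos to their paired descriptions. The code below is only a generic sparse-coding illustration of the shared-dictionary idea (inferring sparse codes over a common set of atoms); the loss, optimizer, and all hyperparameters are assumptions, not the paper's formulation.

```python
import torch

def sparse_code(x, dictionary, steps=50, lr=0.1, l1=0.05):
    """Generic sparse coding: find codes a such that a @ dictionary ~ x,
    with an L1 penalty encouraging each sample to use few atoms.

    x:          (B, D) features (e.g. pooled video or sentence features)
    dictionary: (K, D) K shared atoms
    """
    codes = torch.zeros(x.size(0), dictionary.size(0), requires_grad=True)
    opt = torch.optim.Adam([codes], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon = codes @ dictionary
        loss = ((recon - x) ** 2).mean() + l1 * codes.abs().mean()
        loss.backward()
        opt.step()
    return codes.detach()

# A shared dictionary lets paired videos and captions be compared in code space.
atoms = torch.randn(64, 256)
video_codes = sparse_code(torch.randn(4, 256), atoms)
text_codes = sparse_code(torch.randn(4, 256), atoms)
```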