Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph
- URL: http://arxiv.org/abs/2107.11970v1
- Date: Mon, 26 Jul 2021 05:50:41 GMT
- Title: Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph
- Authors: Wentian Zhao, Yao Hu, Heda Wang, Xinxiao Wu, Jiebo Luo
- Abstract summary: It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
- Score: 96.95815946327079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Entity-aware image captioning aims to describe named entities and events
related to the image by utilizing the background knowledge in the associated
article. This task remains challenging as it is difficult to learn the
association between named entities and visual cues due to the long-tail
distribution of named entities. Furthermore, the complexity of the article
brings difficulty in extracting fine-grained relationships between entities to
generate informative event descriptions about the image. To tackle these
challenges, we propose a novel approach that constructs a multi-modal knowledge
graph to associate the visual objects with named entities and capture the
relationship between entities simultaneously with the help of external
knowledge collected from the web. Specifically, we build a text sub-graph by
extracting named entities and their relationships from the article, and build
an image sub-graph by detecting the objects in the image. To connect these two
sub-graphs, we propose a cross-modal entity matching module trained using a
knowledge base that contains Wikipedia entries and the corresponding images.
Finally, the multi-modal knowledge graph is integrated into the captioning
model via a graph attention mechanism. Extensive experiments on both GoodNews
and NYTimes800k datasets demonstrate the effectiveness of our method.
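The following is a minimal, self-contained sketch of how the pieces described above fit together: a text sub-graph of named entities and relations, an image sub-graph of detected objects, a cross-modal matching step that links the two, and a graph attention step over the merged graph. The helper names, the cosine-similarity matcher, and the toy embeddings are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np

def extract_text_subgraph(entities, relations):
    """Text sub-graph: named entities as nodes, extracted relations as edges."""
    nodes = {e: {"type": "entity"} for e in entities}
    edges = [(h, r, t) for h, r, t in relations if h in nodes and t in nodes]
    return nodes, edges

def detect_image_objects(object_labels):
    """Image sub-graph: one node per detected object (labels stand in for a detector)."""
    return {f"obj:{i}": {"type": "object", "label": lbl}
            for i, lbl in enumerate(object_labels)}

def match_entities(entity_vecs, object_vecs, threshold=0.5):
    """Cross-modal entity matching: link an object node to an entity node when their
    embeddings are close (cosine similarity stands in for the trained matching module)."""
    links = []
    for e, ev in entity_vecs.items():
        for o, ov in object_vecs.items():
            sim = float(ev @ ov / (np.linalg.norm(ev) * np.linalg.norm(ov) + 1e-8))
            if sim > threshold:
                links.append((e, "depicted_by", o))
    return links

def graph_attention(node_vecs, edges, query):
    """Single-head graph attention: each node aggregates its neighbours, weighted by
    similarity between the neighbour embedding and a decoder query vector."""
    out = {}
    for node in node_vecs:
        nbrs = [t for h, _, t in edges if h == node] or [node]
        keys = np.stack([node_vecs[n] for n in nbrs])
        scores = keys @ query
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[node] = weights @ keys
    return out

# Toy run: two named entities from an article, one detected object, random embeddings.
rng = np.random.default_rng(0)
ent_vecs = {"Angela_Merkel": rng.normal(size=8), "Berlin": rng.normal(size=8)}
obj_vecs = {"obj:0": ent_vecs["Angela_Merkel"] + 0.05 * rng.normal(size=8)}
_, text_edges = extract_text_subgraph(ent_vecs, [("Angela_Merkel", "visited", "Berlin")])
image_nodes = detect_image_objects(["person"])
links = match_entities(ent_vecs, obj_vecs)
all_edges = text_edges + links
all_vecs = {**ent_vecs, **obj_vecs}
context = graph_attention(all_vecs, all_edges, query=rng.normal(size=8))
print(f"{len(image_nodes)} image node(s), {len(links)} cross-modal link(s), "
      f"{len(context)} attended node vectors")
```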
Related papers
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
ARMADA extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
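As a loose illustration only (not the paper's dual-encoder model), a caption can be held as a small scene graph of object and attribute nodes joined by relation edges, with pooled node embeddings scored against an image embedding; the SceneGraph structure and the toy embed stand-in below are assumptions.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SceneGraph:
    objects: list                                     # e.g. ["dog", "frisbee"]
    attributes: dict = field(default_factory=dict)    # object -> list of attributes
    relations: list = field(default_factory=list)     # (subject, predicate, object)

def embed(text, dim=16):
    """Toy stand-in for a learned encoder: seed an RNG from the characters of the text."""
    rng = np.random.default_rng(sum(ord(c) for c in text))
    return rng.normal(size=dim)

def encode_caption(graph):
    """Pool object, attribute-object, and relation-phrase embeddings into one vector."""
    pieces = list(graph.objects)
    pieces += [f"{a} {o}" for o, attrs in graph.attributes.items() for a in attrs]
    pieces += [f"{s} {p} {o}" for s, p, o in graph.relations]
    return np.stack([embed(p) for p in pieces]).mean(axis=0)

g = SceneGraph(objects=["dog", "frisbee"],
               attributes={"dog": ["brown"]},
               relations=[("dog", "catching", "frisbee")])
caption_vec = encode_caption(g)
image_vec = embed("image-region features would come from a visual encoder")
print("matching score:", round(float(caption_vec @ image_vec), 3))
```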
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Few-Shot Relation Extraction with Hybrid Visual Evidence [3.154631846975021]
We propose a multi-modal few-shot relation extraction model (MFS-HVE).
MFS-HVE includes semantic feature extractors and multi-modal fusion components.
Experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.
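The gated fusion below is a generic sketch of combining a textual relation feature with visual evidence; the gate formulation, dimensions, and random weights are assumptions rather than MFS-HVE's actual components.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text_feat, visual_feat, w_gate, b_gate):
    """A learned gate decides, per dimension, how much to trust text vs. visual evidence."""
    g = sigmoid(w_gate @ np.concatenate([text_feat, visual_feat]) + b_gate)
    return g * text_feat + (1.0 - g) * visual_feat

rng = np.random.default_rng(1)
d = 8
text_feat, visual_feat = rng.normal(size=d), rng.normal(size=d)
w_gate, b_gate = rng.normal(size=(d, 2 * d)), rng.normal(size=d)  # would be trained in practice
print("fused feature:", gated_fusion(text_feat, visual_feat, w_gate, b_gate).round(2))
```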
arXiv Detail & Related papers (2024-03-01T18:20:11Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
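A highly simplified sketch of topic-guided decoding follows, assuming a pre-computed topic distribution and visual feature; biasing the vocabulary logits with a linear projection is an assumption, not the paper's variational inference network.

```python
import numpy as np

def topic_guided_logits(hidden, topic_dist, visual_feat, w_h, w_t, w_v):
    """Bias next-word logits with the learned topic distribution and visual features."""
    return w_h @ hidden + w_t @ topic_dist + w_v @ visual_feat

rng = np.random.default_rng(2)
vocab, d_h, n_topics, d_v = 12, 8, 4, 6
w_h = rng.normal(size=(vocab, d_h))
w_t = rng.normal(size=(vocab, n_topics))
w_v = rng.normal(size=(vocab, d_v))
hidden = rng.normal(size=d_h)                   # decoder state for the current word
topic_dist = np.array([0.7, 0.2, 0.05, 0.05])   # hierarchical topics, flattened for brevity
visual_feat = rng.normal(size=d_v)
logits = topic_guided_logits(hidden, topic_dist, visual_feat, w_h, w_t, w_v)
print("most likely next-word index:", int(logits.argmax()))
```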
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z)
- Multi-Modal Retrieval using Graph Neural Networks [1.8911962184174562]
We learn a joint vision and concept embedding in the same high-dimensional space.
We model the visual and concept relationships as a graph structure.
We also introduce a novel inference time control, based on selective neighborhood connectivity.
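The sketch below, with made-up node names and a single mean-aggregation step, only gestures at embedding visual and concept nodes in one graph and restricting which neighbours are used at inference time; it is not the paper's network.

```python
import numpy as np

def aggregate(node_vecs, adjacency, keep=None):
    """One mean-aggregation message-passing step; `keep` optionally restricts which
    neighbours are used (a crude stand-in for selective neighbourhood connectivity)."""
    out = {}
    for node, vec in node_vecs.items():
        nbrs = [n for n in adjacency.get(node, []) if keep is None or n in keep]
        out[node] = np.stack([vec] + [node_vecs[n] for n in nbrs]).mean(axis=0)
    return out

rng = np.random.default_rng(3)
# Visual nodes (images) and concept nodes share one embedding space.
names = ["img:sofa", "img:chair", "concept:furniture", "concept:living_room"]
node_vecs = {name: rng.normal(size=8) for name in names}
adjacency = {"img:sofa": ["concept:furniture", "concept:living_room"],
             "img:chair": ["concept:furniture"],
             "concept:furniture": ["img:sofa", "img:chair"],
             "concept:living_room": ["img:sofa"]}
full = aggregate(node_vecs, adjacency)
restricted = aggregate(node_vecs, adjacency, keep={"concept:furniture"})
query = node_vecs["concept:furniture"]
best = max((n for n in full if n.startswith("img:")), key=lambda n: float(full[n] @ query))
print("best image for 'furniture':", best, "| restricted aggregation differs:",
      not np.allclose(full["img:sofa"], restricted["img:sofa"]))
```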
arXiv Detail & Related papers (2020-10-04T19:34:20Z)
- Learning semantic Image attributes using Image recognition and knowledge graph embeddings [0.3222802562733786]
We propose a shared learning approach to learn semantic attributes of images by combining a knowledge graph embedding model with the recognized attributes of images.
The proposed approach is a step towards bridging the gap between frameworks which learn from large amounts of data and frameworks which use a limited set of predicates to infer new knowledge.
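As a rough illustration (not the proposed framework), attributes recognized in an image can be cast as (image, has_attribute, attribute) triples and scored with a TransE-style knowledge graph embedding; the entity names and scoring function below are assumptions.

```python
import numpy as np

def transe_score(h, r, t):
    """TransE-style plausibility: smaller ||h + r - t|| means a more plausible triple."""
    return -np.linalg.norm(h + r - t)

rng = np.random.default_rng(4)
dim = 8
# Embeddings for entities (images, attributes) and one relation; trained jointly in practice.
emb = {name: rng.normal(size=dim) for name in ["img_001", "striped", "furry", "has_attribute"]}
# Attributes recognized by an image classifier become candidate triples to score.
recognized = [("img_001", "has_attribute", "striped"), ("img_001", "has_attribute", "furry")]
for h, r, t in recognized:
    print(h, r, t, "score:", round(float(transe_score(emb[h], emb[r], emb[t])), 3))
```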
arXiv Detail & Related papers (2020-09-12T15:18:48Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
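A bare-bones sketch of the multi-task idea follows, assuming per-step distributions over words and over object/predicate tags; the loss weighting and shapes are assumptions, not the paper's training objective.

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the target indices under per-step distributions."""
    return float(-np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-9)))

def multitask_loss(word_probs, word_targets, tag_probs, tag_targets, tag_weight=0.5):
    """Joint objective: word prediction plus object/predicate tag prediction."""
    return cross_entropy(word_probs, word_targets) + tag_weight * cross_entropy(tag_probs, tag_targets)

rng = np.random.default_rng(5)
steps, vocab, n_tags = 4, 10, 5
word_probs = rng.dirichlet(np.ones(vocab), size=steps)   # decoder word distributions per step
tag_probs = rng.dirichlet(np.ones(n_tags), size=steps)   # object/predicate tag distributions per step
word_targets = rng.integers(0, vocab, steps)
tag_targets = rng.integers(0, n_tags, steps)
print("multi-task loss:", round(multitask_loss(word_probs, word_targets, tag_probs, tag_targets), 3))
```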
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
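A toy sketch of entity-level masking, assuming the knowledge graph contributes only a list of entity surface forms; the string matching and masking policy are assumptions, not the paper's pre-training scheme.

```python
import re

def mask_entities(text, kg_entities, mask_token="[MASK]"):
    """Mask whole entity mentions (drawn from a KG vocabulary) instead of random tokens."""
    masked = text
    for entity in sorted(kg_entities, key=len, reverse=True):   # longest match first
        masked = re.sub(re.escape(entity), mask_token, masked)
    return masked

kg_entities = ["Marie Curie", "Nobel Prize", "Warsaw"]
sentence = "Marie Curie, born in Warsaw, received the Nobel Prize in Physics."
print(mask_entities(sentence, kg_entities))
# -> "[MASK], born in [MASK], received the [MASK] in Physics."
```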
arXiv Detail & Related papers (2020-04-29T14:22:42Z)