ReFormer: The Relational Transformer for Image Captioning
- URL: http://arxiv.org/abs/2107.14178v1
- Date: Thu, 29 Jul 2021 17:03:36 GMT
- Title: ReFormer: The Relational Transformer for Image Captioning
- Authors: Xuewen Yang, Yingru Liu, Xin Wang
- Abstract summary: Image captioning has been shown to achieve better performance by using scene graphs to represent the relations between objects in the image.
We propose a novel architecture ReFormer to generate features with relation information embedded.
Our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
- Score: 12.184772369145014
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Image captioning has been shown to achieve better performance by using
scene graphs to represent the relations between objects in the image. The current
captioning encoders generally use a Graph Convolutional Net (GCN) to represent
the relation information and merge it with the object region features via
concatenation or convolution to get the final input for sentence decoding.
However, the GCN-based encoders in the existing methods are less effective for
captioning due to two reasons. First, using image captioning itself as the
objective (i.e., Maximum Likelihood Estimation) rather than a relation-centric
loss cannot fully explore the potential of the encoder. Second, using a
pre-trained model instead of the encoder itself to extract the relationships is
not flexible and cannot contribute to the explainability of the model. To
improve the quality of image captioning, we propose a novel architecture
ReFormer -- a RElational transFORMER to generate features with relation
information embedded and to explicitly express the pair-wise relationships
between objects in the image. ReFormer incorporates the objective of scene
graph generation with that of image captioning using one modified Transformer
model. This design allows ReFormer to generate not only better image captions
with the benefit of extracting strong relational image features, but also
scene graphs to explicitly describe the pair-wise relationships. Experiments
on publicly available datasets show that our model significantly outperforms
state-of-the-art methods on image captioning and scene graph generation.
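The joint objective described in the abstract can be pictured with a short sketch. The following is a hedged illustration, not the authors' released code: one Transformer encoder produces relation-aware region features, a caption decoder is trained with maximum likelihood, and a relation head classifies the predicate for every ordered pair of regions. All module names, layer sizes, and the loss weighting are assumptions made for this example.

```python
# Minimal sketch of a captioning model with an auxiliary scene-graph (relation) head.
# This illustrates the idea in the abstract; it is not the ReFormer implementation.
import torch
import torch.nn as nn

class RelationAwareCaptioner(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000, num_relations=51):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=3)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.word_out = nn.Linear(d_model, vocab_size)
        # relation head scores every ordered (subject, object) pair of regions
        self.rel_head = nn.Linear(2 * d_model, num_relations)

    def forward(self, regions, caption_in):
        # regions: (B, N, d_model) object region features; caption_in: (B, T) word ids
        mem = self.encoder(regions)                        # relation-aware region features
        tgt = self.word_emb(caption_in)
        hidden = self.decoder(tgt, mem)                    # (causal mask omitted for brevity)
        word_logits = self.word_out(hidden)                # (B, T, vocab_size)
        B, N, d = mem.shape
        subj = mem.unsqueeze(2).expand(B, N, N, d)
        obj = mem.unsqueeze(1).expand(B, N, N, d)
        rel_logits = self.rel_head(torch.cat([subj, obj], dim=-1))  # (B, N, N, num_relations)
        return word_logits, rel_logits

# Training would minimize caption cross-entropy plus a weighted relation cross-entropy;
# the relative weight is a hyperparameter and is not specified by the abstract.
```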
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
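The retrieval step can be illustrated with a small sketch. Everything below (the flat NumPy datastore, cosine similarity, and the function name) is an assumption made for illustration; the paper only describes the retriever as being based on visual similarities.

```python
# Toy kNN retrieval: given a global feature for the query image, return the captions
# attached to the k most visually similar images in a datastore.
import numpy as np

def knn_retrieve(query_feat, datastore_feats, datastore_captions, k=5):
    """query_feat: (d,), datastore_feats: (N, d), datastore_captions: list of N strings."""
    q = query_feat / np.linalg.norm(query_feat)
    d = datastore_feats / np.linalg.norm(datastore_feats, axis=1, keepdims=True)
    sims = d @ q                                 # (N,) cosine similarities
    top = np.argsort(-sims)[:k]                  # indices of the k nearest images
    return [datastore_captions[i] for i in top]

# The retrieved captions would then be passed to the captioning model as extra context.
```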
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN)
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
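As a toy illustration of adjusting text embeddings with relation-aware message passing (a generic stand-in with made-up module names and one edge type per direction, not the paper's HGCN):

```python
# One round of direction-specific message passing between the subject and object token
# embeddings of a relation triple; the adjusted embeddings would replace the originals.
import torch
import torch.nn as nn

class RelationAwareAdjust(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.w_self = nn.Linear(d, d)
        self.w_subj_to_obj = nn.Linear(d, d)   # message along subject -> object edges
        self.w_obj_to_subj = nn.Linear(d, d)   # message along object -> subject edges

    def forward(self, subj_emb, obj_emb):
        subj_new = torch.relu(self.w_self(subj_emb) + self.w_obj_to_subj(obj_emb))
        obj_new = torch.relu(self.w_self(obj_emb) + self.w_subj_to_obj(subj_emb))
        return subj_new, obj_new
```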
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning [0.65268245109828]
Coherent entity-aware multi-image captioning aims to generate coherent captions for neighboring images in a news document.
This paper proposes a coherent entity-aware multi-image captioning model by making use of coherence relationships.
arXiv Detail & Related papers (2023-02-04T07:50:31Z)
- Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training [112.94542676251133]
We propose to learn scene graph embeddings by directly optimizing their alignment with images.
Specifically, we pre-train an encoder to extract both global and local information from scene graphs.
The resulting method, called SGDiff, allows for the semantic manipulation of generated images by modifying scene graph nodes and connections.
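The summary states the alignment objective only abstractly. One common way to optimize such an alignment is a symmetric InfoNCE (contrastive) loss between paired scene-graph and image embeddings; the sketch below shows that generic loss and should not be read as SGDiff's exact pre-training objective.

```python
# Generic symmetric contrastive loss between paired scene-graph and image embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(graph_emb, image_emb, temperature=0.07):
    """graph_emb, image_emb: (B, d) embeddings of matched graph-image pairs."""
    g = F.normalize(graph_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = g @ v.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(g.size(0), device=g.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```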
arXiv Detail & Related papers (2022-11-21T01:11:19Z)
- Transforming Image Generation from Scene Graphs [11.443097632746763]
We propose a transformer-based approach conditioned by scene graphs that employs a decoder to autoregressively compose images.
The proposed architecture is composed of three modules: 1) a graph convolutional network, to encode the relationships of the input graph; 2) an encoder-decoder transformer, which autoregressively composes the output image; 3) an auto-encoder, employed to generate representations used as input/output of each generation step by the transformer.
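To make the three-module layout concrete, here is a minimal skeleton written under my own assumptions: a single linear layer stands in for the graph convolutional network, and a learned codebook embedding stands in for the auto-encoder's discrete image tokens. It mirrors only the shape of the pipeline; it is not the authors' code.

```python
# Skeleton of a scene-graph-conditioned autoregressive image-token generator.
import torch
import torch.nn as nn

class SceneGraphToImageSketch(nn.Module):
    def __init__(self, d_model=256, codebook_size=1024):
        super().__init__()
        self.node_proj = nn.Linear(d_model, d_model)       # stand-in for a real GCN
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=2, num_decoder_layers=2,
                                          batch_first=True)
        self.token_emb = nn.Embedding(codebook_size, d_model)  # auto-encoder codebook tokens
        self.token_out = nn.Linear(d_model, codebook_size)     # next-token distribution

    def forward(self, node_feats, image_tokens_in):
        # node_feats: (B, N, d_model) scene-graph node features; image_tokens_in: (B, T) ids
        graph_enc = torch.relu(self.node_proj(node_feats))
        tgt = self.token_emb(image_tokens_in)
        hidden = self.transformer(graph_enc, tgt)          # (causal mask omitted for brevity)
        return self.token_out(hidden)                       # (B, T, codebook_size)
```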
arXiv Detail & Related papers (2022-07-01T16:59:38Z)
- SG2Caps: Revisiting Scene Graphs for Image Captioning [37.58310822924814]
We propose a framework, SG2Caps, that utilizes only the scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene graph-only captioning models by a large margin (CIDEr score of 110 vs. 71), indicating that scene graphs are a promising representation for image captioning.
arXiv Detail & Related papers (2021-02-09T18:00:53Z)
- Are scene graphs good enough to improve Image Captioning? [19.36188161855731]
We investigate the use of scene graphs in image captioning.
We find no significant difference between models that use scene graph features and models that only use object detection features.
Although the quality of predicted scene graphs is generally very low, we obtain gains of up to 3.3 CIDEr when using high-quality scene graphs.
arXiv Detail & Related papers (2020-09-25T16:09:08Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
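A tiny sketch of that multi-task step, with hypothetical head names and tag set: the shared decoder hidden state feeds one head predicting the next word and one predicting an object/predicate tag, and training sums the two cross-entropy losses.

```python
# Two output heads over a shared decoder state; a guess at the shape, not the paper's code.
import torch.nn as nn

class TwoHeadDecoderStep(nn.Module):
    def __init__(self, d_model=512, vocab_size=10000, num_tags=3):
        super().__init__()
        self.word_head = nn.Linear(d_model, vocab_size)  # next caption word
        self.tag_head = nn.Linear(d_model, num_tags)     # e.g. object / predicate / other

    def forward(self, hidden):
        # hidden: (B, T, d_model) decoder states shared by both tasks
        return self.word_head(hidden), self.tag_head(hidden)
```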
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.