Exploring Explicit and Implicit Visual Relationships for Image
Captioning
- URL: http://arxiv.org/abs/2105.02391v1
- Date: Thu, 6 May 2021 01:47:51 GMT
- Title: Exploring Explicit and Implicit Visual Relationships for Image
Captioning
- Authors: Zeliang Song, Xiaofei Zhou
- Abstract summary: In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit a gated graph convolutional network (Gated GCN) to selectively aggregate information from local neighbors.
Implicitly, we capture global interactions among the detected objects through region-based bidirectional encoder representations from transformers (Region BERT).
- Score: 11.82805641934772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image captioning, which aims to automatically generate textual
sentences for an image, is one of the most challenging tasks in AI. Recent
methods for image captioning follow an encoder-decoder framework that
transforms the sequence of salient regions in an image into natural language
descriptions. However, these models usually lack a comprehensive understanding
of the contextual interactions reflected in the various visual relationships
between objects. In this paper, we explore explicit and implicit visual
relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit a gated
graph convolutional network (Gated GCN) to selectively aggregate information
from local neighbors. Implicitly, we capture global interactions among the
detected objects through region-based bidirectional encoder representations
from transformers (Region BERT), without extra relational annotations. To
evaluate the effectiveness of the proposed method, we conduct extensive
experiments on the Microsoft COCO benchmark and achieve notable improvements
over strong baselines.
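
Below is a minimal sketch of the two encoder branches described in the abstract, assuming PyTorch and region features from an off-the-shelf object detector. The exact gating form, layer sizes, graph construction, and the simple additive fusion of the two branches are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): detector region features are
# refined along two branches -- a gated GCN over an explicit semantic graph,
# and a BERT-style transformer encoder over all regions (implicit relations).
import torch
import torch.nn as nn


class GatedGCNLayer(nn.Module):
    """One gated graph-convolution step over the explicit semantic graph:
    each region aggregates its neighbors' features, modulated by a learned
    sigmoid gate (the exact gating form here is an assumption)."""

    def __init__(self, dim):
        super().__init__()
        self.neighbor_proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x, adj):
        # x:   (num_regions, dim) detector region features
        # adj: (num_regions, num_regions) 0/1 adjacency of the semantic graph
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)
        agg = (adj @ self.neighbor_proj(x)) / deg        # mean over neighbors
        gate = torch.sigmoid(self.gate(torch.cat([x, agg], dim=-1)))
        return x + gate * agg                            # gated residual update


class RegionBERT(nn.Module):
    """Implicit branch: a transformer encoder over region features, so every
    detected object attends to every other object without relation labels."""

    def __init__(self, dim, num_layers=3, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, x):
        # x: (batch, num_regions, dim)
        return self.encoder(x)


if __name__ == "__main__":
    dim, n = 512, 36                        # e.g. 36 detected regions per image
    regions = torch.randn(n, dim)
    adj = (torch.rand(n, n) > 0.8).float()  # stand-in for the semantic graph
    explicit = GatedGCNLayer(dim)(regions, adj)
    implicit = RegionBERT(dim)(regions.unsqueeze(0)).squeeze(0)
    fused = explicit + implicit             # one simple way to combine branches
    print(fused.shape)                      # torch.Size([36, 512])
```

In this sketch the explicit branch only mixes information along edges of the semantic graph, while the Region BERT branch lets every region attend to every other region; how the two representations feed the caption decoder is not reproduced here.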
Related papers
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach to developing image captioning models that use an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component based on visual similarities.
We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions (a minimal sketch of such a kNN retrieval step appears after this list).
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Synchronizing Vision and Language: Bidirectional Token-Masking AutoEncoder for Referring Image Segmentation [26.262887028563163]
Referring Image Segmentation (RIS) aims to segment target objects expressed in natural language within a scene at the pixel level.
We propose a novel bidirectional token-masking autoencoder (BTMAE) inspired by the masked autoencoder (MAE).
BTMAE learns the context of image-to-language and language-to-image by reconstructing missing features in both image and language features at the token level.
arXiv Detail & Related papers (2023-11-29T07:33:38Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, the proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Exploring Semantic Relationships for Unpaired Image Captioning [40.401322131624866]
We achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information.
We propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.
The proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%.
arXiv Detail & Related papers (2021-06-20T09:10:11Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Exploring and Distilling Cross-Modal Information for Image Captioning [47.62261144821135]
We argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest.
Based on the Transformer, we propose the Global-and-Local Information Exploring-and-Distilling approach that explores and distills the source information in vision and language.
Our Transformer-based model achieves a CIDEr score of 129.3 in offline evaluation on the COCO test set, while remaining efficient in terms of speed and parameter budget.
arXiv Detail & Related papers (2020-02-28T07:46:48Z)
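
As referenced in the retrieval-augmented entry above, the core of that approach is fetching captions of visually similar images from an external memory. The following is a minimal sketch of such a kNN retrieval step, assuming NumPy, cosine similarity over global image features, and toy stand-in data; the feature extractor, the memory contents, and how the retrieved captions condition the caption decoder are illustrative assumptions, not that paper's implementation.

```python
# Illustrative sketch: retrieve the captions attached to the k most visually
# similar memory images, to be used as additional context for generation.
import numpy as np


def knn_retrieve_captions(query_feat, memory_feats, memory_captions, k=5):
    """Rank memory images by cosine similarity to the query image feature
    and return the captions of the top-k matches with their scores."""
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    m = memory_feats / (np.linalg.norm(memory_feats, axis=1, keepdims=True) + 1e-8)
    sims = m @ q                                   # (num_memory,)
    top = np.argsort(-sims)[:k]
    return [memory_captions[i] for i in top], sims[top]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    memory_feats = rng.normal(size=(100, 512))     # stand-in global image features
    memory_captions = [f"caption {i}" for i in range(100)]
    query_feat = rng.normal(size=512)
    caps, scores = knn_retrieve_captions(query_feat, memory_feats, memory_captions, k=3)
    print(caps)
    print(scores)
```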