Comprehending and Ordering Semantics for Image Captioning
- URL: http://arxiv.org/abs/2206.06930v1
- Date: Tue, 14 Jun 2022 15:51:14 GMT
- Title: Comprehending and Ordering Semantics for Image Captioning
- Authors: Yehao Li and Yingwei Pan and Ting Yao and Tao Mei
- Abstract summary: We propose a new Transformer-style structure, namely Comprehending and Ordering Semantics Networks (COS-Net).
COS-Net unifies an enriched semantic comprehending process and a learnable semantic ordering process in a single architecture.
- Score: 124.48670699658649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Comprehending the rich semantics in an image and arranging them in linguistic order are essential to composing a visually grounded and linguistically coherent description for image captioning. Modern techniques commonly capitalize on a pre-trained object detector/classifier to mine the semantics in an image, while leaving the inherent linguistic ordering of semantics under-exploited. In this paper, we propose a new Transformer-style structure, namely Comprehending and Ordering Semantics Networks (COS-Net), which unifies an enriched semantic comprehending process and a learnable semantic ordering process in a single architecture. Technically, we first utilize a cross-modal retrieval model to search for sentences relevant to each image, and all words in the retrieved sentences are taken as primary semantic cues. Next, a novel semantic comprehender is devised to filter out irrelevant semantic words from the primary semantic cues, while inferring missing relevant semantic words that are visually grounded in the image. After that, we feed the screened and enriched semantic words into a semantic ranker, which learns to allocate the semantic words in a human-like linguistic order. The resulting sequence of ordered semantic words is further integrated with the visual tokens of the image to trigger sentence generation. Empirical evidence shows that COS-Net clearly surpasses state-of-the-art approaches on COCO and achieves the best-to-date CIDEr score of 141.1% on the Karpathy test split. Source code is available at
\url{https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/cosnet}.
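The pipeline described in the abstract (retrieve candidate semantic words, screen and enrich them against the image, order them, then fuse the result with visual tokens for decoding) can be outlined roughly as follows. This is a minimal sketch assuming PyTorch; the module names, shapes, and the soft filtering/ordering used here are illustrative assumptions and do not reproduce the released xmodaler implementation (the inference of missing words, in particular, is not modeled).

```python
import torch
import torch.nn as nn

class CosNetSketch(nn.Module):
    """Rough outline of the comprehend-then-order flow; not the official model."""
    def __init__(self, vocab_size, d_model=512, nhead=8):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, d_model)
        # Semantic comprehender: cross-attends the retrieved words to the visual
        # tokens so that each candidate word becomes visually grounded.
        self.comprehender = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.relevance = nn.Linear(d_model, 1)   # gate to screen out irrelevant words
        self.ranker = nn.Linear(d_model, 1)      # score used to order the kept words

    def forward(self, retrieved_word_ids, visual_tokens):
        # retrieved_word_ids: (B, N) ids of words mined via cross-modal retrieval
        # visual_tokens:      (B, M, d_model) grid/region features of the image
        words = self.word_embed(retrieved_word_ids)           # (B, N, d_model)
        grounded = self.comprehender(words, visual_tokens)    # visually grounded word states
        keep = torch.sigmoid(self.relevance(grounded))        # soft filtering of irrelevant words
        screened = grounded * keep
        order_scores = self.ranker(screened).squeeze(-1)      # (B, N)
        order = order_scores.argsort(dim=-1)                  # learned linguistic order
        ordered = torch.gather(screened, 1, order.unsqueeze(-1).expand_as(screened))
        # The ordered semantic tokens would then be concatenated with the visual
        # tokens and fed to a Transformer caption decoder (omitted here).
        return torch.cat([ordered, visual_tokens], dim=1)
```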
Related papers
- Towards Image Semantics and Syntax Sequence Learning [8.033697392628424]
We introduce the concept of "image grammar", consisting of "image semantics" and "image syntax".
We propose a weakly supervised two-stage approach to learn the image grammar relative to a class of visual objects/scenes.
Our framework is trained to reason over patch semantics and detect faulty syntax.
arXiv Detail & Related papers (2024-01-31T00:16:02Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- Edge Guided GANs with Multi-Scale Contrastive Learning for Semantic Image Synthesis [139.2216271759332]
We propose a novel ECGAN for the challenging semantic image synthesis task.
The semantic labels do not provide detailed structural information, making it challenging to synthesize local details and structures.
The widely adopted CNN operations such as convolution, down-sampling, and normalization usually cause spatial resolution loss.
We propose a novel contrastive learning method, which aims to enforce pixel embeddings belonging to the same semantic class to generate more similar image content.
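One common way to realize the contrastive objective mentioned in the last sentence is an InfoNCE-style supervised contrastive loss over sampled pixel embeddings. The following is a hedged sketch under that assumption; the function name, sampling strategy, and temperature are illustrative and not ECGAN's exact formulation.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1, max_pixels=1024):
    """embeddings: (B, C, H, W) pixel features; labels: (B, H, W) semantic class ids."""
    B, C, H, W = embeddings.shape
    feats = embeddings.permute(0, 2, 3, 1).reshape(-1, C)          # (B*H*W, C)
    labs = labels.reshape(-1)                                      # (B*H*W,)
    # Subsample pixels so the pairwise similarity matrix stays small.
    idx = torch.randperm(feats.size(0), device=feats.device)[:max_pixels]
    feats = F.normalize(feats[idx], dim=-1)
    labs = labs[idx]
    sim = feats @ feats.t() / temperature                          # (P, P) similarities
    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Positives are other pixels that carry the same semantic label.
    pos = ((labs[:, None] == labs[None, :]) & ~self_mask).float()
    # InfoNCE-style log-probabilities, excluding self-similarity from the denominator.
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float('-inf')),
                                     dim=1, keepdim=True)
    denom = pos.sum(1).clamp(min=1)
    return -((log_prob * pos).sum(1) / denom).mean()
```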
arXiv Detail & Related papers (2023-07-22T14:17:19Z)
- Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
arXiv Detail & Related papers (2023-06-01T17:19:43Z)
- Towards Semantic Communications: Deep Learning-Based Image Semantic Coding [42.453963827153856]
We conceive semantic communications for image data, which is much richer in semantics and more bandwidth sensitive.
We propose a reinforcement learning-based adaptive semantic coding (RL-ASC) approach that encodes images beyond the pixel level.
Experimental results demonstrate that the proposed RL-ASC is noise-robust and can reconstruct visually pleasing and semantically consistent images.
arXiv Detail & Related papers (2022-08-08T12:29:55Z)
- HIRL: A General Framework for Hierarchical Image Representation Learning [54.12773508883117]
We propose a general framework for Hierarchical Image Representation Learning (HIRL).
This framework aims to learn multiple semantic representations for each image, and these representations are structured to encode image semantics from fine-grained to coarse-grained.
Based on a probabilistic factorization, HIRL learns the most fine-grained semantics by an off-the-shelf image SSL approach and learns multiple coarse-grained semantics by a novel semantic path discrimination scheme.
arXiv Detail & Related papers (2022-05-26T05:13:26Z)
- Adaptive Semantic-Visual Tree for Hierarchical Embeddings [67.01307058209709]
We propose a hierarchical adaptive semantic-visual tree to depict the architecture of merchandise categories.
The tree evaluates semantic similarities between different semantic levels and visual similarities within the same semantic class simultaneously.
At each level, we set different margins based on the semantic hierarchy and incorporate them as prior information to learn a fine-grained feature embedding.
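A minimal sketch of how such level-dependent margins could enter a triplet-style embedding loss is given below; the margin schedule, the notion of level distance, and the function name are illustrative assumptions rather than the paper's exact objective.

```python
import torch.nn.functional as F

def hierarchical_triplet_loss(anchor, positive, negative, level_distance,
                              base_margin=0.2, margin_step=0.1):
    """anchor/positive/negative: (B, D) embeddings; level_distance: (B,) number of
    tree levels one must climb before anchor and negative share an ancestor."""
    # Negatives that diverge higher up the semantic tree get a larger margin.
    margin = base_margin + margin_step * level_distance.float()
    d_pos = F.pairwise_distance(anchor, positive)   # (B,)
    d_neg = F.pairwise_distance(anchor, negative)   # (B,)
    return F.relu(d_pos - d_neg + margin).mean()
```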
arXiv Detail & Related papers (2020-03-08T03:36:42Z)