CAPTION: Correction by Analyses, POS-Tagging and Interpretation of
Objects using only Nouns
- URL: http://arxiv.org/abs/2010.00839v1
- Date: Fri, 2 Oct 2020 08:06:42 GMT
- Title: CAPTION: Correction by Analyses, POS-Tagging and Interpretation of
Objects using only Nouns
- Authors: Leonardo Anjoletto Ferreira, Douglas De Rizzo Meneghetti, Paulo
Eduardo Santos
- Abstract summary: This work proposes a combination of Deep Learning methods for object detection and natural language processing to validate image captions.
We test our method on the FOIL-COCO data set, since it provides correct and incorrect captions for various images using only objects represented in the MS-COCO image data set.
- Score: 1.4502611532302039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Deep Learning (DL) methods have shown excellent performance in
image captioning and visual question answering. However, despite their
performance, DL methods do not learn the semantics of the words used to
describe a scene, making it difficult to spot incorrect words in captions or
to interchange words that have similar meanings. This work proposes a
combination of DL methods for object detection and natural language
processing to validate image captions. We test our method on the FOIL-COCO
data set, since it provides correct and incorrect captions for various images
using only objects represented in the MS-COCO image data set. Results show
that our method achieves good overall performance, in some cases comparable
to human performance.
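The abstract only sketches the idea at a high level: POS-tag the caption, keep its nouns, and compare them against the objects a detector finds in the image. Below is a minimal illustrative sketch of that idea, assuming spaCy for POS-tagging and a COCO-pretrained torchvision Faster R-CNN for detection; the matching rule, threshold, and model choices are assumptions for illustration, not the authors' actual pipeline.

```python
import spacy
import torch
from PIL import Image
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

# POS tagger for extracting nouns from captions (assumed choice: spaCy's small English model).
nlp = spacy.load("en_core_web_sm")

# COCO-pretrained object detector (assumed choice: torchvision's Faster R-CNN).
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
detector = fasterrcnn_resnet50_fpn(weights=weights).eval()
coco_classes = weights.meta["categories"]  # MS-COCO class names


def caption_nouns(caption: str) -> set:
    """POS-tag the caption and keep only its lemmatized nouns."""
    return {tok.lemma_.lower() for tok in nlp(caption) if tok.pos_ == "NOUN"}


def detected_objects(image_path: str, score_thr: float = 0.7) -> set:
    """Return MS-COCO class names detected in the image above a confidence threshold."""
    img = weights.transforms()(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        pred = detector([img])[0]
    return {
        coco_classes[int(label)]
        for label, score in zip(pred["labels"], pred["scores"])
        if score >= score_thr
    }


def caption_is_consistent(caption: str, image_path: str) -> bool:
    """Flag the caption as incorrect if it names a COCO class the detector did not find."""
    objects = detected_objects(image_path)
    return all(
        noun in objects or noun not in coco_classes
        for noun in caption_nouns(caption)
    )
```

Since FOIL-COCO builds incorrect captions by swapping a single noun for another MS-COCO object class, a consistency check of this kind only needs to spot one caption noun whose class is absent from the detections.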
Related papers
- Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval [53.89454443114146]
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which retrieves a target image given a reference image and a description, without training on triplet datasets.
Previous works generate pseudo-word tokens by projecting the reference image features into the text embedding space.
We propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs).
KEDs implicitly models the attributes of the reference images by incorporating a database.
arXiv Detail & Related papers (2024-03-24T04:23:56Z)
- TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, requiring no additional data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments show an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Scene Text Recognition with Image-Text Matching-guided Dictionary [17.073688809336456]
We propose a new dictionary language model leveraging the Scene Image-Text Matching (SITM) network.
Inspired by ITC, the SITM network combines the visual features and the text features of all candidates to identify the candidate with the minimum distance in the feature space.
Our lexicon method achieves better results (93.8% accuracy) than the ordinary method (92.1% accuracy) on six mainstream benchmarks.
arXiv Detail & Related papers (2023-05-08T07:47:49Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines the implicit contextual knowledge behind scene text images.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state of the art by 3.72% mAP and 5.39% mAP, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- Tell me what you see: A zero-shot action recognition method based on natural language descriptions [3.136605193634262]
We propose using video captioning methods to extract semantic information from videos.
To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
We build a shared semantic space employing BERT-based embedders pre-trained on the paraphrasing task over multiple text datasets.
arXiv Detail & Related papers (2021-12-18T17:44:07Z)
- MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
MOC-GAN is proposed to mix inputs from the two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z)
- Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning [37.14912430046118]
Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs.
We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions.
arXiv Detail & Related papers (2021-04-28T16:36:52Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions that contain semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)