Visual Semantic Relatedness Dataset for Image Captioning
- URL: http://arxiv.org/abs/2301.08784v2
- Date: Sun, 30 Apr 2023 20:23:09 GMT
- Title: Visual Semantic Relatedness Dataset for Image Captioning
- Authors: Ahmed Sabir, Francesc Moreno-Noguer, Lluís Padró
- Abstract summary: We propose a textual visual context dataset for captioning, in which the dataset COCO Captions has been extended with information about the scene.
This information can be used to incorporate any NLP task, such as text similarity or semantic relatedness methods, into captioning systems.
- Score: 27.788077963411624
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern image captioning systems rely heavily on extracting knowledge from
images to capture the concept of a static story. In this paper, we propose a
textual visual context dataset for captioning, in which the publicly available
dataset COCO Captions (Lin et al., 2014) has been extended with information
about the scene (such as objects in the image). Since this information has a
textual form, it can be used to incorporate any NLP task, such as text similarity
or semantic relatedness methods, into captioning systems, either as an end-to-end
training strategy or a post-processing based approach.
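As a rough sketch of the post-processing route mentioned above, the example below re-ranks candidate captions by their cosine similarity to an image's textual visual context. The record layout, field names, and the all-MiniLM-L6-v2 sentence encoder are assumptions made for this illustration; they are not the released dataset schema or the authors' pipeline.

```python
# Hypothetical example: re-rank caption candidates with textual visual context.
# The record format below is an assumption for illustration, not the dataset's
# actual schema; any off-the-shelf text-similarity model could replace the encoder.
from sentence_transformers import SentenceTransformer, util

record = {
    "image_id": 42,
    "visual_context": "dog frisbee grass park",   # detected objects / scene, as text
    "candidates": [                               # e.g. beam-search caption hypotheses
        "a dog jumping to catch a frisbee in a park",
        "a man riding a surfboard on a wave",
        "a brown dog running across a grassy field",
    ],
}

model = SentenceTransformer("all-MiniLM-L6-v2")

ctx_emb = model.encode(record["visual_context"], convert_to_tensor=True)
cand_emb = model.encode(record["candidates"], convert_to_tensor=True)

# Cosine similarity between each candidate and the visual context serves as a
# semantic-relatedness score and simply re-orders the candidate list here.
scores = util.cos_sim(ctx_emb, cand_emb)[0]
reranked = sorted(zip(record["candidates"], scores.tolist()),
                  key=lambda pair: pair[1], reverse=True)
for caption, score in reranked:
    print(f"{score:.3f}  {caption}")
```

Because the visual context is plain text, the same scoring idea could also be folded into training as an auxiliary signal rather than applied only after decoding.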
Related papers
- What Makes for Good Image Captions? [50.48589893443939]
Our framework posits that good image captions should balance three key aspects: informational sufficiency, minimal redundancy, and human comprehensibility.
We introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information.
arXiv Detail & Related papers (2024-05-01T12:49:57Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- CapText: Large Language Model-based Caption Generation From Image Context and Description [0.0]
We propose and evaluate a new approach to generate captions from textual descriptions and context alone.
Our approach outperforms current state-of-the-art image-text alignment models such as OSCAR-VinVL on this task, as measured by the CIDEr metric.
arXiv Detail & Related papers (2023-06-01T02:40:44Z)
- Generating image captions with external encyclopedic knowledge [1.452875650827562]
We create an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data.
Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base.
Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions.
arXiv Detail & Related papers (2022-10-10T16:09:21Z)
- Towards Multimodal Vision-Language Models Generating Non-Generic Text [2.102846336724103]
Vision-language models can assess visual context in an image and generate descriptive text.
Recent work has used optical character recognition to supplement visual information with text extracted from an image.
In this work, we contend that vision-language models can benefit from additional information that can be extracted from an image but is not used by current models.
arXiv Detail & Related papers (2022-07-09T01:56:35Z)
- Knowledge Mining with Scene Text for Fine-Grained Recognition [53.74297368412834]
We propose an end-to-end trainable network that mines the implicit contextual knowledge behind scene text in an image.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state of the art by 3.72% and 5.39% mAP on the two evaluated benchmarks, respectively.
arXiv Detail & Related papers (2022-03-27T05:54:00Z)
- Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the desired content is located.
In this paper, we describe an image retrieval setup in which the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z)
- Textual Visual Semantic Dataset for Text Spotting [27.788077963411624]
Text Spotting in the wild consists of detecting and recognizing text appearing in images.
This is a challenging problem due to the complexity of the context where texts appear.
We propose a visual context dataset for Text Spotting in the wild.
arXiv Detail & Related papers (2020-04-21T23:58:16Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.