#PraCegoVer: A Large Dataset for Image Captioning in Portuguese
- URL: http://arxiv.org/abs/2103.11474v1
- Date: Sun, 21 Mar 2021 19:55:46 GMT
- Title: #PraCegoVer: A Large Dataset for Image Captioning in Portuguese
- Authors: Gabriel Oliveira dos Santos and Esther Luna Colombini and Sandra Avila
- Abstract summary: #PraCegoVer is the first large dataset for image captioning in Portuguese with freely annotated images.
A movement called PraCegoVer arose on the Internet, encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content.
- Score: 6.890235464357029
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatically describing images using natural sentences is an important task to support the inclusion of visually impaired people on the Internet. It remains a significant challenge that requires understanding the relations among the objects present in the image, their attributes, and the actions they are involved in. Thus, visual interpretation methods are needed, but linguistic models are also necessary to describe these semantic relations verbally. This problem is known as Image Captioning. Although many datasets have been proposed in the literature, most contain only English captions, whereas datasets with captions in other languages are scarce. Recently, a movement called PraCegoVer arose on the Internet, encouraging social media users to publish images, tag them with #PraCegoVer, and add a short description of their content. Inspired by this movement, we propose #PraCegoVer, a multi-modal dataset with Portuguese captions based on posts from Instagram. It is the first large dataset for image captioning in Portuguese with freely annotated images. Furthermore, the captions in our dataset bring additional challenges to the problem: first, in contrast to popular datasets such as MS COCO Captions, #PraCegoVer has only one reference caption per image; second, both the mean and the variance of our reference sentence lengths are significantly greater than those of MS COCO Captions. These two characteristics make our dataset interesting from a linguistic standpoint and add challenges to the image captioning problem. We publicly share the dataset at https://github.com/gabrielsantosrv/PraCegoVer.
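To illustrate the single-reference and caption-length characteristics described above, below is a minimal Python sketch for computing caption-length statistics over an image-caption dataset. The file name `dataset.json` and the field name `caption` are assumptions made for illustration only; refer to the repository above for the actual format of the released data.

```python
import json
import statistics

# Minimal sketch (not the authors' code): compute caption-length statistics
# for a single-reference captioning dataset. The file name "dataset.json" and
# the field name "caption" are assumptions for illustration; check the
# repository for the actual layout of the released data.
with open("dataset.json", encoding="utf-8") as f:
    entries = json.load(f)  # expected: a list of records, each with a "caption" field

# One reference caption per image, so one length per entry.
lengths = [len(entry["caption"].split()) for entry in entries]

print(f"images:              {len(lengths)}")
print(f"mean caption length: {statistics.mean(lengths):.1f} words")
print(f"length variance:     {statistics.variance(lengths):.1f}")
```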
Related papers
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to do the task.
We conduct a human evaluation of translated images to assess for cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z) - Satellite Captioning: Large Language Models to Augment Labeling [0.0]
Caption datasets present a much more difficult challenge due to language differences, grammar, and the time it takes for humans to generate them.
Current datasets have certainly provided many instances to work with, but it becomes problematic when a captioner may have a more limited vocabulary.
This paper aims to address this issue of potential information and communication shortcomings in caption datasets.
arXiv Detail & Related papers (2023-12-18T03:21:58Z) - C-CLIP: Contrastive Image-Text Encoders to Close the
Descriptive-Commentative Gap [0.5439020425819]
The interplay between the image and the comment on a social media post is of high importance for understanding its overall message.
Recent strides in multimodal embedding models, namely CLIP, have provided an avenue forward in relating image and text.
The current training regime for CLIP models is insufficient for matching content found on social media, regardless of site or language.
We show that training contrastive image-text encoders on explicitly commentative pairs results in large improvements in retrieval results.
arXiv Detail & Related papers (2023-09-06T19:03:49Z) - Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z) - Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z) - NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z) - BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [0.5893124686141781]
Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset, BAN-Cap, built on the widely used Flickr8k dataset, for which we collect Bangla captions of the images written by qualified annotators.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
arXiv Detail & Related papers (2022-05-28T15:39:09Z) - Who's Waldo? Linking People Across Text and Images [56.40556801773923]
We present a task and benchmark dataset for person-centric visual grounding.
Our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues.
We propose a Transformer-based method that outperforms several strong baselines on this task.
arXiv Detail & Related papers (2021-08-16T17:36:49Z) - Pragmatic Issue-Sensitive Image Captioning [11.998287522410404]
We propose Issue-Sensitive Image Captioning (ISIC)
In ISIC, a captioning system is given a target image and an issue: a set of images partitioned in a way that specifies what information is relevant.
We show how ISIC can complement and enrich the related task of Visual Question Answering.
arXiv Detail & Related papers (2020-04-29T20:00:53Z) - Transform and Tell: Entity-Aware News Image Captioning [77.4898875082832]
We propose an end-to-end model which generates captions for images embedded in news articles.
We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism.
We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts.
arXiv Detail & Related papers (2020-04-17T05:44:37Z) - Captioning Images Taken by People Who Are Blind [25.263950448575923]
VizWiz-Captions consists of over 39,000 images originating from people who are blind, each paired with five captions.
We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets.
arXiv Detail & Related papers (2020-02-20T04:36:39Z)