Satellite Captioning: Large Language Models to Augment Labeling
- URL: http://arxiv.org/abs/2312.10905v1
- Date: Mon, 18 Dec 2023 03:21:58 GMT
- Title: Satellite Captioning: Large Language Models to Augment Labeling
- Authors: Grant Rosario, David Noever
- Abstract summary: Caption datasets present a much more difficult challenge than image datasets due to language differences, grammar, and the time it takes humans to generate them.
Current datasets provide many instances to work with, but problems arise when a captioner has a limited vocabulary.
This paper addresses these potential information and communication shortcomings in caption datasets.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing capabilities of modern object detection networks and
datasets to train them, it has gotten more straightforward and, importantly,
less laborious to get up and running with a model that is quite adept at
detecting any number of various objects. However, while image datasets for
object detection have grown and continue to proliferate (currently the most
extensive public set, ImageNet, contains over 14M images with over 14M
instances), the same cannot be said for textual caption datasets. While they
have certainly been growing in recent years, caption datasets present a much
more difficult challenge due to language differences, grammar, and the time it
takes for humans to generate them. Current datasets provide many instances to
work with, but problems arise when a captioner has a limited vocabulary, is
not adequately fluent in the language, or makes simple grammatical mistakes.
These difficulties are compounded when the images become more specialized,
such as remote sensing images.
This paper aims to address this issue of potential information and
communication shortcomings in caption datasets. To provide a more precise
analysis, we restrict our image domain to the remote sensing images of the
RSICD dataset and experiment with the captions provided therein. Our findings
indicate that ChatGPT grammar correction is a simple and effective way to
increase the performance accuracy of caption models by making data captions
more diverse and grammatically correct.
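The paper does not include code, but the correction step it describes (passing each training caption through ChatGPT with a grammar-fixing instruction) is straightforward to sketch. Below is a minimal, hypothetical Python example; the `openai` chat-completions call is a standard API, but the model name, prompt wording, and the Karpathy-style RSICD annotation layout (`images`/`sentences`/`raw`) are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch: grammar-correcting RSICD captions with an LLM.
# Assumes the `openai` Python client and an API key in OPENAI_API_KEY.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def correct_caption(caption: str) -> str:
    """Ask the model to fix grammar while preserving the caption's content."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the ChatGPT model the paper used
        messages=[
            {"role": "system",
             "content": ("Correct the grammar of this remote sensing image "
                         "caption. Do not add or remove factual content.")},
            {"role": "user", "content": caption},
        ],
        temperature=0,  # deterministic, conservative rewrites
    )
    return response.choices[0].message.content.strip()


def correct_annotations(in_path: str, out_path: str) -> None:
    # Assumes a Karpathy-style layout, as distributed with RSICD:
    # {"images": [{"sentences": [{"raw": "..."}, ...]}, ...]}
    with open(in_path) as f:
        data = json.load(f)
    for image in data["images"]:
        for sentence in image["sentences"]:
            sentence["raw"] = correct_caption(sentence["raw"])
    with open(out_path, "w") as f:
        json.dump(data, f)


if __name__ == "__main__":
    # A typical RSICD caption with the kind of grammar issues the paper targets.
    print(correct_caption("many green trees and some building are in two sides of a road"))
```

The corrected annotation file can then be swapped in for the original when training a captioning model; temperature 0 keeps the rewrites conservative, so the model fixes grammar rather than paraphrasing the scene description.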
Related papers
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
Long text understanding is in great demand for language-image pre-training models.
We relabel the data with long captions; however, training directly on them can degrade performance on short text understanding.
We validate the effectiveness of our approach using a self-constructed large-scale dataset.
Notably, on the task of long-text image retrieval, we beat the competitor that uses long captions, achieving an 11.1% improvement.
arXiv Detail & Related papers (2024-10-07T17:52:56Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z) - Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z) - Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z) - Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
arXiv Detail & Related papers (2023-05-25T17:43:05Z) - NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z) - BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [0.5893124686141781]
Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset, BAN-Cap, following the widely used Flickr8k dataset, with Bangla captions of the images provided by qualified annotators.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
arXiv Detail & Related papers (2022-05-28T15:39:09Z) - Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm [0.0]
This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind when compared to the English language.
arXiv Detail & Related papers (2022-02-11T06:29:25Z) - #PraCegoVer: A Large Dataset for Image Captioning in Portuguese [6.890235464357029]
#PraCegoVer is the first large dataset for image captioning in Portuguese with freely annotated images.
A movement called PraCegoVer arose on the Internet, encouraging social media users to publish images, tag them #PraCegoVer, and add a short description of their content.
arXiv Detail & Related papers (2021-03-21T19:55:46Z) - Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures [0.0]
Small satellite constellations provide daily global coverage of the earth's landmass.
Extracting text annotations from raw pixels requires two dependent machine learning models.
We evaluate seven models on the previously largest benchmark for satellite image captions.
arXiv Detail & Related papers (2020-01-03T20:41:18Z)