Satellite Captioning: Large Language Models to Augment Labeling
- URL: http://arxiv.org/abs/2312.10905v1
- Date: Mon, 18 Dec 2023 03:21:58 GMT
- Title: Satellite Captioning: Large Language Models to Augment Labeling
- Authors: Grant Rosario, David Noever
- Abstract summary: Caption datasets present a much more difficult challenge than image datasets due to language differences, grammar, and the time it takes humans to generate them.
Current datasets provide many instances to work with, but problems arise when a captioner has a limited vocabulary.
This paper addresses these potential information and communication shortcomings in caption datasets.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing capabilities of modern object detection networks and
datasets to train them, it has gotten more straightforward and, importantly,
less laborious to get up and running with a model that is quite adept at
detecting any number of various objects. However, while image datasets for
object detection have grown and continue to proliferate (currently the most
extensive public set, ImageNet, contains over 14M images with over 14M
instances), the same cannot be said for textual caption datasets. While they
have certainly been growing in recent years, caption datasets present a much
more difficult challenge due to language differences, grammar, and the time it
takes for humans to generate them. Current datasets provide many instances to
work with, but problems arise when a captioner has a limited vocabulary, is
not adequately fluent in the language, or makes simple grammatical mistakes.
These difficulties are compounded when the images become more specialized,
such as remote sensing images.
This paper aims to address this issue of potential information and
communication shortcomings in caption datasets. To provide a more precise
analysis, we restrict our image domain to the remote sensing images of the
RSICD dataset and experiment with the captions provided therein. Our findings
indicate that ChatGPT grammar correction is a simple and effective way to
increase the performance accuracy of caption models by making data captions
more diverse and grammatically correct.
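The paper does not include code, but the correction step it describes (passing each training caption through ChatGPT with a grammar-fixing instruction) is straightforward to sketch. Below is a minimal, hypothetical Python example; the `openai` chat-completions call is a standard API, but the model name, prompt wording, and the Karpathy-style RSICD annotation layout (`images`/`sentences`/`raw`) are illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch: grammar-correcting RSICD captions with an LLM.
# Assumes the `openai` Python client and an API key in OPENAI_API_KEY.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def correct_caption(caption: str) -> str:
    """Ask the model to fix grammar while preserving the caption's content."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the ChatGPT model the paper used
        messages=[
            {"role": "system",
             "content": ("Correct the grammar of this remote sensing image "
                         "caption. Do not add or remove factual content.")},
            {"role": "user", "content": caption},
        ],
        temperature=0,  # deterministic, conservative rewrites
    )
    return response.choices[0].message.content.strip()


def correct_annotations(in_path: str, out_path: str) -> None:
    # Assumes a Karpathy-style layout, as distributed with RSICD:
    # {"images": [{"sentences": [{"raw": "..."}, ...]}, ...]}
    with open(in_path) as f:
        data = json.load(f)
    for image in data["images"]:
        for sentence in image["sentences"]:
            sentence["raw"] = correct_caption(sentence["raw"])
    with open(out_path, "w") as f:
        json.dump(data, f)


if __name__ == "__main__":
    # A typical RSICD caption with the kind of grammar issues the paper targets.
    print(correct_caption("many green trees and some building are in two sides of a road"))
```

The corrected annotation file can then be swapped in for the original when training a captioning model; temperature 0 keeps the rewrites conservative, so the model fixes grammar rather than paraphrasing the scene description.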
Related papers
- LoTLIP: Improving Language-Image Pre-training for Long Text Understanding [71.04947115945349]
Long text understanding is in great demand for language-image pre-training models.
We relabel the data with long captions; however, training directly on them can degrade performance on short text understanding.
We validate the effectiveness of our approach using a self-constructed large-scale dataset.
Notably, on the task of long-text image retrieval, we beat the competitor that uses long captions, achieving an 11.1% improvement.
arXiv Detail & Related papers (2024-10-07T17:52:56Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z) - Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z) - Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z) - Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
arXiv Detail & Related papers (2023-05-25T17:43:05Z) - NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z) - BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [0.5893124686141781]
Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset, BAN-Cap, following the widely used Flickr8k dataset, with Bangla captions of the images provided by qualified annotators.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
arXiv Detail & Related papers (2022-05-28T15:39:09Z) - Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm [0.0]
This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind when compared to the English language.
arXiv Detail & Related papers (2022-02-11T06:29:25Z) - #PraCegoVer: A Large Dataset for Image Captioning in Portuguese [6.890235464357029]
#PraCegoVer is the first large dataset for image captioning in Portuguese with freely annotated images.
A movement called PraCegoVer arose on the Internet, encouraging social media users to publish images, tag them #PraCegoVer, and add a short description of their content.
arXiv Detail & Related papers (2021-03-21T19:55:46Z) - Discoverability in Satellite Imagery: A Good Sentence is Worth a Thousand Pictures [0.0]
Small satellite constellations provide daily global coverage of the earth's landmass.
Extracting text annotations from raw pixels requires two dependent machine learning models.
We evaluate seven models on the previously largest benchmark for satellite image captions.
arXiv Detail & Related papers (2020-01-03T20:41:18Z)