Discoverability in Satellite Imagery: A Good Sentence is Worth a
Thousand Pictures
- URL: http://arxiv.org/abs/2001.05839v1
- Date: Fri, 3 Jan 2020 20:41:18 GMT
- Title: Discoverability in Satellite Imagery: A Good Sentence is Worth a
Thousand Pictures
- Authors: David Noever, Wes Regian, Matt Ciolino, Josh Kalin, Dom Hambrick, Kaye
Blankenship
- Abstract summary: Small satellite constellations provide daily global coverage of the earth's landmass.
Extracting text annotations from raw pixels requires two dependent machine learning models.
We evaluate seven models on the previously largest benchmark for satellite image captions.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Small satellite constellations provide daily global coverage of the earth's
landmass, but image enrichment relies on automating key tasks like change
detection or feature searches. For example, extracting text annotations from
raw pixels requires two dependent machine learning models, one to analyze the
overhead image and the other to generate a descriptive caption. We evaluate
seven models on the previously largest benchmark for satellite image captions.
We extend the labeled image samples five-fold, then augment, correct and prune
the vocabulary to approach a rough min-max (minimum word, maximum description).
This outcome compares favorably to previous work with large pre-trained image
models but offers a hundred-fold reduction in model size without sacrificing
overall accuracy (when measured with log entropy loss). These smaller models
provide new deployment opportunities, particularly when pushed to edge
processors, on-board satellites, or distributed ground stations. To quantify a
caption's descriptiveness, we introduce a novel multi-class confusion or error
matrix to score both human-labeled test data and never-labeled images that
include bounding box detection but lack full sentence captions. This work
suggests future captioning strategies, particularly ones that can enrich the
class coverage beyond land use applications and that lessen color-centered and
adjacency adjectives ("green", "near", "between", etc.). Many modern language
transformers present novel and exploitable models with world knowledge gleaned
from training on their vast online corpora. One simple but interesting example
might learn the word association between wind and waves, thus enriching a beach
scene with more than the color descriptions that could otherwise be extracted
from raw pixels without text annotation.
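To make the confusion-matrix idea concrete, the following minimal Python sketch scores how well generated captions cover a set of per-image class labels (e.g., from human annotations or bounding-box detections). The class list, keyword sets, and helper names (classes_mentioned, confusion_counts) are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch of scoring caption "descriptiveness" with a per-class
# confusion/error matrix. Assumptions (not from the paper): each image carries
# a set of ground-truth classes, and a caption "covers" a class if it contains
# one of that class's keywords.
from collections import Counter

CLASS_KEYWORDS = {           # illustrative vocabulary only
    "airport": {"airport", "runway", "terminal"},
    "beach":   {"beach", "shore", "sand", "sandy", "waves"},
    "forest":  {"forest", "trees", "woodland"},
    "harbor":  {"harbor", "port", "boats", "pier"},
}

def classes_mentioned(caption: str) -> set:
    """Return the classes whose keywords appear in the caption."""
    tokens = set(caption.lower().replace(",", " ").split())
    return {c for c, kw in CLASS_KEYWORDS.items() if tokens & kw}

def confusion_counts(samples):
    """Accumulate per-class TP/FP/FN/TN over (caption, true_classes) pairs."""
    counts = {c: Counter() for c in CLASS_KEYWORDS}
    for caption, true_classes in samples:
        predicted = classes_mentioned(caption)
        for c in CLASS_KEYWORDS:
            if c in true_classes:
                key = "TP" if c in predicted else "FN"
            else:
                key = "FP" if c in predicted else "TN"
            counts[c][key] += 1
    return counts

if __name__ == "__main__":
    data = [
        ("many boats are docked near the pier", {"harbor"}),
        ("green trees beside a sandy beach with waves", {"beach", "forest"}),
    ]
    for cls, c in confusion_counts(data).items():
        print(cls, dict(c))
```

Summing the per-class TP/FP/FN/TN counts yields a multi-class error matrix that can be applied both to human-labeled test captions and to never-labeled images whose only ground truth is bounding-box classes, which is the setting described above.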
Related papers
- Satellite Captioning: Large Language Models to Augment Labeling [0.0]
Caption datasets present a much more difficult challenge due to language differences, grammar, and the time it takes for humans to generate them.
Current datasets have certainly provided many instances to work with, but it becomes problematic when a captioner may have a more limited vocabulary.
This paper aims to address this issue of potential information and communication shortcomings in caption datasets.
arXiv Detail & Related papers (2023-12-18T03:21:58Z)
- Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
- Visually grounded few-shot word learning in low-resource settings [23.826000011632917]
We propose a visually grounded speech model that learns new words and their visual depictions from just a few word-image example pairs.
Our approach involves using the given word-image example pairs to mine new unsupervised word-image training pairs from large collections of unlabelled speech and images.
With this new model, we achieve better performance with fewer shots than previous approaches on an existing English benchmark.
arXiv Detail & Related papers (2023-06-20T08:27:42Z)
- Image Captioning with Multi-Context Synthetic Data [16.961112970612447]
Large models have excelled in producing high-quality images and text.
We present an innovative pipeline that introduces multi-context data generation.
Our model is exclusively trained on synthetic image-text pairs crafted through this process.
arXiv Detail & Related papers (2023-05-29T13:18:59Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those from standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- Image Retrieval from Contextual Descriptions [22.084939474881796]
Image Retrieval from Contextual Descriptions (ImageCoDe) is a benchmark in which models must retrieve the correct image from a set of 10 minimally contrastive candidates based on a contextual description.
The best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 for humans.
arXiv Detail & Related papers (2022-03-29T19:18:12Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
- Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach [84.22327278486846]
We propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence.
Our model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description.
Experiments show that the proposed model achieves promising performance on two large-scale public datasets.
arXiv Detail & Related papers (2020-08-10T15:40:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.