Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations
- URL: http://arxiv.org/abs/2409.07048v1
- Date: Wed, 11 Sep 2024 06:36:08 GMT
- Title: Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations
- Authors: Keumgang Cha, Donggeun Yu, Junghoon Seo
- Abstract summary: This study introduces an approach to curating vision-language datasets by employing an image-decoding machine learning model.
Using this method, we amassed approximately 9.6 million vision-language pairs of very high resolution (VHR) imagery.
The resultant model outperformed counterparts that did not leverage publicly available vision-language datasets.
- Score: 5.065947993017157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prominence of generalized foundation models in vision-language integration has surged, given their multifarious applications. Within the natural-image domain, the procurement of vision-language datasets to construct these foundation models is facilitated by their abundant availability and the ease of web crawling. Conversely, in the remote sensing domain, although vision-language datasets exist, their volume is suboptimal for constructing robust foundation models. This study introduces an approach to curating vision-language datasets by employing an image-decoding machine learning model, negating the need for human-annotated labels. Using this methodology, we amassed approximately 9.6 million vision-language pairs of very high resolution (VHR) imagery. The resultant model outperformed counterparts that did not leverage publicly available vision-language datasets, particularly in downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval. Moreover, in tasks exclusively employing vision encoders, such as linear probing and k-NN classification, our model demonstrated superior efficacy compared to those relying on domain-specific vision-language datasets.
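To make the zero-shot classification protocol concrete, here is a minimal sketch of CLIP-style zero-shot inference over class-name prompts. The `encode_text` stub, the prompt template, and the embedding size are illustrative assumptions, not the paper's actual components.

```python
# Hedged sketch: zero-shot scene classification with a CLIP-style model.
# `encode_text` is a hypothetical stand-in for a real text encoder.
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb: torch.Tensor,
                       class_names: list[str],
                       encode_text) -> int:
    """Return the index of the best-matching class prompt."""
    prompts = [f"a satellite photo of {c}" for c in class_names]
    text_emb = torch.stack([encode_text(p) for p in prompts])  # (C, D)
    # Cosine similarity between the image and every class prompt.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T                            # (C,)
    return int(logits.argmax())

# Toy usage with random embeddings in place of trained encoders.
D = 512
fake_encode_text = lambda p: torch.randn(D)
pred = zero_shot_classify(torch.randn(D),
                          ["airport", "farmland", "harbor"],
                          fake_encode_text)
print(pred)
```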
Related papers
- Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
Methods based on Contrastive Language-Image Pre-training (CLIP) are now extensively used to support vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
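As an illustration of the image-text retrieval evaluation these models are scored on, here is a minimal recall@k sketch over CLIP-style embeddings; the shapes and random tensors are placeholders, not RS-M-CLIP's outputs.

```python
# Hedged sketch of image-text retrieval evaluation with paired embeddings,
# where row i of the image and text matrices form a ground-truth pair.
import torch
import torch.nn.functional as F

def recall_at_k(img_emb: torch.Tensor, txt_emb: torch.Tensor, k: int) -> float:
    """Fraction of texts whose paired image (same row index) is in the top-k."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sims = txt @ img.T                       # (N, N) text-to-image scores
    topk = sims.topk(k, dim=-1).indices      # (N, k) best image ids per text
    targets = torch.arange(len(txt)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

print(recall_at_k(torch.randn(100, 512), torch.randn(100, 512), k=5))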
arXiv Detail & Related papers (2024-10-30T18:13:11Z)
- VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks [6.731844884087068]
We propose VolDoGer: Vision-Language dataset for Domain Generalization.
This dataset addresses three vision-language tasks: image captioning, visual question answering, and visual entailment.
We extend LLM-based data annotation techniques to vision-language tasks, thereby alleviating the burden of recruiting human annotators.
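A minimal sketch of what LLM-assisted annotation of this kind can look like: `llm` is a hypothetical callable wrapping any instruction-following model, and the prompt templates are illustrative rather than the authors'.

```python
# Hedged sketch of LLM-assisted dataset construction: turn an existing
# caption into a synthetic VQA example without human annotators.
def annotate_vqa(llm, caption: str) -> dict:
    """Generate a (question, answer) pair grounded in a caption."""
    question = llm(f"Write one question answerable from this caption:\n{caption}")
    answer = llm(f"Caption: {caption}\nQuestion: {question}\nAnswer briefly:")
    return {"caption": caption, "question": question, "answer": answer}

# Toy usage with a stub LLM.
stub_llm = lambda prompt: "stub response"
print(annotate_vqa(stub_llm, "A tractor plows a field beside a river."))
```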
arXiv Detail & Related papers (2024-07-29T08:38:46Z)
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world.
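One way eliciting grounding "by proper prompting" can work in practice is to ask the generative VLM to emit box coordinates as text and parse them; `vlm` is a hypothetical image+text callable and the coordinate format is an assumption, not the paper's protocol.

```python
# Hedged sketch: prompt a generative VLM for a bounding box, then parse
# '[x1, y1, x2, y2]' out of its free-form text reply.
import re

def ground(vlm, image, phrase: str):
    """Return (x1, y1, x2, y2) parsed from the VLM's reply, or None."""
    reply = vlm(image, f"Return the bounding box of '{phrase}' as [x1,y1,x2,y2].")
    match = re.search(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", reply)
    return tuple(int(g) for g in match.groups()) if match else None

stub_vlm = lambda image, prompt: "The region is [12, 40, 96, 128]."
print(ground(stub_vlm, image=None, phrase="red car"))  # (12, 40, 96, 128)
```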
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
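The preference-construction step can be pictured with a minimal sketch: the model's own description of an image serves as the preferred sample and a mismatched description as the dispreferred one. Using a distractor image for the negative is an assumption for illustration, not STIC's exact corruption recipe.

```python
# Hedged sketch of self-constructed preference pairs for image descriptions.
def build_preference_pair(describe, image, distractor_image):
    """Return a DPO-style (chosen, rejected) pair for one unlabeled image."""
    preferred = describe(image)                # description of the actual image
    dispreferred = describe(distractor_image)  # mismatched description as negative
    return {"image": image, "chosen": preferred, "rejected": dispreferred}

stub_describe = lambda img: f"a description of {img}"
print(build_preference_pair(stub_describe, "img_001", "img_042"))
```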
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
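To give a sense of what pseudo word-level targets can mean mechanically, here is a generic clustering sketch over frame-level features; plain k-means plus run-merging is a stand-in assumption, not PW-HuBERT's actual target-extraction pipeline.

```python
# Hedged sketch: derive discrete pseudo targets by clustering frame features
# from a (here randomly simulated) visually-grounded speech model.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(1000, 256)             # (T, D) frame features (toy data)
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(frames)
pseudo_targets = kmeans.labels_                 # one discrete target per frame

# Merging runs of identical labels yields coarse word-like segments.
segments = [pseudo_targets[0]]
for t in pseudo_targets[1:]:
    if t != segments[-1]:
        segments.append(t)
print(len(segments), segments[:10])
```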
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
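Comparisons of this kind typically rely on two frozen-feature protocols, linear probing and k-NN classification; a minimal sketch of both follows, with random arrays standing in for real extracted features.

```python
# Hedged sketch of the two standard frozen-feature evaluation protocols.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 128)), rng.integers(0, 10, 500)
X_test, y_test = rng.normal(size=(100, 128)), rng.integers(0, 10, 100)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # linear probe
knn = KNeighborsClassifier(n_neighbors=20).fit(X_train, y_train)  # k-NN classifier
print(probe.score(X_test, y_test), knn.score(X_test, y_test))
```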
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
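A common way to read spatial information out of such a model is to score every patch token against a phrase embedding, yielding a coarse localization map; the sketch below uses random features and an assumed 14x14 patch grid purely for illustration.

```python
# Hedged sketch: patch-text similarity as a coarse grouping/localization map.
import torch
import torch.nn.functional as F

patch_tokens = torch.randn(14 * 14, 512)   # (P, D) per-patch embeddings
text_emb = torch.randn(512)                # (D,) embedding of a query phrase

sims = F.normalize(patch_tokens, dim=-1) @ F.normalize(text_emb, dim=0)
heatmap = sims.reshape(14, 14)             # high values ~ where the phrase matches
print(heatmap.shape)
```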
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing [17.96645738679543]
We show that textual semantic modelling can substantially improve contrastive learning in self-supervised vision-language processing.
We propose a self-supervised joint vision-language approach with a focus on better text modelling.
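For reference, the contrastive objective such approaches build on is the symmetric InfoNCE loss of CLIP-style pretraining, sketched minimally below; the paper's contribution is better text modelling on top of this objective, which the sketch does not attempt to reproduce.

```python
# Hedged sketch of the symmetric InfoNCE (CLIP-style) contrastive loss,
# where row i of each batch forms the true image-text pair.
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, temp: float = 0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temp              # (N, N); diagonal = positive pairs
    targets = torch.arange(len(img))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```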
arXiv Detail & Related papers (2022-04-21T00:04:35Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
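Conditioning a decoder on a style token plus retrieved keywords can be pictured as simple input construction; the token names and sequence layout below are assumptions for illustration, not the paper's actual tokenization.

```python
# Hedged sketch: build a decoder input that separates style from semantics.
def build_decoder_input(style: str, keywords: list[str], caption: str) -> str:
    """Prepend a style control token and keyword hints to the target caption."""
    return (f"<style:{style}> "
            + " ".join(f"<kw:{k}>" for k in keywords)
            + " <sep> " + caption)

print(build_decoder_input("descriptive", ["dog", "frisbee"],
                          "a dog leaps to catch a frisbee"))
```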
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
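The retrieval step of vokenization reduces to a nearest-image lookup per token: each contextual token embedding is matched to its most similar image ("voken") by cosine similarity. The sketch below uses random stand-in embeddings rather than the trained vokenizer's.

```python
# Hedged sketch: assign each language token a voken (image id) by
# cosine-similarity retrieval over a candidate image set.
import torch
import torch.nn.functional as F

token_emb = torch.randn(32, 256)    # (T, D) contextual token embeddings
image_emb = torch.randn(1000, 256)  # (V, D) candidate image embeddings

sims = F.normalize(token_emb, dim=-1) @ F.normalize(image_emb, dim=-1).T
vokens = sims.argmax(dim=-1)        # (T,) one retrieved image id per token
print(vokens[:10])
```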
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.