Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided
Visual Foundation Models
- URL: http://arxiv.org/abs/2304.10597v1
- Date: Thu, 20 Apr 2023 18:39:41 GMT
- Title: Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided
Visual Foundation Models
- Authors: Jielu Zhang, Zhongliang Zhou, Gengchen Mai, Lan Mu, Mengxuan Hu, Sheng
Li
- Abstract summary: This study focuses on the remote sensing domain, where the images are notably dissimilar from those in conventional scenarios.
We developed a pipeline that leverages multiple foundation models to facilitate remote sensing image semantic segmentation tasks guided by text prompts.
The pipeline is benchmarked on several widely-used remote sensing datasets, and we present preliminary results to demonstrate its effectiveness.
- Score: 5.360103006279672
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in foundation models (FMs), such as GPT-4 and LLaMA, have
attracted significant attention due to their exceptional performance in
zero-shot learning scenarios. Similarly, in the field of visual learning,
models like Grounding DINO and the Segment Anything Model (SAM) have exhibited
remarkable progress in open-set detection and instance segmentation tasks. It
is undeniable that these FMs will profoundly impact a wide range of real-world
visual learning tasks, ushering in a new paradigm shift for developing such
models. In this study, we concentrate on the remote sensing domain, where the
images are notably dissimilar from those in conventional scenarios. We
developed a pipeline that leverages multiple FMs to facilitate remote sensing
image semantic segmentation tasks guided by text prompts, which we denote as
Text2Seg. The pipeline is benchmarked on several widely-used remote sensing
datasets, and we present preliminary results to demonstrate its effectiveness.
Through this work, we aim to provide insights into maximizing the applicability
of visual FMs in specific contexts with minimal model tuning. The code is
available at https://github.com/Douglas2Code/Text2Seg.
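
The following is a minimal sketch of the general idea the abstract describes: a text prompt drives a detector, the detected boxes prompt a segmenter, and the resulting masks are merged into one semantic label map. It is not the authors' implementation. The `detect` and `segment` callables, the toy stand-ins, and the rule that later prompts overwrite earlier ones are all assumptions made for illustration; they are not calls from Grounding DINO, SAM, or the Text2Seg repository.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (row0, col0, row1, col1), end-exclusive
Mask = List[List[bool]]           # H x W binary mask
LabelMap = List[List[int]]        # H x W class ids, 0 = background


def text_to_semantic_map(
    image_size: Tuple[int, int],
    prompts: Dict[str, int],
    detect: Callable[[str], List[Box]],
    segment: Callable[[Box], Mask],
) -> LabelMap:
    """Merge text-prompted detections into a single semantic label map.

    `detect` and `segment` are hypothetical stand-ins for an open-set
    detector and a box-promptable segmenter; they are not real library calls.
    """
    height, width = image_size
    label_map = [[0] * width for _ in range(height)]
    for text, class_id in prompts.items():
        for box in detect(text):
            mask = segment(box)
            for r in range(height):
                for c in range(width):
                    if mask[r][c]:
                        # Assumed tie-break: later prompts overwrite earlier ones.
                        label_map[r][c] = class_id
    return label_map


def toy_detect(text: str) -> List[Box]:
    """Toy detector: returns fixed boxes for two known prompts."""
    boxes = {"building": [(0, 0, 2, 2)], "road": [(1, 1, 4, 4)]}
    return boxes.get(text, [])


def make_toy_segment(height: int, width: int) -> Callable[[Box], Mask]:
    """Toy segmenter: fills the whole box, ignoring image content."""
    def segment(box: Box) -> Mask:
        r0, c0, r1, c1 = box
        return [
            [r0 <= r < r1 and c0 <= c < c1 for c in range(width)]
            for r in range(height)
        ]
    return segment


toy_map = text_to_semantic_map(
    (4, 4), {"building": 1, "road": 2}, toy_detect, make_toy_segment(4, 4)
)
for row in toy_map:
    print(row)
```

In a real setting, the two callables would wrap actual model inference, and overlapping masks might be resolved by confidence score rather than prompt order.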
Related papers
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews VLGFMs, summarizing and analyzing recent developments in the field.
(2024-06-13T17:57:30Z)
- Multi-Granularity Language-Guided Multi-Object Tracking [95.91263758294154]
We propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity.
At inference, our LG-MOT uses the standard visual features without relying on annotated language descriptions.
Our LG-MOT achieves an absolute gain of 2.2% in terms of target object association (IDF1 score) compared to the baseline using only visual features.
(2024-06-07T11:18:40Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
Transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
(2024-03-20T09:17:22Z)
- AM-RADIO: Agglomerative Vision Foundation Model -- Reduce All Domains Into One [47.58919672657824]
We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One).
We develop a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 7x faster than the teacher models.
Our comprehensive benchmarking process covers downstream tasks including ImageNet classification, ADE20k semantic segmentation, COCO object detection and LLaVa-1.5 framework.
(2023-12-10T17:07:29Z)
- Adapting Segment Anything Model for Change Detection in HR Remote Sensing Images [18.371087310792287]
This work aims to utilize the strong visual recognition capabilities of Vision Foundation Models (VFMs) to improve the change detection of high-resolution Remote Sensing Images (RSIs).
We employ the visual encoder of FastSAM, an efficient variant of the SAM, to extract visual representations in RS scenes.
To utilize the semantic representations that are inherent to SAM features, we introduce a task-agnostic semantic learning branch to model the semantic latent in bi-temporal RSIs.
The resulting method, SAMCD, obtains superior accuracy compared to the SOTA methods and exhibits a sample-efficient learning ability that is comparable to semi-
(2023-09-04T08:23:31Z)
- RRSIS: Referring Remote Sensing Image Segmentation [25.538406069768662]
Localizing desired objects from remote sensing images is of great use in practical applications.
Referring image segmentation, which aims at segmenting out the objects to which a given expression refers, has been extensively studied in natural images.
We introduce referring remote sensing image segmentation (RRSIS) to fill in this gap and make some insightful explorations.
(2023-06-14T16:40:19Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
(2023-03-16T00:06:28Z)
- Extending global-local view alignment for self-supervised learning with remote sensing imagery [1.5192294544599656]
Self-supervised models acquire general feature representations by formulating a pretext task that generates pseudo-labels for massive unlabeled data.
Inspired by DINO, we formulate two pretext tasks for self-supervised learning on remote sensing imagery (SSLRS).
We extend DINO and propose DINO-MC, which uses local views from crops of various sizes instead of a single fixed size.
(2023-03-12T14:24:10Z)
- Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
(2023-03-09T16:13:27Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [144.38869017091199]
Vision transformers (ViTs) in image classification have shifted the methodologies for visual representation learning.
In this work, we explore the global context learning potentials of ViTs for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
(2022-07-19T15:49:35Z)
- Referring Transformer: A One-step Approach to Multi-task Visual Grounding [45.42959940733406]
We propose a simple one-stage multi-task framework for visual grounding tasks.
Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder.
We show that our model benefits greatly from contextualized information and multi-task training.
(2021-06-06T10:53:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.