REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing
- URL: http://arxiv.org/abs/2511.17442v1
- Date: Fri, 21 Nov 2025 17:41:26 GMT
- Title: REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing
- Authors: Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir
- Abstract summary: Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries.
- Score: 16.401094981355218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.
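As a reading aid, the sketch below illustrates the kind of pipeline the abstract describes: parse a natural-language query into explicit constraints, filter a structured model database (here a plain list of dicts standing in for RS-FMD), and ask an LLM to rank the surviving candidates in context with brief justifications. This is a minimal sketch under stated assumptions, not the authors' implementation; the `Query` schema, the metadata field names, and the `call_llm` helper are all illustrative.

```python
import json
from dataclasses import dataclass


@dataclass
class Query:
    """Constraints assumed to be extracted from the user's natural-language request."""
    modality: str        # e.g. "SAR", "multispectral" (hypothetical field)
    task: str            # e.g. "semantic segmentation" (hypothetical field)
    max_params_m: float  # deployment constraint, in millions of parameters


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; returns the model's reply text."""
    raise NotImplementedError


def select_models(query: Query, model_db: list[dict], top_k: int = 3) -> str:
    # 1) Hard filtering on the explicit constraints parsed from the user query.
    candidates = [
        m for m in model_db
        if query.modality in m["modalities"]
        and query.task in m["tasks"]
        and m["params_m"] <= query.max_params_m
    ]
    # 2) In-context ranking: show the surviving metadata records to the LLM
    #    and ask for a ranked shortlist with short justifications.
    prompt = (
        "You are selecting a remote sensing foundation model.\n"
        f"User needs: {query}\n"
        f"Candidate metadata:\n{json.dumps(candidates, indent=2)}\n"
        f"Rank the top {top_k} candidates and briefly justify each choice."
    )
    return call_llm(prompt)
```

In this toy setup, constraint resolution and ranking both run through the LLM prompt; how REMSA actually resolves missing constraints and scores candidates is specified in the paper, not here.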
Related papers
- OFA-MAS: One-for-All Multi-Agent System Topology Design based on Mixture-of-Experts Graph Generative Models [57.94189874119267]
Multi-Agent Systems (MAS) offer a powerful paradigm for solving complex problems. Current graph learning-based design methodologies often adhere to a "one-for-one" paradigm. We propose OFA-TAD, a one-for-all framework that generates adaptive collaboration graphs for any task described in natural language.
arXiv Detail & Related papers (2026-01-19T12:23:44Z) - Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. We present RSCoVLM, a simple yet flexible VLM baseline for RS multi-task learning (MTL). We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - MSRS: Evaluating Multi-Source Retrieval-Augmented Generation [51.717139132190574]
Many real-world applications demand the ability to integrate and summarize information scattered across multiple sources. We present a scalable framework for constructing evaluation benchmarks that challenge RAG systems to integrate information across distinct sources.
arXiv Detail & Related papers (2025-08-28T14:59:55Z) - Region-Level Context-Aware Multimodal Understanding [52.565528550835786]
Region-level Context-aware Multimodal Understanding (RCMU) is the ability to integrate textual context associated with objects for more context-aware multimodal understanding. To equip MLLMs with RCMU capabilities, we propose Region-level Context-aware Visual Instruction Tuning (RCVIT). We introduce the RCMU dataset, a large-scale visual instruction tuning dataset that covers multiple RCMU tasks. We also propose RC&P-Bench, a comprehensive benchmark that evaluates the performance of MLLMs on RCMU and multimodal personalized understanding tasks.
arXiv Detail & Related papers (2025-08-17T07:18:43Z) - RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning [15.670921552151775]
RingMo-Agent is designed to handle multi-modal and multi-platform data. It is supported by a large-scale vision-language dataset named RS-VL3M. It proves effective in both visual understanding and sophisticated analytical tasks.
arXiv Detail & Related papers (2025-07-28T12:39:33Z) - Multimodal Information Retrieval for Open World with Edit Distance Weak Supervision [0.0]
"FemmIR" is a framework to retrieve results relevant to information needs expressed with multimodal queries by example without any similarity label.<n>We empirically evaluate FemmIR on a missing person use case with MuQNOL.
arXiv Detail & Related papers (2025-06-25T00:25:08Z) - New Dataset and Methods for Fine-Grained Compositional Referring Expression Comprehension via Specialist-MLLM Collaboration [49.180693704510006]
Referring Expression Comprehension (REC) is a cross-modal task that evaluates the interplay of language understanding, image comprehension, and language-to-image grounding. It serves as an essential testing ground for Multimodal Large Language Models (MLLMs).
arXiv Detail & Related papers (2025-02-27T13:58:44Z) - Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts [56.7225771305861]
This paper introduces Multi-Modal Retrieval-Augmented Generation (M$^2$RAG), a benchmark designed to evaluate the effectiveness of Multi-modal Large Language Models (MLLMs) in retrieval-augmented settings. The benchmark comprises four tasks: image captioning, multi-modal question answering, multi-modal fact verification, and image reranking. To enhance the context utilization capabilities of MLLMs, we also introduce Multi-Modal Retrieval-Augmented Instruction Tuning (MM-RAIT).
arXiv Detail & Related papers (2025-02-24T16:25:25Z) - A Simple Aerial Detection Baseline of Multimodal Language Models [33.91030170608569]
We present LMMRotate, a simple baseline that applies multimodal language models to aerial detection for the first time. We construct the baseline by fine-tuning open-source general-purpose models and achieve detection performance comparable to conventional detectors.
arXiv Detail & Related papers (2025-01-16T18:09:22Z) - MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs [78.5013630951288]
This paper introduces techniques for advancing information retrieval with multimodal large language models (MLLMs). We first study fine-tuning an MLLM as a bi-encoder retriever on 10 datasets with 16 retrieval tasks. Our model, MM-Embed, achieves state-of-the-art performance on the multimodal retrieval benchmark M-BEIR.
arXiv Detail & Related papers (2024-11-04T20:06:34Z) - SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding [26.08043905865113]
We propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples.
FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty.
We establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC.
arXiv Detail & Related papers (2024-06-14T14:57:07Z)