RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning
- URL: http://arxiv.org/abs/2507.20776v1
- Date: Mon, 28 Jul 2025 12:39:33 GMT
- Title: RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning
- Authors: Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, Kun Fu, Xian Sun
- Abstract summary: RingMo-Agent is designed to handle multi-modal and multi-platform data. It is supported by a large-scale vision-language dataset named RS-VL3M. It proves effective in both visual understanding and sophisticated analytical tasks.
- Score: 15.670921552151775
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote sensing (RS) images from multiple modalities and platforms exhibit diverse details due to differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS largely relies on relatively homogeneous data sources. Moreover, it remains limited to conventional visual perception tasks such as classification or captioning. As a result, these methods fail to serve as a unified and standalone framework capable of effectively handling RS imagery from diverse sources in real-world applications. To address these issues, we propose RingMo-Agent, a model designed to handle multi-modal and multi-platform data and to perform perception and reasoning tasks based on user textual instructions. Compared with existing models, RingMo-Agent 1) is supported by a large-scale vision-language dataset named RS-VL3M, comprising over 3 million image-text pairs spanning optical, SAR, and infrared (IR) modalities collected from both satellite and UAV platforms, covering perception and challenging reasoning tasks; 2) learns modality-adaptive representations by incorporating separated embedding layers to construct isolated features for heterogeneous modalities and reduce cross-modal interference; 3) unifies task modeling by introducing task-specific tokens and employing a token-based high-dimensional hidden-state decoding mechanism designed for long-horizon spatial tasks. Extensive experiments on various RS vision-language tasks demonstrate that RingMo-Agent not only proves effective in both visual understanding and sophisticated analytical tasks, but also exhibits strong generalizability across different platforms and sensing modalities.
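The abstract's second and third design points (separated embedding layers per modality, and task-specific tokens that condition a shared decoder) lend themselves to a brief illustration. The sketch below is a minimal, hypothetical PyTorch example of how such components could be wired in front of a language model; the class names, dimensions, modality list, and task list are assumptions for illustration and are not taken from the authors' implementation.

```python
# Minimal sketch (assumptions only, not the RingMo-Agent code):
# (1) one isolated projection layer per modality (optical / SAR / IR) to reduce
#     cross-modal interference, and
# (2) learnable task-specific tokens prepended to the visual sequence.
import torch
import torch.nn as nn

class ModalitySeparatedProjector(nn.Module):
    """Projects visual features into the LLM embedding space, one projector per modality."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096,
                 modalities=("optical", "sar", "infrared")):
        super().__init__()
        # Separate weights per modality: heterogeneous sensors do not share a projector.
        self.projectors = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(),
                             nn.Linear(llm_dim, llm_dim))
            for m in modalities
        })

    def forward(self, vis_feats: torch.Tensor, modality: str) -> torch.Tensor:
        # vis_feats: (batch, num_patches, vis_dim)
        return self.projectors[modality](vis_feats)

class TaskTokenEmbedder(nn.Module):
    """Learnable task-specific tokens that tell the decoder which task to perform."""
    def __init__(self, llm_dim: int = 4096,
                 tasks=("caption", "vqa", "grounding", "reasoning")):
        super().__init__()
        self.task_ids = {t: i for i, t in enumerate(tasks)}
        self.embed = nn.Embedding(len(tasks), llm_dim)

    def forward(self, task: str, batch_size: int) -> torch.Tensor:
        idx = torch.full((batch_size, 1), self.task_ids[task], dtype=torch.long)
        return self.embed(idx)  # (batch, 1, llm_dim)

if __name__ == "__main__":
    projector = ModalitySeparatedProjector()
    task_tokens = TaskTokenEmbedder()
    sar_feats = torch.randn(2, 256, 1024)            # dummy SAR patch features
    vis_embeds = projector(sar_feats, "sar")         # (2, 256, 4096)
    prefix = task_tokens("grounding", batch_size=2)  # (2, 1, 4096)
    llm_inputs = torch.cat([prefix, vis_embeds], 1)  # sequence handed to the LLM
    print(llm_inputs.shape)                          # torch.Size([2, 257, 4096])
```

The token-based hidden-state decoding for long-horizon spatial tasks would sit downstream of this prefix-building step; it is omitted here because the abstract does not specify its structure.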
Related papers
- VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning [10.497961559068493]
Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes. Existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage. VisualTrans is the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios.
arXiv Detail & Related papers (2025-08-06T03:07:05Z)
- MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [58.67129770371016]
We propose a novel IRSTD framework that reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy.
arXiv Detail & Related papers (2025-05-21T07:02:05Z)
- Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation [7.992331117310217]
Referring remote sensing image segmentation (RRSIS) is a novel visual task in remote sensing image segmentation. We design a multimodal-aware fusion network (MAFN) to achieve fine-grained alignment and fusion between the two modalities.
arXiv Detail & Related papers (2025-03-14T08:31:21Z)
- UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models [23.044366104080822]
We introduce UniRS, the first vision-language model unifying multi-temporal remote sensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks.
arXiv Detail & Related papers (2024-12-30T06:34:18Z)
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts [17.76606110070648]
We propose RSUniVLM, a unified, end-to-end RS VLM for comprehensive vision understanding across multiple granularities. RSUniVLM performs effectively in multi-image analysis, with instances of change detection and change captioning. We also construct a large-scale RS instruction-following dataset based on a variety of existing datasets in both the RS and general domains.
arXiv Detail & Related papers (2024-12-07T15:11:21Z)
- X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning [109.9413329636322]
This paper introduces an efficient framework that integrates multiple modalities (images, 3D, audio, and video) into a frozen Large Language Model (LLM). Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs).
arXiv Detail & Related papers (2023-11-30T18:43:51Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC). UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Support-set based Multi-modal Representation Enhancement for Video Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.