InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition
- URL: http://arxiv.org/abs/2505.15818v1
- Date: Wed, 21 May 2025 17:59:56 GMT
- Title: InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition
- Authors: Yijie Zheng, Weijie Wu, Qingyun Li, Xuehui Wang, Xu Zhou, Aiai Ren, Jun Shen, Long Zhao, Guoqing Li, Xue Yang,
- Abstract summary: InstructSAM is a training-free framework for instruction-driven object recognition. We present EarthInstruct, the first InstructCDS benchmark for earth observation.
- Score: 19.74617806521803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language-guided object recognition in remote sensing imagery is crucial for large-scale mapping and automated data annotation. However, existing open-vocabulary and visual grounding methods rely on explicit category cues, limiting their ability to handle complex or implicit queries that require advanced reasoning. To address this issue, we introduce a new suite of tasks, Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS), covering open-vocabulary, open-ended, and open-subclass scenarios. We further present EarthInstruct, the first InstructCDS benchmark for earth observation. It is constructed from two diverse remote sensing datasets with varying spatial resolutions and annotation rules across 20 categories, requiring models to interpret dataset-specific instructions. Given the scarcity of semantically rich labeled data in remote sensing, we propose InstructSAM, a training-free framework for instruction-driven object recognition. InstructSAM leverages large vision-language models to interpret user instructions and estimate object counts, employs SAM2 for mask proposals, and formulates mask-label assignment as a binary integer programming problem. By integrating semantic similarity with counting constraints, InstructSAM efficiently assigns categories to predicted masks without relying on confidence thresholds. Experiments demonstrate that InstructSAM matches or surpasses specialized baselines across multiple tasks while maintaining near-constant inference time regardless of object count, reducing output tokens by 89% and overall runtime by over 32% compared to direct generation approaches. We believe the contributions of the proposed tasks, benchmark, and effective approach will advance future research in developing versatile object recognition systems.
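The abstract's mask-label assignment step can be illustrated in miniature. The sketch below uses hypothetical variable names and, for brevity, approximates the paper's binary integer program with a greedy matcher: each mask is assigned to at most one category, the number of masks per category is capped by the LVLM-estimated count, and no confidence threshold is involved.

```python
# Greedy sketch of counting-constrained mask-label assignment
# (illustrative only; the paper solves this exactly as a binary
# integer program).

def assign_labels(similarity, counts):
    """similarity[i][j]: semantic similarity between mask i and category j.
    counts[j]: LVLM-estimated object count for category j.
    Returns {mask_index: category_index}; surplus masks stay unlabeled."""
    triples = [(s, i, j)
               for i, row in enumerate(similarity)
               for j, s in enumerate(row)]
    triples.sort(reverse=True)        # consider highest similarity first
    remaining = list(counts)          # per-category quota from the counts
    assignment = {}
    for s, i, j in triples:
        if i not in assignment and remaining[j] > 0:
            assignment[i] = j
            remaining[j] -= 1
    return assignment

# Three SAM-style mask proposals, two categories; the LVLM predicted
# one object per category, so the weakest mask remains unassigned.
sim = [[0.9, 0.2],
       [0.3, 0.8],
       [0.4, 0.5]]
print(assign_labels(sim, [1, 1]))     # {0: 0, 1: 1}
```

Because the quotas come from the estimated counts rather than from per-mask confidence scores, the assignment cost does not grow with the number of objects described in the output text, which is consistent with the near-constant inference time the abstract reports.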
Related papers
- Ambiguity Resolution in Text-to-Structured Data Mapping [10.285528620331696]
Ambiguity in natural language is a significant obstacle for achieving accurate text to structured data mapping. We propose a new framework to improve the performance of large language models (LLMs) on ambiguous agentic tool calling through missing concepts prediction.
arXiv Detail & Related papers (2025-05-16T20:39:30Z) - Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image.
We present two new SOD datasets "DUTS-MM" and "DUS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z) - Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery [17.156864650143678]
We develop a few-shot object detector based on a traditional two-stage architecture.
A large-scale pre-trained model is used to build class-reference embeddings or prototypes.
We perform evaluations on two remote sensing datasets containing challenging and rare objects.
arXiv Detail & Related papers (2024-03-08T15:20:27Z) - IntenDD: A Unified Contrastive Learning Approach for Intent Detection and Discovery [12.905097743551774]
We propose IntenDD, a unified approach leveraging a shared utterance encoding backbone.
IntenDD uses an entirely unsupervised contrastive learning strategy for representation learning.
We find that our approach consistently outperforms competitive baselines across all three tasks.
arXiv Detail & Related papers (2023-10-25T16:50:24Z) - Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z) - LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation.
The task is designed to output a segmentation mask given a complex and implicit query text.
We present LISA: Large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z) - Training-free Object Counting with Prompts [12.358565655046977]
Existing approaches rely on extensive training data with point annotations for each object.
We propose a training-free object counter that treats the counting task as a segmentation problem.
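Treating counting as segmentation, as this entry describes, reduces the count to the number of distinct foreground regions. A toy illustration (names are illustrative, not from the paper) on a binary mask:

```python
# Counting-by-segmentation in miniature: the object count is the
# number of 4-connected foreground components in a binary mask.

def count_objects(mask):
    """mask: 2-D list of 0/1 values. Returns the component count."""
    rows, cols = len(mask), len(mask[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and (r, c) not in seen:
                count += 1
                stack = [(r, c)]          # flood-fill one component
                while stack:
                    y, x = stack.pop()
                    if ((y, x) in seen
                            or not (0 <= y < rows and 0 <= x < cols)
                            or not mask[y][x]):
                        continue
                    seen.add((y, x))
                    stack += [(y + 1, x), (y - 1, x),
                              (y, x + 1), (y, x - 1)]
    return count

grid = [[1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 1, 0, 1]]
print(count_objects(grid))   # 3
```

In practice the binary mask would come from a promptable segmenter rather than a hand-written grid, but the counting step itself needs no training, which is the point of the entry above.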
arXiv Detail & Related papers (2023-06-30T13:26:30Z) - Self-Supervised Interactive Object Segmentation Through a Singulation-and-Grasping Approach [9.029861710944704]
We propose a robot learning approach to interact with novel objects and collect each object's training label.
The Singulation-and-Grasping (SaG) policy is trained through end-to-end reinforcement learning.
Our system achieves 70% singulation success rate in simulated cluttered scenes.
arXiv Detail & Related papers (2022-07-19T15:01:36Z) - Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient unseen-temporal segmentation.
We evaluate the proposed approach on DAVIS-17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z) - Weakly-Supervised Salient Object Detection via Scribble Annotations [54.40518383782725]
We propose a weakly-supervised salient object detection model to learn saliency from scribble labels.
We present a new metric, termed saliency structure measure, to measure the structure alignment of the predicted saliency maps.
Our method not only outperforms existing weakly-supervised/unsupervised methods, but is also on par with several fully-supervised state-of-the-art models.
arXiv Detail & Related papers (2020-03-17T12:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.