Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries
- URL: http://arxiv.org/abs/2512.14102v1
- Date: Tue, 16 Dec 2025 05:33:44 GMT
- Title: Neurosymbolic Inference On Foundation Models For Remote Sensing Text-to-image Retrieval With Complex Queries
- Authors: Emanuele Mezzi, Gertjan Burghouts, Maarten Kruithof,
- Abstract summary: RUNE (Reasoning Using Neurosymbolic Entities) is an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images.<n>Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability.<n>We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
- Score: 0.12744523252873352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image retrieval in remote sensing (RS) has advanced rapidly with the rise of large vision-language models (LVLMs) tailored for aerial and satellite imagery, culminating in remote sensing large vision-language models (RS-LVLMS). However, limited explainability and poor handling of complex spatial relations remain key challenges for real-world use. To address these issues, we introduce RUNE (Reasoning Using Neurosymbolic Entities), an approach that combines Large Language Models (LLMs) with neurosymbolic AI to retrieve images by reasoning over the compatibility between detected entities and First-Order Logic (FOL) expressions derived from text queries. Unlike RS-LVLMs that rely on implicit joint embeddings, RUNE performs explicit reasoning, enhancing performance and interpretability. For scalability, we propose a logic decomposition strategy that operates on conditioned subsets of detected entities, guaranteeing shorter execution time compared to neural approaches. Rather than using foundation models for end-to-end retrieval, we leverage them only to generate FOL expressions, delegating reasoning to a neurosymbolic inference module. For evaluation we repurpose the DOTA dataset, originally designed for object detection, by augmenting it with more complex queries than in existing benchmarks. We show the LLM's effectiveness in text-to-logic translation and compare RUNE with state-of-the-art RS-LVLMs, demonstrating superior performance. We introduce two metrics, Retrieval Robustness to Query Complexity (RRQC) and Retrieval Robustness to Image Uncertainty (RRIU), which evaluate performance relative to query complexity and image uncertainty. RUNE outperforms joint-embedding models in complex RS retrieval tasks, offering gains in performance, robustness, and explainability. We show RUNE's potential for real-world RS applications through a use case on post-flood satellite image retrieval.
Related papers
- DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories [52.57197752244638]
We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task.<n>Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues.<n>We construct DISBench, a challenging benchmark built on interconnected visual data.
arXiv Detail & Related papers (2026-02-11T12:51:10Z) - ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval [64.14282916266998]
Composed Image Retrieval aims to retrieve target images based on a hybrid query comprising a reference image and a modification text.<n>We propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline.<n>Experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-02T04:52:54Z) - Beyond Error-Based Optimization: Experience-Driven Symbolic Regression with Goal-Conditioned Reinforcement Learning [14.473539776112666]
We propose a novel framework named EGRL-SR (Experience-driven Goal-conditioned Reinforcement Learning for Regression)<n>We formulate symbolic regression as a goal-conditioned reinforcement learning problem and incorporate hindsight experience replay.<n>We design an all-point satisfaction binary reward function that encourages the action-value network to focus on structural patterns rather than low-error expressions.
arXiv Detail & Related papers (2026-01-21T06:08:37Z) - KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering [64.62317305868264]
We present textbfKBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning.<n>Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions.<n>Experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-10T17:45:42Z) - Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning.<n>We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL.<n>We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text.<n>CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z) - Looking Alike From Far to Near: Enhancing Cross-Resolution Re-Identification via Feature Vector Panning [14.89776496894534]
Cross-Resolution ReID (CR-ReID) methods rely on super-resolution (SR) or joint learning for feature compensation.<n>Inspired by semantic directions in the word embedding space, we empirically discover that semantic directions implying resolution differences also emerge in the feature space of ReID.<n>We propose a lightweight and effective Vector Panning Feature Alignment (VPFA) framework, which conducts CR-ReID from a novel perspective.
arXiv Detail & Related papers (2025-10-01T14:15:39Z) - From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems [6.762635083456022]
We investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems.<n>We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance.<n>This study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.
arXiv Detail & Related papers (2025-07-10T15:26:59Z) - Sparse Interpretable Deep Learning with LIES Networks for Symbolic Regression [22.345828337550575]
Symbolic regression aims to discover closed-form mathematical expressions that accurately describe data.<n>Existing SR methods often rely on population-based search or autoregressive modeling.<n>We introduce LIES (Logarithm, Identity, Exponential, Sine), a fixed neural network architecture with interpretable primitive activations that are optimized to model symbolic expressions.
arXiv Detail & Related papers (2025-06-09T22:05:53Z) - CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query.<n>We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs.<n>We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z) - ICF-SRSR: Invertible scale-Conditional Function for Self-Supervised
Real-world Single Image Super-Resolution [60.90817228730133]
Single image super-resolution (SISR) is a challenging problem that aims to up-sample a given low-resolution (LR) image to a high-resolution (HR) counterpart.
Recent approaches are trained on simulated LR images degraded by simplified down-sampling operators.
We propose a novel Invertible scale-Conditional Function (ICF) which can scale an input image and then restore the original input with different scale conditions.
arXiv Detail & Related papers (2023-07-24T12:42:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.