Transformer-based Spatial Grounding: A Comprehensive Survey
- URL: http://arxiv.org/abs/2507.12739v1
- Date: Thu, 17 Jul 2025 02:44:01 GMT
- Title: Transformer-based Spatial Grounding: A Comprehensive Survey
- Authors: Ijazul Haq, Muhammad Saqib, Yingjie Zhang
- Abstract summary: This paper presents a systematic literature review of transformer-based spatial grounding approaches from 2018 to 2025. Our analysis identifies dominant model architectures, prevalent datasets, and widely adopted evaluation metrics. This study provides essential insights and structured guidance for researchers and practitioners, facilitating the development of robust, reliable, and industry-ready models.
- Score: 3.309903719647421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial grounding, the process of associating natural language expressions with corresponding image regions, has rapidly advanced due to the introduction of transformer-based models, significantly enhancing multimodal representation and cross-modal alignment. Despite this progress, the field lacks a comprehensive synthesis of current methodologies, dataset usage, evaluation metrics, and industrial applicability. This paper presents a systematic literature review of transformer-based spatial grounding approaches from 2018 to 2025. Our analysis identifies dominant model architectures, prevalent datasets, and widely adopted evaluation metrics, alongside highlighting key methodological trends and best practices. This study provides essential insights and structured guidance for researchers and practitioners, facilitating the development of robust, reliable, and industry-ready transformer-based spatial grounding models.
Related papers
- Foundation Models and Transformers for Anomaly Detection: A Survey [2.3264194695971656]
The survey categorizes VAD methods into reconstruction-based, feature-based and zero/few-shot approaches. Transformers and foundation models enable more robust, interpretable, and scalable anomaly detection solutions.
arXiv Detail & Related papers (2025-07-21T12:01:04Z)
- A Sensor Agnostic Domain Generalization Framework for Leveraging Geospatial Foundation Models: Enhancing Semantic Segmentation via Synergistic Pseudo-Labeling and Generative Learning [5.299218284699214]
High-performance segmentation models are challenged by annotation scarcity and variability across sensors, illumination, and geography. This paper introduces a domain generalization approach to leveraging emerging geospatial foundation models by combining soft-alignment pseudo-labeling with source-to-target generative pre-training. Experiments with hyperspectral and multispectral remote sensing datasets confirm our method's effectiveness in enhancing adaptability and segmentation.
arXiv Detail & Related papers (2025-05-02T19:52:02Z)
- A Concise Survey on Lane Topology Reasoning for HD Mapping [30.73664953504888]
Lane topology reasoning techniques play a crucial role in high-definition (HD) mapping and autonomous driving applications. Recent years have witnessed significant advances in this field, but there has been limited effort to consolidate these works into a comprehensive overview. This survey systematically reviews the evolution and current state of lane topology reasoning methods.
arXiv Detail & Related papers (2025-03-31T11:30:40Z)
- A Survey of Model Architectures in Information Retrieval [64.75808744228067]
We focus on two key aspects: backbone models for feature extraction and end-to-end system architectures for relevance estimation. We trace the development from traditional term-based methods to modern neural approaches, particularly highlighting the impact of transformer-based models and subsequent large language models (LLMs). We conclude by discussing emerging challenges and future directions, including architectural optimizations for performance and scalability, handling of multimodal, multilingual data, and adaptation to novel application domains beyond traditional search paradigms.
arXiv Detail & Related papers (2025-02-20T18:42:58Z)
- Trajectory World Models for Heterogeneous Environments [67.27233466954814]
Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models. We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity. We also propose TrajWorld, a novel architecture capable of handling varying sensor and actuator information and capturing environment dynamics in-context.
arXiv Detail & Related papers (2025-02-03T13:59:08Z)
- Quantifying the synthetic and real domain gap in aerial scene understanding [1.696456370910212]
This paper introduces a novel methodology for scene complexity assessment using Multi-Model Consensus Metric (MMCM) and depth-based structural metrics. Our experimental analysis, utilizing real-world (Dronescapes) and synthetic (Skyscenes) datasets, demonstrates that real-world scenes generally exhibit higher consensus among state-of-the-art vision transformers. The results underline the inherent complexities and domain gaps, emphasizing the need for enhanced simulation fidelity and model generalization.
arXiv Detail & Related papers (2024-11-29T18:18:26Z)
- A Data-Driven Review of Remote Sensing-Based Data Fusion in Precision Agriculture from Foundational to Transformer-Based Techniques [6.184871136700834]
This review offers valuable insights for advancing precision agriculture through AI-driven data fusion techniques. We analyze research trends from 1994 to 2024, identifying key developments in data fusion, remote sensing, and AI-driven agricultural monitoring.
arXiv Detail & Related papers (2024-10-24T01:26:21Z)
- Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study the multi-source Domain Generalization of text classification.
We propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain.
arXiv Detail & Related papers (2024-09-20T07:46:21Z)
- A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks [60.38369406877899]
Transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data.
Transformer models excel in handling long dependencies between input sequence elements and enable parallel processing.
Our survey encompasses the identification of the top five application domains for transformer-based models.
arXiv Detail & Related papers (2023-06-11T23:13:51Z)
- GEO-Bench: Toward Foundation Models for Earth Monitoring [139.77907168809085]
We propose a benchmark comprised of six classification and six segmentation tasks.
This benchmark will be a driver of progress across a variety of Earth monitoring tasks.
arXiv Detail & Related papers (2023-06-06T16:16:05Z)
- How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.