ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning
- URL: http://arxiv.org/abs/2511.06316v1
- Date: Sun, 09 Nov 2025 10:44:26 GMT
- Title: ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning
- Authors: MD Thamed Bin Zaman Chowdhury, Moazzem Hossain
- Abstract summary: Most low- and middle-income countries face a critical shortage of accurate, location-specific crash data. Existing text-based geocoding tools perform poorly in multilingual and unstructured news environments. This study introduces ALIGN, a vision-language framework that emulates human spatial reasoning to infer accident coordinates.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reliable geospatial information on road accidents is vital for safety analysis and infrastructure planning, yet most low- and middle-income countries continue to face a critical shortage of accurate, location-specific crash data. Existing text-based geocoding tools perform poorly in multilingual and unstructured news environments, where incomplete place descriptions and mixed Bangla-English scripts obscure spatial context. To address these limitations, this study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning), a vision-language framework that emulates human spatial reasoning to infer accident coordinates directly from textual and map-based cues. ALIGN integrates large language and vision-language models within a multi-stage pipeline that performs optical character recognition, linguistic reasoning, and map-level verification through grid-based spatial scanning. The framework systematically evaluates each predicted location against contextual and visual evidence, ensuring interpretable, fine-grained geolocation outcomes without requiring model retraining. Applied to Bangla-language news data, ALIGN demonstrates consistent improvements over traditional geoparsing methods, accurately identifying district and sub-district-level crash sites. Beyond its technical contribution, the framework establishes a high-accuracy foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the broader integration of multimodal artificial intelligence in transportation analytics. The code for this paper is open-source and available at: https://github.com/Thamed-Chowdhury/ALIGN
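The map-level verification step described in the abstract — grid-based spatial scanning — can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the bounding box, grid size, and scoring function are all invented, and the real pipeline scores candidate cells against OCR and linguistic evidence rather than a toy distance function.

```python
# Hypothetical sketch of ALIGN-style grid-based spatial scanning: split a map
# bounding box into cells, score each cell center against extracted place
# cues, and return the best-scoring center as the inferred location.

def grid_scan(bbox, n, score_cell):
    """bbox = (lat_min, lat_max, lon_min, lon_max); n x n grid.
    score_cell(center) returns higher values for better matches."""
    lat_min, lat_max, lon_min, lon_max = bbox
    dlat = (lat_max - lat_min) / n
    dlon = (lon_max - lon_min) / n
    best = None
    for i in range(n):
        for j in range(n):
            center = (lat_min + (i + 0.5) * dlat, lon_min + (j + 0.5) * dlon)
            s = score_cell(center)
            if best is None or s > best[0]:
                best = (s, center)
    return best[1]

# Toy scorer: prefer cells near a (hypothetical) gazetteer hit for a place
# name mentioned in the news text, e.g. "Dhaka".
landmark = (23.81, 90.41)
score = lambda c: -((c[0] - landmark[0]) ** 2 + (c[1] - landmark[1]) ** 2)
print(grid_scan((23.0, 24.0, 90.0, 91.0), 10, score))
```

In the paper's framework the scoring would additionally weigh visual evidence from the map tile itself; the grid search shown here only captures the scanning skeleton.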
Related papers
- Vision-Language Feature Alignment for Road Anomaly Segmentation [38.2615882515309]
We propose a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories. At inference time, we introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence.
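The multi-source inference strategy in this abstract amounts to fusing three evidence streams into one anomaly score. A minimal sketch, assuming a simple convex combination (the weights and the fusion rule are illustrative assumptions, not the paper's actual formulation):

```python
# Illustrative fusion of the three evidence sources named in the abstract:
# text-guided similarity, CLIP image-text similarity, and detector confidence.

def fuse_scores(text_sim, image_text_sim, det_conf, weights=(0.4, 0.3, 0.3)):
    """Convex combination of per-region evidence sources in [0, 1]."""
    w1, w2, w3 = weights
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "weights must sum to 1"
    return w1 * text_sim + w2 * image_text_sim + w3 * det_conf

# A region with high detector confidence but low text similarity:
print(fuse_scores(0.2, 0.5, 0.9))
```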
arXiv Detail & Related papers (2026-03-01T10:17:00Z)
- OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents [68.85365034738534]
We introduce a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split.
arXiv Detail & Related papers (2026-02-19T18:59:54Z)
- Large Language Models for Geolocation Extraction in Humanitarian Crisis Response [0.0]
This paper investigates whether Large Language Models can address geographic disparities in extracting location information from humanitarian documents. We introduce a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module. Results show that LLM-based methods substantially improve both the precision and fairness of geolocation extraction from humanitarian texts.
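The two-step framework described here can be sketched in miniature: step one extracts place mentions (mocked below with a keyword scan instead of a few-shot LLM), step two resolves each mention against a gazetteer. The gazetteer contents, coordinates, and function names are hypothetical.

```python
# Hypothetical two-step geolocation sketch: (1) NER for place mentions,
# (2) gazetteer-based geocoding of each mention.

GAZETTEER = {"kinshasa": (-4.325, 15.322), "goma": (-1.658, 29.220)}

def extract_places(text, known=GAZETTEER):
    """Step 1 (mocked NER): return known place mentions found in the text."""
    return [name for name in known if name in text.lower()]

def geocode(mentions, gazetteer=GAZETTEER):
    """Step 2: resolve each mention to coordinates."""
    return {m: gazetteer[m] for m in mentions}

report = "Flooding displaced families near Goma this week."
print(geocode(extract_places(report)))  # → {'goma': (-1.658, 29.220)}
```

The paper's agent-based geocoding module would replace the dictionary lookup with tool calls to external geocoding services; the two-stage split is the point of the sketch.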
arXiv Detail & Related papers (2026-02-09T16:34:25Z)
- Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach [41.001581773172695]
We present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference.
arXiv Detail & Related papers (2026-01-01T16:51:41Z)
- GeoSense-AI: Fast Location Inference from Crisis Microblogs [0.0]
GeoSense-AI is an applied AI pipeline for real-time geolocation from noisy microblog streams. The system attains strong F1 while being engineered for orders-of-magnitude faster throughput.
arXiv Detail & Related papers (2025-12-20T05:46:57Z)
- Empowering LLM Agents with Geospatial Awareness: Toward Grounded Reasoning for Wildfire Response [9.801192259936888]
Existing statistical approaches often lack semantic context, generalize poorly across events, and offer limited interpretability. We introduce a Geospatial Awareness Layer (GAL) that grounds LLM agents in structured earth data. GAL automatically retrieves and integrates infrastructure, demographic, terrain, and weather information from external geo-databases. This enriched context enables agents to produce evidence-based resource-allocation recommendations.
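The awareness-layer idea reduces to gathering per-location records from several databases and merging them into one context an agent can consume. A minimal sketch, with all database contents, keys, and names invented for illustration:

```python
# Hypothetical "geospatial awareness layer": merge per-layer records for a
# location into one grounded context dict for an LLM agent.

INFRA = {"cell_42": {"hospitals": 2, "roads_km": 35}}
DEMOG = {"cell_42": {"population": 18000}}
TERRAIN = {"cell_42": {"slope_deg": 12, "fuel_load": "high"}}
WEATHER = {"cell_42": {"wind_kph": 40, "humidity_pct": 15}}

def build_context(cell_id, *layers):
    """Retrieve and flatten every layer's record for one grid cell."""
    ctx = {"cell": cell_id}
    for layer in layers:
        ctx.update(layer.get(cell_id, {}))  # missing layers contribute nothing
    return ctx

print(build_context("cell_42", INFRA, DEMOG, TERRAIN, WEATHER))
```

In a real deployment each layer would be a query against an external geo-database rather than an in-memory dict; the merge-then-prompt pattern is what the sketch shows.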
arXiv Detail & Related papers (2025-10-14T01:59:02Z)
- GRE Suite: Geo-localization Inference via Fine-Tuned Vision-Language Models and Enhanced Reasoning Chains [20.788130896943663]
The Geo Reason Enhancement (GRE) Suite is a novel framework that augments Vision-Language Models with structured reasoning chains for interpretable location inference. First, we introduce GRE30K, a high-quality geo-localization reasoning dataset designed to facilitate fine-grained visual and contextual analysis. Next, we present the GRE model, which employs a multi-stage reasoning strategy to progressively infer scene attributes, local details, and semantic features, thereby narrowing down potential geographic regions with enhanced precision.
arXiv Detail & Related papers (2025-05-24T13:48:57Z)
- OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence [51.0456395687016]
Multimodal large language models (MLLMs) have opened new frontiers in artificial intelligence. We propose an MLLM (OmniGeo) tailored to geospatial applications. By combining the strengths of natural language understanding and spatial reasoning, our model enhances instruction following and the accuracy of GeoAI systems.
arXiv Detail & Related papers (2025-03-20T16:45:48Z)
- Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
By inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
- Into the Unknown: Generating Geospatial Descriptions for New Environments [18.736071151303726]
The Rendezvous task requires reasoning over allocentric spatial relationships.
Using open-source descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text.
We propose a large-scale augmentation method for generating high-quality synthetic data for new environments.
arXiv Detail & Related papers (2024-06-28T14:56:21Z)
- Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
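A semantic map held as a per-cell distribution over region labels can be sketched with a simple multiplicative update as observations arrive. The labels, scores, and update rule below are illustrative assumptions, not the paper's method:

```python
# Sketch of a per-cell distribution over region labels, updated as new
# egocentric observations are projected into the global frame.

LABELS = ["kitchen", "bedroom", "hallway"]

def update_cell(prior, likelihood):
    """Bayesian-style update: multiply prior by observation likelihood,
    then renormalize so the cell remains a distribution over labels."""
    post = {l: prior[l] * likelihood.get(l, 1e-6) for l in prior}
    z = sum(post.values())
    return {l: p / z for l, p in post.items()}

cell = {l: 1 / len(LABELS) for l in LABELS}             # uniform prior
obs = {"kitchen": 0.7, "bedroom": 0.2, "hallway": 0.1}  # scene-model scores
cell = update_cell(cell, obs)
print(max(cell, key=cell.get))  # → kitchen
```

Each map location keeps its own distribution; repeated observations from different viewpoints sharpen the label posterior at that cell.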
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
- Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z)
- GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We provide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z)
- Geographic Adaptation of Pretrained Language Models [29.81557992080902]
We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup.
We show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the pretrained language models.
arXiv Detail & Related papers (2022-03-16T11:55:00Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
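Building a topological map on the fly, as this abstract describes, means adding each visited viewpoint as a node, linking it to observed neighbors, and planning globally over the graph so far. A minimal sketch with invented node names and exploration trace:

```python
# Illustrative on-the-fly topological map: nodes are viewpoints, edges are
# observed adjacencies; global planning is a BFS over the map built so far.
from collections import deque

def add_observation(graph, node, neighbors):
    """Grow the topological map with one step of exploration."""
    graph.setdefault(node, set()).update(neighbors)
    for n in neighbors:
        graph.setdefault(n, set()).add(node)

def shortest_path(graph, start, goal):
    """Plan in the global action space over the current map."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for n in graph.get(path[-1], ()):
            if n not in seen:
                seen.add(n)
                queue.append(path + [n])
    return None  # goal not yet in the map

g = {}
add_observation(g, "entrance", {"hall"})
add_observation(g, "hall", {"kitchen", "stairs"})
add_observation(g, "stairs", {"bedroom"})
print(shortest_path(g, "entrance", "bedroom"))
```

DUET pairs this coarse global graph with fine-grained cross-modal attention at the current node; the sketch only covers the map-building half.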
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.