AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models
- URL: http://arxiv.org/abs/2508.10667v1
- Date: Thu, 14 Aug 2025 14:06:28 GMT
- Title: AddressVLM: Cross-view Alignment Tuning for Image Address Localization using Large Vision-Language Models
- Authors: Shixiong Xu, Chenghao Zhang, Lubin Fan, Yuan Zhou, Bin Fan, Shiming Xiang, Gaofeng Meng, Jieping Ye
- Abstract summary: Large vision-language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images.
- Score: 61.350774745321566
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (LVLMs) have demonstrated impressive performance in coarse-grained geo-localization at the country or city level, but they struggle with fine-grained street-level localization within urban areas. In this paper, we explore integrating city-wide address localization capabilities into LVLMs, facilitating flexible address-related question answering using street-view images. A key challenge is that street-view visual question answering (VQA) data provides only microscopic visual cues, leading to subpar performance in fine-tuned models. To tackle this issue, we incorporate perspective-invariant satellite images as macro cues and propose cross-view alignment tuning, which comprises a satellite-view and street-view image grafting mechanism along with an automatic label generation mechanism. The LVLM's global understanding of street distribution is then enhanced through cross-view matching. Our proposed model, AddressVLM, is trained with a two-stage protocol: cross-view alignment tuning followed by address localization tuning. Furthermore, we have constructed two street-view VQA datasets based on image address localization datasets from Pittsburgh and San Francisco. Qualitative and quantitative evaluations demonstrate that AddressVLM outperforms counterpart LVLMs by over 9% and 12% in average address localization accuracy on these two datasets, respectively.
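As a rough illustration of the grafting idea (not the authors' exact recipe; the side-by-side layout, tile size, and function name below are assumptions), one could compose each training image from a street-view photo and its satellite tile like this:

```python
# Hypothetical sketch of the satellite/street-view "grafting" step: pair a
# perspective-invariant satellite tile with a street-view photo so the LVLM
# sees macro and micro cues in a single image. The side-by-side layout, tile
# size, and function name are assumptions, not the authors' published recipe.
from PIL import Image

def graft_views(street_path: str, satellite_path: str,
                tile_size: int = 336) -> Image.Image:
    """Compose one grafted training image from a street view and a satellite tile."""
    street = Image.open(street_path).convert("RGB").resize((tile_size, tile_size))
    satellite = Image.open(satellite_path).convert("RGB").resize((tile_size, tile_size))
    canvas = Image.new("RGB", (2 * tile_size, tile_size))
    canvas.paste(street, (0, 0))             # micro cues: storefronts, signage
    canvas.paste(satellite, (tile_size, 0))  # macro cues: road layout, blocks
    return canvas

# An automatic label generator could then emit cross-view matching questions,
# e.g. "Does the satellite tile on the right cover the street on the left?"
```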
Related papers
- STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning [65.36458157092207]
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. We propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. We introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization.
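The summary names three reward terms; a minimal sketch of such a task-driven reward, assuming temporal IoU for accuracy, box IoU for spatial consistency, a binary format check, and made-up weights:

```python
# Minimal sketch of a task-driven STVG reward with the three terms the summary
# names. Temporal IoU, box IoU, and the 0.4/0.4/0.2 weights are assumptions.
def interval_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(p: tuple, g: tuple) -> float:
    """Boxes as (x1, y1, x2, y2)."""
    ix = max(0.0, min(p[2], g[2]) - max(p[0], g[0]))
    iy = max(0.0, min(p[3], g[3]) - max(p[1], g[1]))
    inter = ix * iy
    union = (p[2] - p[0]) * (p[3] - p[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    return inter / union if union > 0 else 0.0

def stvg_reward(pred: dict, gt: dict, well_formatted: bool) -> float:
    r_t = interval_iou(pred["span"], gt["span"])  # temporal accuracy
    r_s = box_iou(pred["box"], gt["box"])         # spatial consistency
    r_f = 1.0 if well_formatted else 0.0          # structural format check
    return 0.4 * r_t + 0.4 * r_s + 0.2 * r_f
```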
arXiv Detail & Related papers (2026-02-12T08:53:32Z)
- CLNet: Cross-View Correspondence Makes a Stronger Geo-Localizationer [48.52152634356309]
We propose a correspondence-aware feature refinement framework, termed CLNet, that explicitly bridges the semantic and geometric gaps between different views. CLNet decomposes the view alignment process into three learnable and complementary modules. Our proposed CLNet achieves state-of-the-art performance while offering better interpretability and generalizability.
arXiv Detail & Related papers (2025-12-16T16:31:41Z)
- From Street to Orbit: Training-Free Cross-View Retrieval via Location Semantics and LLM Guidance [10.533095161205358]
Cross-view image retrieval is critical for applications such as autonomous navigation, urban planning, and localization in GPS-denied environments. We present a simple yet effective cross-view image retrieval framework that leverages a pretrained vision encoder and a large language model (LLM). Despite using no ground-truth supervision or finetuning, our proposed method outperforms prior learning-based approaches on the benchmark dataset under zero-shot settings.
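A minimal sketch of the training-free retrieval core, assuming a frozen pretrained encoder has already produced embeddings; the encoder choice and the LLM guidance step (e.g., narrowing the satellite gallery from inferred location semantics) are not shown:

```python
# Sketch of training-free cross-view retrieval: rank a satellite gallery
# against a street-view query using frozen-encoder embeddings and cosine
# similarity. How the embeddings are produced is outside this snippet.
import numpy as np

def rank_gallery(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 5):
    """query_emb: (D,); gallery_embs: (N, D). Returns top-k indices and scores."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                   # cosine similarity per gallery image
    order = np.argsort(-sims)[:k]  # best matches first
    return order, sims[order]
```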
arXiv Detail & Related papers (2025-11-12T23:51:46Z)
- GLEAM: Learning to Match and Explain in Cross-View Geo-Localization [66.11208984986813]
Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. We present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities (UAV imagery, street maps, panoramic views, and ground photographs) by aligning them exclusively with satellite imagery. To address the lack of interpretability in traditional CVGL methods, we propose GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning.
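A hedged sketch of the satellite-centric alignment idea: each modality is tied to satellite embeddings with a symmetric InfoNCE-style loss. The loss form, temperature, and names are assumptions, not GLEAM-C's published objective:

```python
# Assumed satellite-centric alignment: tie every other modality (UAV, street
# map, panorama, ground photo) to satellite embeddings with a symmetric
# InfoNCE-style contrastive loss.
import torch
import torch.nn.functional as F

def infonce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """a, b: (N, D) paired embeddings; row i of a matches row i of b."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def satellite_centric_loss(sat_emb: torch.Tensor,
                           other_views: dict[str, torch.Tensor]) -> torch.Tensor:
    """Each modality is aligned exclusively with the satellite embedding."""
    return sum(infonce(sat_emb, emb) for emb in other_views.values())
```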
arXiv Detail & Related papers (2025-09-09T07:14:31Z)
- CoMemo: LVLMs Need Image Context with Image Memory [51.681858871027345]
CoMemo is a dual-path architecture that combines a Context image path with an image Memory path for visual processing. We introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness.
arXiv Detail & Related papers (2025-06-06T17:59:06Z)
- Visual Position Prompt for MLLM based Visual Grounding [29.34950670755899]
We introduce VPP-LLaVA, an MLLM enhanced with Visual Position Prompt to improve its grounding capability. We also introduce VPP-SFT, a curated dataset of 0.6M high-quality visual grounding samples. The resulting model achieves state-of-the-art results on standard visual grounding benchmarks.
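One plausible rendering of a visual position prompt is an explicit coordinate overlay drawn onto the input image; the grid spacing, color, and labels below are illustrative assumptions rather than VPP-LLaVA's actual prompt design:

```python
# Assumed form of a visual position prompt: draw a labeled pixel grid on the
# image so positions can be referenced directly in text.
from PIL import Image, ImageDraw

def add_position_prompt(img: Image.Image, step: int = 100) -> Image.Image:
    out = img.convert("RGB").copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for x in range(0, w, step):                      # vertical grid lines
        draw.line([(x, 0), (x, h)], fill=(255, 0, 0))
        draw.text((x + 2, 2), str(x), fill=(255, 0, 0))
    for y in range(0, h, step):                      # horizontal grid lines
        draw.line([(0, y), (w, y)], fill=(255, 0, 0))
        draw.text((2, y + 2), str(y), fill=(255, 0, 0))
    return out
```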
arXiv Detail & Related papers (2025-03-19T17:08:13Z)
- OSMLoc: Single Image-Based Visual Localization in OpenStreetMap with Fused Geometric and Semantic Guidance [11.085165252259042]
OSMLoc is a brain-inspired visual localization approach that matches first-person-view images against OpenStreetMap maps. It integrates semantic and geometric guidance to significantly improve accuracy, robustness, and generalization capability.
arXiv Detail & Related papers (2024-11-13T14:59:00Z)
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors. Our goal is to resolve this with a vision backbone that effectively captures both local and global image semantics. We propose a new efficient post-training stage for ViTs called locality alignment and a novel fine-tuning procedure called MaskEmbed.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
Through inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with richer semantics.
We have built three datasets from Pittsburgh and San Francisco at different scales specifically for the IAL problem.
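As a generic stand-in for the idea (not the AddressCLIP release), a CLIP-style zero-shot scorer over candidate address strings conveys how image-address matching could work; the prompt template and ViT-B-32 backbone are assumptions:

```python
# Generic CLIP-style image-address matching sketch using the public open_clip
# API; not AddressCLIP's trained model. Scores a street-view image against
# candidate address strings and returns the best match.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def best_address(image_path: str, candidates: list[str]) -> str:
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    texts = tokenizer([f"a street-view photo near {a}" for a in candidates])
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(texts)
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        sims = (img @ txt.t()).squeeze(0)  # one score per candidate address
    return candidates[int(sims.argmax())]
```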
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model [6.135404769437841]
This work tackles the problem of geo-localization with a new paradigm using a large vision-language model (LVLM).
Existing street-view datasets often contain numerous low-quality images that lack visual clues, and they offer no reasoning annotations.
To address the data-quality issue, we devise a CLIP-based network to quantify the degree of street-view images being locatable.
To enhance reasoning inference, we integrate external knowledge obtained from real geo-localization games, tapping into valuable human inference capabilities.
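The locatability network is described only as CLIP-based; a hedged sketch is a small trainable head over frozen CLIP image features that scores how locatable an image is, with the architecture and filtering threshold assumed:

```python
# Assumed CLIP-based locatability filter: a small head on frozen CLIP image
# features predicts how "locatable" a street view is, so low-quality images
# can be dropped before training.
import torch
import torch.nn as nn

class LocatabilityHead(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(clip_feats).squeeze(-1)  # (N,) scores in [0, 1]

# Usage: keep only images whose score clears a chosen threshold.
# keep_mask = LocatabilityHead()(frozen_clip_features) > 0.5
```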
arXiv Detail & Related papers (2024-06-03T18:08:56Z)
- CVLNet: Cross-View Semantic Correspondence Learning for Video-based Camera Localization [89.69214577915959]
This paper tackles the problem of cross-view video-based camera localization.
We propose estimating the query camera's relative displacement to a satellite image before similarity matching.
Experiments have demonstrated the effectiveness of video-based localization over single image-based localization.
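A hedged sketch of "estimate displacement, then match": translate the query's projected feature map by a predicted offset so it aligns with the satellite feature map before computing similarity. The offset estimator and the normalized-offset convention are assumptions, not CVLNet's exact design:

```python
# Assumed shift-then-match step: align the query feature map to the satellite
# feature map with a predicted 2D translation, then compare by cosine
# similarity.
import torch
import torch.nn.functional as F

def shift_then_match(query_feat: torch.Tensor, sat_feat: torch.Tensor,
                     dx: float, dy: float) -> torch.Tensor:
    """query_feat, sat_feat: (1, C, H, W); dx, dy: normalized offsets in [-1, 1]."""
    theta = torch.tensor([[[1.0, 0.0, dx], [0.0, 1.0, dy]]])
    grid = F.affine_grid(theta, list(query_feat.shape), align_corners=False)
    aligned = F.grid_sample(query_feat, grid, align_corners=False)
    return F.cosine_similarity(aligned.flatten(1), sat_feat.flatten(1))
```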
arXiv Detail & Related papers (2022-08-07T07:35:17Z)