Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision
- URL: http://arxiv.org/abs/2511.07062v1
- Date: Mon, 10 Nov 2025 12:53:32 GMT
- Title: Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision
- Authors: Yimei Zhang, Guojiang Shen, Kaili Ning, Tongwei Ren, Xuebo Qiu, Mengmeng Wang, Xiangjie Kong
- Abstract summary: Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. We propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression.
- Score: 19.72633898920108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual's health, the visual appearance of a city serves as its "portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i) difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce an information-preserved stretching interpolation strategy that aligns long captions with fine-grained visual semantics in complex urban scenes. To effectively mine knowledge from LLM-generated captions and filter out noise, we propose a dual-level optimization strategy. At the data level, a multi-model collaboration pipeline automatically generates diverse and reliable captions without human intervention. At the model level, we employ a momentum-based self-distillation mechanism to generate stable pseudo-targets, facilitating robust cross-modal learning under noisy conditions. Extensive experiments across four real-world cities and various downstream tasks demonstrate the superior performance of our UrbanLN.
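The abstract names two mechanisms without giving implementation details: an information-preserved stretching interpolation that lets a CLIP-style text encoder consume long captions, and a momentum-based self-distillation that yields stable pseudo-targets under caption noise. The sketch below is only a rough illustration of how such components are commonly realized in PyTorch; the function names, the preserved-prefix length `keep`, the linear interpolation scheme, and the batch-level soft targets are assumptions, not the authors' code.

```python
# Minimal sketch (not the UrbanLN implementation) of the two mechanisms named
# in the abstract. All names and hyperparameters here are illustrative.
import copy
import torch
import torch.nn.functional as F


def stretch_positional_embeddings(pos_embed: torch.Tensor, new_len: int,
                                  keep: int = 20) -> torch.Tensor:
    """Extend text positional embeddings from (old_len, dim) to (new_len, dim).

    Assumption: the first `keep` positions, which are well trained on short
    captions, are preserved unchanged, and only the remaining positions are
    linearly "stretched" to cover the longer caption length.
    """
    old_len, dim = pos_embed.shape
    head = pos_embed[:keep]                                  # preserved positions
    tail = pos_embed[keep:].T.unsqueeze(0)                   # (1, dim, old_len - keep)
    tail = F.interpolate(tail, size=new_len - keep,
                         mode="linear", align_corners=True)  # stretch to new length
    return torch.cat([head, tail.squeeze(0).T], dim=0)       # (new_len, dim)


class MomentumTeacher:
    """EMA copy of the student encoder that produces stable pseudo-targets."""

    def __init__(self, student: torch.nn.Module, momentum: float = 0.999):
        self.momentum = momentum
        self.teacher = copy.deepcopy(student).eval()
        for p in self.teacher.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, student: torch.nn.Module) -> None:
        # Exponential moving average of the student weights.
        for pt, ps in zip(self.teacher.parameters(), student.parameters()):
            pt.mul_(self.momentum).add_(ps, alpha=1.0 - self.momentum)

    @torch.no_grad()
    def pseudo_targets(self, images: torch.Tensor, temperature: float = 0.05):
        # Soft similarity targets over the batch; the student's alignment to
        # noisy captions can be regularized toward these instead of relying
        # only on hard one-hot image-text pairs.
        feats = F.normalize(self.teacher(images), dim=-1)
        return F.softmax(feats @ feats.T / temperature, dim=-1)
```

In a training loop one would call `MomentumTeacher.update(student)` after each optimizer step and blend the soft pseudo-targets with the usual contrastive targets; the exact weighting and the "information-preserved" variant of the interpolation used by UrbanLN are not specified in the abstract.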
Related papers
- GLip: A Global-Local Integrated Progressive Framework for Robust Visual Speech Recognition [72.29071664964633]
We propose GLip, a Global-Local Integrated Progressive framework designed for robust visual speech recognition (VSR). GLip learns to align both global and local visual features with corresponding acoustic speech units using easily accessible audio-visual data. In the second stage, we introduce a Contextual Enhancement Module (CEM) to dynamically integrate local features with relevant global context.
arXiv Detail & Related papers (2025-09-19T14:36:01Z) - HiLa: Hierarchical Vision-Language Collaboration for Cancer Survival Prediction [55.00788339683146]
We propose a novel Hierarchical Vision-Language collaboration framework for improved survival prediction. Specifically, HiLa employs pretrained feature extractors to generate hierarchical visual features from whole-slide images (WSIs) at both patch and region levels. This approach enables the comprehensive learning of discriminative visual features corresponding to different survival-related attributes from prompts.
arXiv Detail & Related papers (2025-07-07T02:06:25Z) - Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes [0.9208007322096533]
This paper introduces SAGAI (Streetscape Analysis with Generative Artificial Intelligence), a modular workflow for scoring street-level urban scenes using open-access data and vision-language models. It operates without task-specific training or proprietary software dependencies.
arXiv Detail & Related papers (2025-04-23T09:08:06Z) - MobiCLR: Mobility Time Series Contrastive Learning for Urban Region Representations [18.07010464073212]
We propose a novel urban region representation learning model, which captures semantically meaningful embeddings from inflow and outflow mobility patterns. We conduct experiments in Chicago, New York, and Washington, D.C. to predict income, educational attainment, and social vulnerability.
arXiv Detail & Related papers (2025-02-05T06:18:43Z) - FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity [68.15983300711355]
FINECAPTION is a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different levels.
We also introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning, which defines the task of compositional attribute-aware regional image captioning.
arXiv Detail & Related papers (2024-11-23T02:20:32Z) - Urban Region Pre-training and Prompting: A Graph-based Approach [10.375941950028938]
We propose a Graph-based Urban Region Pre-training and Prompting framework for region representation learning.
arXiv Detail & Related papers (2024-08-12T05:00:23Z) - AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z) - UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction [26.693692853787756]
Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes. Pretrained models, particularly those reliant on satellite imagery, face dual challenges.
arXiv Detail & Related papers (2024-03-25T14:57:18Z) - UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web [37.332601383723585]
This paper introduces the first-ever framework that integrates the knowledge of textual modality into urban imagery profiling.
It generates a detailed textual description for each satellite image using an open-source Image-to-Text LLM.
The model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning.
arXiv Detail & Related papers (2023-10-22T02:32:53Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.