UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling
- URL: http://arxiv.org/abs/2403.16831v2
- Date: Wed, 29 May 2024 06:11:30 GMT
- Title: UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling
- Authors: Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, Yuxuan Liang
- Abstract summary: Urban region profiling aims to learn a low-dimensional representation of a given urban area.
Pretrained models, particularly those reliant on satellite imagery, face dual challenges.
Concentrating solely on macro-level patterns from satellite data may introduce bias.
The lack of interpretability in pretrained models limits their utility in providing transparent evidence for urban planning.
- Score: 26.693692853787756
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Urban region profiling aims to learn a low-dimensional representation of a given urban area while preserving its characteristics, such as demographics, infrastructure, and economic activities, for urban planning and development. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced details at micro levels, such as the architectural details of a place. Secondly, the lack of interpretability in pretrained models limits their utility in providing transparent evidence for urban planning. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, elevating interpretability in downstream applications by producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six urban indicator prediction tasks underscore its superior performance.
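The image-text alignment that UrbanVLP (like CLIP-style pretraining) builds on can be sketched as a symmetric InfoNCE objective: matched image and text embeddings should score higher than all mismatched pairs in a batch. The dependency-free sketch below is illustrative only, not the paper's training code; plain Python lists stand in for learned embeddings from the satellite/street-view and text towers.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss: matched (image, text) pairs share an index."""
    image_embs = [l2_normalize(v) for v in image_embs]
    text_embs = [l2_normalize(v) for v in text_embs]
    n = len(image_embs)
    # Cosine-similarity logits, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]

    def nll(row, target):
        # Numerically stable cross-entropy for one row of logits.
        m = max(row)
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        return log_sum - row[target]

    loss_i2t = sum(nll(logits[i], i) for i in range(n)) / n
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = sum(nll(cols[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

In a real training loop the loss would be computed over encoder outputs and backpropagated; the batch-diagonal pairing (index i images match index i texts) is the key idea.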
Related papers
- StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model [12.789465279993864]
Geospatial predictions are crucial for diverse fields such as disaster management, urban planning, and public health.
We propose StreetViewLLM, a novel framework that integrates a large language model with the chain-of-thought reasoning and multimodal data sources.
The model has been applied to seven global cities, including Hong Kong, Tokyo, Singapore, Los Angeles, New York, London, and Paris.
arXiv Detail & Related papers (2024-11-19T05:15:19Z) - Multimodal Contrastive Learning of Urban Space Representations from POI Data [2.695321027513952]
CaLLiPer (Contrastive Language-Location Pre-training) is a representation learning model that embeds continuous urban spaces into vector representations.
We validate CaLLiPer's effectiveness by applying it to learning urban space representations in London, UK.
arXiv Detail & Related papers (2024-11-09T16:24:07Z) - StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting [85.67616000086232]
StreetSurfGS is first method to employ Gaussian Splatting specifically tailored for scalable urban street scene surface reconstruction.
StreetSurfGS utilizes a planar-based octree representation and segmented training to reduce memory costs, accommodate unique camera characteristics, and ensure scalability.
To address sparse views and multi-scale challenges, we use a dual-step matching strategy that leverages adjacent and long-term information.
arXiv Detail & Related papers (2024-10-06T04:21:59Z) - UV-SAM: Adapting Segment Anything Model for Urban Village Identification [25.286722125746902]
Governments heavily depend on field survey methods to monitor urban villages.
To accurately identify urban village boundaries from satellite images, we adapt the Segment Anything Model (SAM) to urban village segmentation, named UV-SAM.
UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification.
arXiv Detail & Related papers (2024-01-16T03:21:42Z) - UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web [37.332601383723585]
This paper introduces the first-ever framework that integrates the knowledge of textual modality into urban imagery profiling.
It generates a detailed textual description for each satellite image by an open-source Image-to-Text LLM.
The model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning.
arXiv Detail & Related papers (2023-10-22T02:32:53Z) - Dual-stage Flows-based Generative Modeling for Traceable Urban Planning [33.03616838528995]
We propose a novel generative framework based on normalizing flows, namely Dual-stage Urban Flows framework.
We employ an Information Fusion Module to capture the relationship among functional zones and fuse the information of different aspects.
Our framework outperforms other generative models on the urban planning task.
arXiv Detail & Related papers (2023-10-03T21:49:49Z) - Unified Data Management and Comprehensive Performance Evaluation for Urban Spatial-Temporal Prediction [Experiment, Analysis & Benchmark] [78.05103666987655]
This work addresses challenges in accessing and utilizing diverse urban spatial-temporal datasets.
We introduce atomic files, a unified storage format designed for urban spatial-temporal big data, and validate its effectiveness on 40 diverse datasets.
We conduct extensive experiments using diverse models and datasets, establishing a performance leaderboard and identifying promising research directions.
arXiv Detail & Related papers (2023-08-24T16:20:00Z) - UrbanBIS: a Large-scale Benchmark for Fine-grained Urban Building Instance Segmentation [50.52615875873055]
UrbanBIS comprises six real urban scenes, with 2.5 billion points, covering a vast area of 10.78 square kilometers.
UrbanBIS provides semantic-level annotations on a rich set of urban objects, including buildings, vehicles, vegetation, roads, and bridges.
UrbanBIS is the first 3D dataset that introduces fine-grained building sub-categories.
arXiv Detail & Related papers (2023-05-04T08:01:38Z) - A Contextual Master-Slave Framework on Urban Region Graph for Urban Village Detection [68.84486900183853]
We build an urban region graph (URG) to model the urban area in a hierarchically structured way.
Then, we design a novel contextual master-slave framework to effectively detect the urban village from the URG.
The proposed framework can learn to balance the generality and specificity for UV detection in an urban area.
arXiv Detail & Related papers (2022-11-26T18:17:39Z) - SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
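SimVLM trains with a single prefix language-modeling objective: attention is bidirectional over the prefix (image patches plus leading text) and causal over the remaining tokens. A minimal sketch of such a mask is shown below; this is illustrative only, not the authors' implementation.

```python
def prefix_lm_mask(prefix_len, seq_len):
    """Build a boolean attention mask for prefix language modeling.

    mask[i][j] is True when query position i may attend to position j:
    every position sees the whole prefix (bidirectional), while positions
    past the prefix attend causally (only to themselves and the past).
    """
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]
```

In practice this mask is added (as 0 / -inf values) to the attention scores before the softmax.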
arXiv Detail & Related papers (2021-08-24T18:14:00Z) - Methodological Foundation of a Numerical Taxonomy of Urban Form [62.997667081978825]
We present a method for numerical taxonomy of urban form derived from biological systematics.
We derive homogeneous urban tissue types and, by determining overall morphological similarity between them, generate a hierarchical classification of urban form.
After framing and presenting the method, we test it on two cities - Prague and Amsterdam.
arXiv Detail & Related papers (2021-04-30T12:47:52Z)
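The hierarchical classification step above, grouping urban tissue types by overall morphological similarity, is in spirit an agglomerative clustering. The toy average-linkage sketch below is illustrative only; the paper's taxonomy works on much richer morphometric feature vectors than these 2-D points.

```python
import math

def agglomerate(points, n_clusters):
    """Bottom-up average-linkage clustering over feature vectors.

    Each cluster is a list of point indices; the two closest clusters
    (by mean pairwise Euclidean distance) are merged until n_clusters remain.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return sum(math.dist(points[i], points[j]) for i in a for j in b) / (
            len(a) * len(b)
        )

    while len(clusters) > n_clusters:
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

Recording the merge order (rather than stopping at a fixed cluster count) yields the full hierarchy from which a taxonomy can be cut at any level.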
This list is automatically generated from the titles and abstracts of the papers listed on this site.