UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling
- URL: http://arxiv.org/abs/2403.16831v2
- Date: Wed, 29 May 2024 06:11:30 GMT
- Title: UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling
- Authors: Xixuan Hao, Wei Chen, Yibo Yan, Siru Zhong, Kun Wang, Qingsong Wen, Yuxuan Liang
- Abstract summary: Urban region profiling aims to learn a low-dimensional representation of a given urban area.
Pretrained models, particularly those reliant on satellite imagery, face dual challenges.
Concentrating solely on macro-level patterns from satellite data may introduce bias.
The lack of interpretability in pretrained models limits their utility in providing transparent evidence for urban planning.
- Score: 26.693692853787756
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Urban region profiling aims to learn a low-dimensional representation of a given urban area while preserving its characteristics, such as demographics, infrastructure, and economic activities, for urban planning and development. However, prevalent pretrained models, particularly those reliant on satellite imagery, face dual challenges. Firstly, concentrating solely on macro-level patterns from satellite data may introduce bias, lacking nuanced details at micro levels, such as architectural details at a place. Secondly, the lack of interpretability in pretrained models limits their utility in providing transparent evidence for urban planning. In response to these issues, we devise a novel framework entitled UrbanVLP based on Vision-Language Pretraining. Our UrbanVLP seamlessly integrates multi-granularity information from both macro (satellite) and micro (street-view) levels, overcoming the limitations of prior pretrained models. Moreover, it introduces automatic text generation and calibration, elevating interpretability in downstream applications by producing high-quality text descriptions of urban imagery. Rigorous experiments conducted across six urban indicator prediction tasks underscore its superior performance.
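The multi-granularity idea can be pictured as two CLIP-style contrastive objectives sharing the same text embeddings: one aligning macro-level satellite features with text, one aligning micro-level street-view features with text. The sketch below is illustrative only, not the authors' implementation; the encoders, feature dimensions, and variable names are placeholders.

```python
# Minimal sketch of multi-granularity image-text contrastive pretraining
# (illustrative only; encoders, dimensions, and names are placeholders,
# not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityEncoder(nn.Module):
    """Stand-in visual encoder that maps pooled image features to a shared embedding space."""
    def __init__(self, in_dim=2048, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, feats):
        return F.normalize(self.proj(feats), dim=-1)

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style pretraining."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Macro (satellite) and micro (street-view) branches aligned with the same text.
sat_enc, street_enc, txt_proj = GranularityEncoder(), GranularityEncoder(), nn.Linear(768, 256)
sat_feats = torch.randn(8, 2048)      # pooled satellite-image features (placeholder)
street_feats = torch.randn(8, 2048)   # pooled street-view features (placeholder)
txt_feats = torch.randn(8, 768)       # text features from a frozen language model (placeholder)

txt_emb = F.normalize(txt_proj(txt_feats), dim=-1)
loss = info_nce(sat_enc(sat_feats), txt_emb) + info_nce(street_enc(street_feats), txt_emb)
loss.backward()
```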
Related papers
- StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model [12.789465279993864]
Geospatial predictions are crucial for diverse fields such as disaster management, urban planning, and public health.
We propose StreetViewLLM, a novel framework that integrates a large language model with chain-of-thought reasoning and multimodal data sources.
The model has been applied to seven global cities, including Hong Kong, Tokyo, Singapore, Los Angeles, New York, London, and Paris.
arXiv Detail & Related papers (2024-11-19T05:15:19Z) - Multimodal Contrastive Learning of Urban Space Representations from POI Data [2.695321027513952]
CaLLiPer (Contrastive Language-Location Pre-training) is a representation learning model that embeds continuous urban spaces into vector representations.
We validate CaLLiPer's effectiveness by applying it to learning urban space representations in London, UK.
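The summary above does not spell out how continuous coordinates are embedded. A common choice, assumed here purely for illustration rather than taken from CaLLiPer, is a Fourier-feature location encoder whose outputs can then be aligned with text embeddings via a contrastive loss like the one sketched earlier.

```python
# Illustrative location encoder for continuous coordinates (an assumption,
# not necessarily CaLLiPer's actual architecture): Fourier features of
# (lon, lat) followed by an MLP.
import torch
import torch.nn as nn

class FourierLocationEncoder(nn.Module):
    def __init__(self, num_freqs=16, embed_dim=256):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs).float()   # geometric frequency ladder
        self.mlp = nn.Sequential(
            nn.Linear(4 * num_freqs, 512), nn.ReLU(), nn.Linear(512, embed_dim)
        )

    def forward(self, coords):                                 # coords: (batch, 2) in [0, 1]
        angles = coords.unsqueeze(-1) * self.freqs             # (batch, 2, num_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
        return self.mlp(feats)

encoder = FourierLocationEncoder()
coords = torch.rand(8, 2)              # normalized (lon, lat) pairs, placeholder data
loc_emb = encoder(coords)              # (8, 256) location embeddings
```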
arXiv Detail & Related papers (2024-11-09T16:24:07Z) - StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting [85.67616000086232]
StreetSurfGS is the first method to employ Gaussian Splatting specifically tailored for scalable urban street scene surface reconstruction.
StreetSurfGS utilizes a planar-based octree representation and segmented training to reduce memory costs, accommodate unique camera characteristics, and ensure scalability.
To address sparse views and multi-scale challenges, we use a dual-step matching strategy that leverages adjacent and long-term information.
arXiv Detail & Related papers (2024-10-06T04:21:59Z) - MuseCL: Predicting Urban Socioeconomic Indicators via Multi-Semantic Contrastive Learning [13.681538916025021]
MuseCL is a framework for fine-grained urban region profiling and socioeconomic prediction.
We construct contrastive sample pairs for street view and remote sensing images, capitalizing on similarities in human mobility.
We extract semantic insights from POI texts embedded within these regions, employing a pre-trained text encoder.
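A simplified view of the pairing step described above: regions with similar human-mobility profiles are treated as positives whose visual embeddings should be pulled together. The snippet is a sketch of that pairing logic only; MuseCL's actual sampling strategy and encoders are not reproduced.

```python
# Illustrative construction of contrastive positives from mobility similarity
# (simplified; not MuseCL's exact procedure).
import torch
import torch.nn.functional as F

num_regions, mobility_dim = 6, 24
mobility = torch.rand(num_regions, mobility_dim)       # e.g. hourly visit counts per region (placeholder)

# Cosine similarity between region mobility profiles.
sim = F.cosine_similarity(mobility.unsqueeze(1), mobility.unsqueeze(0), dim=-1)
sim.fill_diagonal_(-1.0)                               # exclude self-pairs

# For each region, treat the most mobility-similar region as its positive;
# its street-view / remote-sensing embeddings would then be pulled together.
positive_idx = sim.argmax(dim=1)
pairs = list(zip(range(num_regions), positive_idx.tolist()))
print(pairs)
```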
arXiv Detail & Related papers (2024-06-23T09:49:41Z) - CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks [10.22654338686634]
Large language models (LLMs) with extensive general knowledge and powerful reasoning abilities have seen rapid development and widespread application.
In this paper, we design CityBench, an interactive simulator-based evaluation platform.
CityBench covers 8 representative urban tasks in 2 categories: perception-understanding and decision-making.
arXiv Detail & Related papers (2024-06-20T02:25:07Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - UV-SAM: Adapting Segment Anything Model for Urban Village Identification [25.286722125746902]
Governments heavily depend on field survey methods to monitor urban villages.
To accurately identify urban village boundaries from satellite images, we adapt the Segment Anything Model (SAM) to urban village segmentation, which we name UV-SAM.
UV-SAM first leverages a small-sized semantic segmentation model to produce mixed prompts for urban villages, including mask, bounding box, and image representations, which are then fed into SAM for fine-grained boundary identification.
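The two-stage prompt-passing structure described above can be sketched as follows; both models are replaced by stubs, so this shows only the data flow (coarse mask, then box/mask prompts, then refinement), not the real SAM integration.

```python
# Structural sketch of the two-stage idea: a small segmenter proposes a coarse
# mask, box/mask prompts are derived from it, and a promptable segmenter refines
# the boundary. Both models are stubs; the real pipeline would use an actual
# semantic segmentation model and SAM.
import numpy as np

def small_segmenter(image):
    """Stub for the small semantic segmentation model: returns a coarse binary mask."""
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    mask[h // 4: 3 * h // 4, w // 4: 3 * w // 4] = True   # placeholder blob
    return mask

def mask_to_box(mask):
    """Derive a bounding-box prompt (x_min, y_min, x_max, y_max) from a mask."""
    ys, xs = np.where(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def promptable_segmenter(image, box, mask_prompt):
    """Stub standing in for SAM: would refine the boundary given the prompts."""
    refined = np.zeros(image.shape[:2], dtype=bool)
    x0, y0, x1, y1 = box
    refined[y0:y1 + 1, x0:x1 + 1] = True
    return refined

satellite_tile = np.random.rand(256, 256, 3)           # placeholder image
coarse_mask = small_segmenter(satellite_tile)
box_prompt = mask_to_box(coarse_mask)
fine_mask = promptable_segmenter(satellite_tile, box_prompt, coarse_mask)
print(box_prompt, fine_mask.sum())
```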
arXiv Detail & Related papers (2024-01-16T03:21:42Z) - UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web [37.332601383723585]
This paper introduces the first-ever framework that integrates the knowledge of textual modality into urban imagery profiling.
It generates a detailed textual description for each satellite image using an open-source Image-to-Text LLM.
The model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning.
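The caption-then-train recipe can be illustrated with an off-the-shelf image-to-text model; the specific captioner and file names below are assumptions for illustration, not necessarily what UrbanCLIP uses.

```python
# Illustrative caption-generation step: describe satellite tiles with an
# open-source image-to-text model, then keep (image, text) pairs for
# language-supervised pretraining. The captioner choice (BLIP via the
# Hugging Face pipeline) is an assumption.
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def caption_tiles(paths):
    pairs = []
    for path in paths:
        image = Image.open(path).convert("RGB")
        text = captioner(image)[0]["generated_text"]   # generated description of the tile
        pairs.append((path, text))
    return pairs

# pairs = caption_tiles(["tile_001.png", "tile_002.png"])  # hypothetical file names
```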
arXiv Detail & Related papers (2023-10-22T02:32:53Z) - Dual-stage Flows-based Generative Modeling for Traceable Urban Planning [33.03616838528995]
We propose a novel generative framework based on normalizing flows, namely the Dual-stage Urban Flows framework.
We employ an Information Fusion Module to capture the relationship among functional zones and fuse the information of different aspects.
Our framework outperforms other generative models on the urban planning task.
arXiv Detail & Related papers (2023-10-03T21:49:49Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - Unified Data Management and Comprehensive Performance Evaluation for Urban Spatial-Temporal Prediction [Experiment, Analysis & Benchmark] [78.05103666987655]
This work addresses challenges in accessing and utilizing diverse urban spatial-temporal datasets.
We introduce atomic files, a unified storage format designed for urban spatial-temporal big data, and validate its effectiveness on 40 diverse datasets.
We conduct extensive experiments using diverse models and datasets, establishing a performance leaderboard and identifying promising research directions.
arXiv Detail & Related papers (2023-08-24T16:20:00Z) - Bilevel Generative Learning for Low-Light Vision [64.77933848939327]
We propose a generic low-light vision solution by introducing a generative block to convert data from the RAW to the RGB domain.
This novel approach connects diverse vision problems by explicitly depicting data generation, a first in the field.
We develop two types of learning strategies targeting different goals, namely low cost and high accuracy, to acquire a new bilevel generative learning paradigm.
arXiv Detail & Related papers (2023-08-24T16:20:00Z) - UrbanBIS: a Large-scale Benchmark for Fine-grained Urban Building Instance Segmentation [50.52615875873055]
UrbanBIS comprises six real urban scenes, with 2.5 billion points, covering a vast area of 10.78 square kilometers.
UrbanBIS provides semantic-level annotations on a rich set of urban objects, including buildings, vehicles, vegetation, roads, and bridges.
UrbanBIS is the first 3D dataset that introduces fine-grained building sub-categories.
arXiv Detail & Related papers (2023-05-04T08:01:38Z) - Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment [54.31355080688127]
We introduce a text-prompted Semantic Affinity Quality Index (SAQI) and its localized version (SAQI-Local) using Contrastive Language-Image Pre-training (CLIP).
BVQI-Local demonstrates unprecedented performance, surpassing existing zero-shot indices by at least 24% on all datasets.
We conduct comprehensive analyses to investigate different quality concerns of distinct indices, demonstrating the effectiveness and rationality of our design.
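As a rough illustration of text-prompted quality scoring with CLIP (not the paper's exact SAQI/BVQI formulation), one can contrast antonym quality prompts against an image and read off the softmax weight on the positive prompt:

```python
# Minimal sketch of text-prompted CLIP affinity scoring for image quality
# (illustrative; not the paper's exact computation).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def quality_affinity(image: Image.Image) -> float:
    prompts = ["a high quality photo", "a low quality photo"]   # antonym prompt pair
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image               # (1, 2) image-text similarities
    probs = logits.softmax(dim=-1)
    return probs[0, 0].item()                                    # weight on the "good" prompt

# score = quality_affinity(Image.open("frame.png"))              # hypothetical frame path
```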
arXiv Detail & Related papers (2023-04-28T08:06:05Z) - Knowledge-infused Contrastive Learning for Urban Imagery-based Socioeconomic Prediction [13.26632316765164]
Urban imagery on the web, such as satellite and street-view images, has emerged as an important source for socioeconomic prediction.
We propose a Knowledge-infused Contrastive Learning model for urban imagery-based socioeconomic prediction.
Our proposed KnowCL model applies to both satellite and street imagery, achieving both effectiveness and transferability.
arXiv Detail & Related papers (2023-02-25T14:53:17Z) - A Contextual Master-Slave Framework on Urban Region Graph for Urban Village Detection [68.84486900183853]
We build an urban region graph (URG) to model the urban area in a hierarchically structured way.
Then, we design a novel contextual master-slave framework to effectively detect the urban village from the URG.
The proposed framework can learn to balance the generality and specificity for UV detection in an urban area.
arXiv Detail & Related papers (2022-11-26T18:17:39Z) - Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z) - SimVLM: Simple Visual Language Model Pretraining with Weak Supervision [48.98275876458666]
We present a minimalist pretraining framework named Simple Visual Language Model (SimVLM).
SimVLM reduces the training complexity by exploiting large-scale weak supervision.
It achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks.
arXiv Detail & Related papers (2021-08-24T18:14:00Z) - Methodological Foundation of a Numerical Taxonomy of Urban Form [62.997667081978825]
We present a method for numerical taxonomy of urban form derived from biological systematics.
We derive homogeneous urban tissue types and, by determining overall morphological similarity between them, generate a hierarchical classification of urban form.
After framing and presenting the method, we test it on two cities - Prague and Amsterdam.
arXiv Detail & Related papers (2021-04-30T12:47:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.