CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing
- URL: http://arxiv.org/abs/2506.00530v1
- Date: Sat, 31 May 2025 12:25:33 GMT
- Title: CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing
- Authors: Tianhui Liu, Jie Feng, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Yong Li
- Abstract summary: CityLens is a benchmark designed to evaluate the capabilities of large language-vision models (LLVMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment. Our results reveal that while LLVMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators.
- Score: 18.67492140450614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of large language-vision models (LLVMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize three evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LLVMs across these tasks. Our results reveal that while LLVMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LLVMs to understand and predict urban socioeconomic patterns. Our code and datasets are open-sourced at https://github.com/tsinghua-fib-lab/CityLens.
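The abstract names three evaluation paradigms but does not detail how each is scored. The sketch below is a rough, hypothetical illustration (not the authors' released code; the metric choices, function names, and toy data are all assumptions) of how the three paradigms could be evaluated:

```python
# A rough sketch, not the authors' code: metric choices, names, and toy data
# are assumptions illustrating the three CityLens evaluation paradigms.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

def direct_metric_prediction(y_true, y_pred):
    """Paradigm 1: the LLVM outputs the raw indicator value per region."""
    return r2_score(y_true, y_pred)

def normalized_metric_estimation(y_true, y_scores):
    """Paradigm 2: the LLVM outputs a normalized score; rank agreement
    matters more than absolute scale, so use Spearman correlation."""
    rho, _ = spearmanr(y_true, y_scores)
    return rho

def feature_based_regression(features, y_true, folds=5):
    """Paradigm 3: model-derived features feed a lightweight regressor."""
    return cross_val_score(Ridge(alpha=1.0), features, y_true,
                           cv=folds, scoring="r2").mean()

# Hypothetical usage with toy data for 100 regions:
rng = np.random.default_rng(0)
y = rng.normal(size=100)  # ground-truth indicator
print(direct_metric_prediction(y, y + rng.normal(scale=0.3, size=100)))
print(normalized_metric_estimation(y, np.tanh(y)))
print(feature_based_regression(rng.normal(size=(100, 16)), y))
```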
Related papers
- Urban Forms Across Continents: A Data-Driven Comparison of Lausanne and Philadelphia [7.693465097015469]
This study presents a data-driven framework to identify and compare urban typologies across geographically and culturally distinct cities. We extracted multidimensional features related to topography, multimodality, green spaces, and points of interest for the cities of Lausanne, Switzerland, and Philadelphia, USA. The results reveal coherent and interpretable urban typologies within each city, with some cluster types emerging across both cities despite their differences in scale, density, and cultural context.
arXiv Detail & Related papers (2025-05-05T18:13:22Z)
- Collaborative Imputation of Urban Time Series through Cross-city Meta-learning [54.438991949772145]
We propose a novel collaborative imputation paradigm leveraging meta-learned implicit neural representations (INRs). We then introduce a cross-city collaborative learning scheme through model-agnostic meta-learning. Experiments on a diverse urban dataset from 20 global cities demonstrate our model's superior imputation performance and generalizability.
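As a loose illustration of the coordinate-based models this summary refers to, here is a minimal implicit neural representation (INR) fitted to a toy univariate series; the cross-city meta-learning wrapper described in the paper is omitted, and all names and shapes are illustrative rather than the authors' implementation:

```python
# A minimal INR sketch under stated assumptions; not the paper's code.
import torch
import torch.nn as nn

class TimeINR(nn.Module):
    """Coordinate-based MLP: maps a normalized timestamp to a sensor value."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, t):
        return self.net(t)

# Fit on the observed part of a toy series, then query a missing timestamp.
t_obs = torch.rand(50, 1)               # observed timestamps in [0, 1]
y_obs = torch.sin(6.28 * t_obs)         # toy sensor readings
inr = TimeINR()
opt = torch.optim.Adam(inr.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(inr(t_obs), y_obs)
    loss.backward()
    opt.step()
imputed = inr(torch.tensor([[0.5]]))    # imputed value at an unobserved time
```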
arXiv Detail & Related papers (2025-01-20T07:12:40Z)
- StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model [12.789465279993864]
Geospatial predictions are crucial for diverse fields such as disaster management, urban planning, and public health.
We propose StreetViewLLM, a novel framework that integrates a large language model with chain-of-thought reasoning and multimodal data sources.
The model has been applied to seven global cities: Hong Kong, Tokyo, Singapore, Los Angeles, New York, London, and Paris.
arXiv Detail & Related papers (2024-11-19T05:15:19Z)
- Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models [85.55649666025926]
We introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities.
Our dataset includes 400 multimodal samples, each consisting of natural language user instructions, visual images depicting the environment, state changes, and corresponding action plans.
We propose NeuroGround, a neurosymbolic framework that first grounds the plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans.
arXiv Detail & Related papers (2024-09-22T00:30:11Z)
- UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios [60.492736455572015]
We present UrBench, a benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several aspects.
arXiv Detail & Related papers (2024-08-30T13:13:35Z)
- MuseCL: Predicting Urban Socioeconomic Indicators via Multi-Semantic Contrastive Learning [13.681538916025021]
MuseCL is a framework for fine-grained urban region profiling and socioeconomic prediction.
We construct contrastive sample pairs for street view and remote sensing images, capitalizing on similarities in human mobility.
We extract semantic insights from POI texts embedded within these regions, employing a pre-trained text encoder.
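For readers unfamiliar with contrastive pairing, a minimal sketch of a standard symmetric InfoNCE objective over cross-modal (street view, remote sensing) embedding pairs follows; this assumes conventional CLIP-style training and is not MuseCL's actual implementation:

```python
# A minimal sketch of a symmetric InfoNCE loss, assuming CLIP-style training;
# encoders and batch construction are hypothetical, not MuseCL's code.
import torch
import torch.nn.functional as F

def info_nce(street_emb, satellite_emb, temperature=0.07):
    """Mobility-matched (street view, satellite) embeddings at the same batch
    index are positives; every other pairing in the batch is a negative."""
    s = F.normalize(street_emb, dim=-1)
    r = F.normalize(satellite_emb, dim=-1)
    logits = s @ r.t() / temperature      # (B, B) cosine-similarity logits
    targets = torch.arange(s.size(0))     # positives lie on the diagonal
    # Average the loss over both pairing directions (street->sat, sat->street)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Hypothetical usage: random 128-d embeddings for a batch of 8 regions.
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```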
arXiv Detail & Related papers (2024-06-23T09:49:41Z)
- CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks [10.22654338686634]
Evaluating large language models (LLMs) and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability. The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data. In this paper, we design CityBench, an interactive simulator-based evaluation platform.
arXiv Detail & Related papers (2024-06-20T02:25:07Z)
- UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction [26.693692853787756]
Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes. Pretrained models, particularly those reliant on satellite imagery, face dual challenges.
arXiv Detail & Related papers (2024-03-25T14:57:18Z)
- Unified Data Management and Comprehensive Performance Evaluation for Urban Spatial-Temporal Prediction [Experiment, Analysis & Benchmark] [78.05103666987655]
This work addresses challenges in accessing and utilizing diverse urban spatial-temporal datasets.
We introduce atomic files, a unified storage format designed for urban spatial-temporal big data, and validate its effectiveness on 40 diverse datasets.
We conduct extensive experiments using diverse models and datasets, establishing a performance leaderboard and identifying promising research directions.
arXiv Detail & Related papers (2023-08-24T16:20:00Z)
- Conditioned Human Trajectory Prediction using Iterative Attention Blocks [70.36888514074022]
We present a simple yet effective pedestrian trajectory prediction model aimed at predicting pedestrian positions in urban-like environments.
Our model is a neural-based architecture that can run several layers of attention blocks and transformers in an iterative sequential fashion.
We show that without explicit introduction of social masks, dynamical models, social pooling layers, or complicated graph-like structures, it is possible to produce results on par with SoTA models.
arXiv Detail & Related papers (2022-06-29T07:49:48Z)
- Methodological Foundation of a Numerical Taxonomy of Urban Form [62.997667081978825]
We present a method for numerical taxonomy of urban form derived from biological systematics.
We derive homogeneous urban tissue types and, by determining overall morphological similarity between them, generate a hierarchical classification of urban form.
After framing and presenting the method, we test it on two cities: Prague and Amsterdam.
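As a hedged sketch of the hierarchical-classification step such a taxonomy involves (the feature set and Ward linkage are assumptions, not necessarily the paper's exact choices):

```python
# A hedged sketch: Ward-linkage clustering of hypothetical morphometric
# features into tissue types; not the paper's exact feature set or method.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Hypothetical morphometric descriptors for 200 urban tissue cells
# (e.g., building coverage, street width, plot compactness).
features = rng.normal(size=(200, 6))

Z = linkage(features, method="ward")            # agglomerative hierarchy
types = fcluster(Z, t=8, criterion="maxclust")  # cut into 8 tissue types
print(np.bincount(types)[1:])                   # cells assigned to each type
```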
arXiv Detail & Related papers (2021-04-30T12:47:52Z)