CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
- URL: http://arxiv.org/abs/2510.22282v1
- Date: Sat, 25 Oct 2025 12:56:46 GMT
- Title: CityRiSE: Reasoning Urban Socio-Economic Status in Vision-Language Models via Reinforcement Learning
- Authors: Tianhui Liu, Hetian Pang, Xin Zhang, Jie Feng, Yong Li, Pan Hui
- Abstract summary: We introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). Experiments demonstrate that CityRiSE, with its emergent reasoning process, significantly outperforms existing baselines. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
- Score: 17.496690548652264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Urban socio-economic sensing, which harnesses publicly available, large-scale web data such as street view and satellite imagery, is of paramount importance for achieving global sustainable development goals. With the emergence of Large Vision-Language Models (LVLMs), new opportunities have arisen to solve this task by treating it as a multi-modal perception and understanding problem. However, recent studies reveal that LVLMs still struggle with accurate and interpretable socio-economic predictions from visual data. To address these limitations and maximize the potential of LVLMs, we introduce CityRiSE, a novel framework for Reasoning urban Socio-Economic status in LVLMs through pure reinforcement learning (RL). With carefully curated multi-modal data and verifiable reward design, our approach guides the LVLM to focus on semantically meaningful visual cues, enabling structured and goal-oriented reasoning for generalist socio-economic status prediction. Experiments demonstrate that CityRiSE, with its emergent reasoning process, significantly outperforms existing baselines, improving both prediction accuracy and generalization across diverse urban contexts, particularly on unseen cities and unseen indicators. This work highlights the promise of combining RL and LVLMs for interpretable and generalist urban socio-economic sensing.
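The abstract mentions a verifiable reward design but does not spell it out. As a loose sketch of what such a reward could look like for socio-economic status prediction, the function below scores a sampled response for format compliance and for ordinal accuracy against a ground-truth indicator bin; the `<think>`/`<answer>` tag format, the five-level binning, and the partial-credit shaping are illustrative assumptions, not details from the paper.

```python
import re

def verifiable_reward(response: str, true_level: int, num_levels: int = 5) -> float:
    """Hypothetical verifiable reward for binned socio-economic prediction.

    Assumes the LVLM is prompted to emit a <think>...</think> reasoning
    trace followed by a discrete level in <answer>...</answer> tags.
    Both the tags and the ordinal-distance shaping are illustrative.
    """
    # Format check: exactly one reasoning block, then one answer block.
    match = re.fullmatch(
        r"\s*<think>.*?</think>\s*<answer>(.*?)</answer>\s*",
        response, flags=re.DOTALL,
    )
    if match is None:
        return 0.0  # malformed output earns nothing
    try:
        pred_level = int(match.group(1).strip())
    except ValueError:
        return 0.1  # well-formed tags, but no parsable level

    # Accuracy: full credit for an exact match, partial credit decaying
    # with the ordinal distance between predicted and true bins.
    distance = abs(pred_level - true_level)
    return 0.1 + 0.9 * max(0.0, 1.0 - distance / (num_levels - 1))

resp = ("<think>Dense mid-rise housing, paved roads, active storefronts."
        "</think><answer>3</answer>")
print(verifiable_reward(resp, true_level=3))  # 1.0
```

Because the target is checkable against ground-truth statistics, a scalar like this can drive a policy-gradient update (for example, a GRPO-style group comparison over sampled responses) without training a separate reward model.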
Related papers
- Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision [79.06371915084833]
We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm.
Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content.
We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform them without task-specific additions.
arXiv Detail & Related papers (2026-01-27T17:01:16Z)
- Urban-R1: Reinforced MLLMs Mitigate Geospatial Biases for Urban General Intelligence [64.36291202666212]
Urban General Intelligence (UGI) refers to AI systems that can understand and reason about complex urban environments.
Recent studies have built urban foundation models using supervised fine-tuning (SFT) of LLMs and MLLMs.
We propose Urban-R1, a reinforcement learning-based post-training framework that aligns MLLMs with the objectives of UGI.
arXiv Detail & Related papers (2025-10-18T15:59:09Z)
- Beyond Naïve Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs [57.82819770709032]
Large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting.
ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context.
CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines.
IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models.
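The abstract names these strategies without giving their templates; a minimal sketch of what IC-DP-style prompting could look like is below (the function name and wording are invented for illustration, not taken from the paper).

```python
def build_icdp_prompt(
    history: list[float],
    context: str,
    examples: list[tuple[list[float], str, float]],
) -> str:
    """Hypothetical IC-DP-style prompt: embed solved context-aided
    forecasting examples ahead of the real query. The wording is a
    guess at the strategy described in the abstract."""
    lines = ["Forecast the next value of the series, using the context."]
    for i, (hist, ctx, target) in enumerate(examples, start=1):
        lines.append(f"Example {i}: series={hist} context={ctx!r} next={target}")
    lines.append(f"Now: series={history} context={context!r} next=")
    return "\n".join(lines)

print(build_icdp_prompt(
    history=[12.0, 13.5, 15.1],
    context="a holiday weekend begins tomorrow",
    examples=[([8.0, 9.2, 10.1], "no notable events", 10.9)],
))
```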
arXiv Detail & Related papers (2025-08-13T16:02:55Z)
- CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing [18.67492140450614]
CityLens is a benchmark designed to evaluate the capabilities of large language-vision models (LLVMs) in predicting socioeconomic indicators from satellite and street view imagery.
We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment.
Our results reveal that while LLVMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators.
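As a rough illustration of how one sample in such a benchmark might be organized (only the six domains come from the abstract; every field name and value below is a hypothetical placeholder, not CityLens's actual schema):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkSample:
    """Hypothetical record layout for a city/indicator benchmark sample."""
    city: str
    domain: str               # economy, education, crime, transport, health, or environment
    indicator: str            # indicator name within the domain
    satellite_image: str      # path or URL to a satellite tile
    street_views: list[str]   # paths or URLs to street view images
    target_value: float       # ground-truth indicator value

sample = BenchmarkSample(
    city="Nairobi",
    domain="health",
    indicator="life_expectancy",
    satellite_image="tiles/nairobi_0423.png",
    street_views=["sv/nairobi_0423_a.jpg", "sv/nairobi_0423_b.jpg"],
    target_value=67.4,
)
```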
arXiv Detail & Related papers (2025-05-31T12:25:33Z)
- An Overview of Large Language Models for Statisticians [109.38601458831545]
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI).
This paper explores potential areas where statisticians can make important contributions to the development of LLMs.
We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking, and model adaptation.
arXiv Detail & Related papers (2025-02-25T03:40:36Z)
- MuseCL: Predicting Urban Socioeconomic Indicators via Multi-Semantic Contrastive Learning [13.681538916025021]
MuseCL is a framework for fine-grained urban region profiling and socioeconomic prediction.
We construct contrastive sample pairs for street view and remote sensing images, capitalizing on similarities in human mobility.
We extract semantic insights from POI texts embedded within these regions, employing a pre-trained text encoder.
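The abstract does not state MuseCL's loss. A symmetric InfoNCE objective over paired street-view and remote-sensing embeddings, the standard building block for contrastive frameworks of this kind, might look like the following sketch (not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(street_emb: torch.Tensor,
                      satellite_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Generic symmetric InfoNCE over (batch, dim) embeddings, where
    row i of each tensor comes from the same region; a standard
    contrastive objective, not MuseCL's exact loss."""
    street = F.normalize(street_emb, dim=-1)
    satellite = F.normalize(satellite_emb, dim=-1)
    logits = street @ satellite.T / temperature  # pairwise cosine similarities
    labels = torch.arange(len(street))           # positives sit on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

loss = symmetric_infonce(torch.randn(8, 128), torch.randn(8, 128))
```

Mobility-based pair construction, as described in the summary above, would then decide which street-view and satellite crops land in the same row.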
arXiv Detail & Related papers (2024-06-23T09:49:41Z)
- CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks [10.22654338686634]
Systematically evaluating large language models (LLMs) and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability.
The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data.
In this paper, we design CityBench, an interactive simulator-based evaluation platform.
arXiv Detail & Related papers (2024-06-20T02:25:07Z)
- Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
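The paper's actual prompt is not quoted in the abstract; one guess at the shape of a chain-of-thought-augmented prompt that pins the final answer to a fixed label set (a plausible way to curb off-target predictions) is:

```python
# Illustrative chain-of-thought-augmented prompt for demographic
# inference from an image; the wording is a guess, not the paper's
# actual prompt.
COT_PROMPT = (
    "Look at the image and answer the question.\n"
    "First, list the visual cues you observe (clothing, setting, objects).\n"
    "Then reason step by step about what those cues imply.\n"
    "Finally, answer ONLY with one of the allowed labels: {labels}.\n"
    "Question: {question}"
)

print(COT_PROMPT.format(
    labels="[18-24, 25-34, 35-49, 50+]",
    question="What is the person's likely age group?",
))
```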
arXiv Detail & Related papers (2024-05-24T16:26:56Z)
- UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction [26.693692853787756]
Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes.
Pretrained models, particularly those reliant on satellite imagery, face dual challenges.
arXiv Detail & Related papers (2024-03-25T14:57:18Z)
- Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention [53.896974148579346]
Large Language Models (LLMs) have achieved unprecedented breakthroughs in various natural language processing domains.
The enigmatic "black-box" nature of LLMs remains a significant challenge for interpretability, hampering transparent and accountable applications.
We propose a novel methodology anchored in sparsity-guided techniques, aiming to provide a holistic interpretation of LLMs.
arXiv Detail & Related papers (2023-12-22T19:55:58Z)
- Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
- Knowledge-infused Contrastive Learning for Urban Imagery-based Socioeconomic Prediction [13.26632316765164]
Urban imagery on the web, such as satellite and street view images, has emerged as an important source for socioeconomic prediction.
We propose a Knowledge-infused Contrastive Learning model for urban imagery-based socioeconomic prediction.
Our proposed KnowCL model applies to both satellite and street imagery, achieving both effectiveness and transferability.
arXiv Detail & Related papers (2023-02-25T14:53:17Z)