UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web
- URL: http://arxiv.org/abs/2310.18340v2
- Date: Sun, 24 Mar 2024 09:09:00 GMT
- Title: UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web
- Authors: Yibo Yan, Haomin Wen, Siru Zhong, Wei Chen, Haodong Chen, Qingsong Wen, Roger Zimmermann, Yuxuan Liang
- Abstract summary: This paper introduces the first-ever framework that integrates knowledge from the textual modality into urban imagery profiling.
It generates a detailed textual description for each satellite image using an open-source Image-to-Text LLM.
The model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning.
- Score: 37.332601383723585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Urban region profiling from web-sourced data is of utmost importance for urban planning and sustainable development. We are witnessing a rising trend of LLMs for various fields, especially in multi-modal data research such as vision-language learning, where the text modality serves as supplementary information for the image. Since the textual modality has never been introduced into the modality combinations used in urban region profiling, we aim to answer two fundamental questions in this paper: i) Can textual modality enhance urban region profiling? ii) If so, in what ways and with regard to which aspects? To answer these questions, we leverage the power of Large Language Models (LLMs) and introduce the first-ever LLM-enhanced framework that integrates the knowledge of textual modality into urban imagery profiling, named LLM-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP). Specifically, it first generates a detailed textual description for each satellite image using an open-source Image-to-Text LLM. The model is then trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning jointly with a contrastive loss and a language modeling loss. Results on predicting three urban indicators in four major Chinese metropolises demonstrate its superior performance, with an average improvement of 6.1% in R^2 over the state-of-the-art methods. Our code and the image-language dataset will be released upon acceptance notification.
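The abstract describes training on satellite-image/caption pairs with a contrastive loss combined with a language modeling loss. The sketch below illustrates such a joint objective in PyTorch; the function name, the weighting factor alpha, the temperature, and the tensor shapes are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of an UrbanCLIP-style joint objective (contrastive + language
# modeling), assuming hypothetical image/text embeddings and decoder logits.
# The actual UrbanCLIP architecture and hyperparameters may differ.
import torch
import torch.nn.functional as F

def joint_loss(image_emb, text_emb, lm_logits, caption_ids,
               temperature=0.07, alpha=1.0):
    """image_emb:   (B, D) satellite-image embeddings
       text_emb:    (B, D) embeddings of the LLM-generated captions
       lm_logits:   (B, T, V) decoder logits over caption tokens
       caption_ids: (B, T) target caption token ids
    """
    # CLIP-style InfoNCE: matched image/caption pairs lie on the diagonal.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Language-modeling loss: predict each caption token with the decoder.
    lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                         caption_ids.reshape(-1))
    return contrastive + alpha * lm
```

The symmetric image-to-text and text-to-image cross-entropy follows the standard CLIP formulation; the added language modeling term is what distinguishes the joint objective described in the abstract from contrastive-only pretraining.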
Related papers
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity [68.15983300711355]
FineCAPTION is a novel VLM that can recognize arbitrary masks as referential inputs and process high-resolution images for compositional image captioning at different levels.
We also introduce COMPOSITIONCAP, a new dataset for multi-grained region compositional image captioning that defines the task of compositional attribute-aware regional image captioning.
arXiv Detail & Related papers (2024-11-23T02:20:32Z)
- StreetviewLLM: Extracting Geographic Information Using a Chain-of-Thought Multimodal Large Language Model [12.789465279993864]
Geospatial predictions are crucial for diverse fields such as disaster management, urban planning, and public health.
We propose StreetViewLLM, a novel framework that integrates a large language model with chain-of-thought reasoning and multimodal data sources.
The model has been applied to seven global cities: Hong Kong, Tokyo, Singapore, Los Angeles, New York, London, and Paris.
arXiv Detail & Related papers (2024-11-19T05:15:19Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with richer semantics.
We have built three datasets from Pittsburgh and San Francisco at different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- mTREE: Multi-Level Text-Guided Representation End-to-End Learning for Whole Slide Image Analysis [16.472295458683696]
Multi-modal learning adeptly integrates visual and textual data, but its application to histopathology image and text analysis remains challenging.
We introduce Multi-Level Text-Guided Representation End-to-End Learning (mTREE).
This novel text-guided approach effectively captures multi-scale Whole Slide Images (WSIs) by utilizing accompanying textual pathology information.
arXiv Detail & Related papers (2024-05-28T04:47:44Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method requires no extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling [26.693692853787756]
Urban region profiling aims to learn a low-dimensional representation of a given urban area.
Pretrained models, particularly those reliant on satellite imagery, face dual challenges.
Concentrating solely on macro-level patterns from satellite data may introduce bias.
The lack of interpretability in pretrained models limits their utility in providing transparent evidence for urban planning.
arXiv Detail & Related papers (2024-03-25T14:57:18Z)
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [111.65584066987036]
InternLM-XComposer is a vision-language large model that enables advanced image-text comprehension and composition.
It can effortlessly generate coherent and contextual articles that seamlessly integrate images.
It can intelligently identify the areas in the text where images would enhance the content and automatically insert the most appropriate visual candidates.
arXiv Detail & Related papers (2023-09-26T17:58:20Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)