Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal
Urban Neighborhood Embedding
- URL: http://arxiv.org/abs/2001.11101v1
- Date: Wed, 29 Jan 2020 21:30:53 GMT
- Title: Urban2Vec: Incorporating Street View Imagery and POIs for Multi-Modal
Urban Neighborhood Embedding
- Authors: Zhecheng Wang, Haoyuan Li, Ram Rajagopal
- Abstract summary: Urban2Vec is an unsupervised multi-modal framework which incorporates both street view imagery and point-of-interest data.
We show that Urban2Vec can achieve performance better than that of baseline models and comparable to that of fully-supervised methods in downstream prediction tasks.
- Score: 8.396746290518102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding intrinsic patterns and predicting spatiotemporal
characteristics of cities require a comprehensive representation of urban
neighborhoods. Existing works relied on either inter- or intra-region
connectivities to generate neighborhood representations but failed to fully
utilize the informative yet heterogeneous data within neighborhoods. In this
work, we propose Urban2Vec, an unsupervised multi-modal framework which
incorporates both street view imagery and point-of-interest (POI) data to learn
neighborhood embeddings. Specifically, we use a convolutional neural network to
extract visual features from street view images while preserving geospatial
similarity. Furthermore, we model each POI as a bag-of-words containing its
category, rating, and review information. Analogous to document embedding in
natural language processing, we establish semantic similarity between a
neighborhood (the "document") and the words from its surrounding POIs in the
vector space. By jointly encoding visual, textual, and geospatial information into the
neighborhood representation, Urban2Vec can achieve performance better than that
of baseline models and comparable to that of fully-supervised methods in
downstream prediction tasks. Extensive experiments on three U.S. metropolitan
areas also demonstrate the model's interpretability, generalization capability,
and value in neighborhood similarity analysis.
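For a concrete picture of the street-view branch, below is a minimal PyTorch sketch of one way to train a CNN so that images from geographically nearby locations embed close together, as the abstract describes. It is not the authors' released implementation; the ResNet-18 backbone, margin, embedding size, and the random tensors standing in for sampled image triplets are all illustrative assumptions.
```python
# Minimal sketch (not the authors' code): a CNN encoder for street view images
# trained with a triplet loss so that images from geographically close locations
# embed closer together than images from distant locations.
import torch
import torch.nn as nn
import torchvision.models as models

class StreetViewEncoder(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)  # backbone choice is an assumption
        backbone.fc = nn.Identity()               # expose the 512-d pooled features
        self.backbone = backbone
        self.proj = nn.Linear(512, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj(self.backbone(x))
        return nn.functional.normalize(z, dim=-1)  # unit-norm embeddings

encoder = StreetViewEncoder()
triplet_loss = nn.TripletMarginLoss(margin=0.2)    # margin value is an assumption

# anchor/positive: two images sampled near each other; negative: a distant image.
# Random tensors stand in for real street view batches here.
anchor, positive, negative = (torch.randn(8, 3, 224, 224) for _ in range(3))
loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```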
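The POI branch's "neighborhood as document" analogy can likewise be sketched with a doc2vec-style negative-sampling objective: the neighborhood vector is pulled toward the vectors of words (POI categories, rating bins, review tokens) observed around it and pushed away from randomly sampled words. The class name and all sizes below are illustrative assumptions, not the paper's implementation.
```python
# Minimal sketch of the "neighborhood as document" analogy: a neighborhood
# embedding is trained to score observed POI "words" (categories, rating bins,
# review tokens) higher than randomly sampled negative words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborhoodPOIEmbedding(nn.Module):
    def __init__(self, n_neighborhoods: int, vocab_size: int, dim: int = 128):
        super().__init__()
        self.neigh = nn.Embedding(n_neighborhoods, dim)  # "document" vectors
        self.word = nn.Embedding(vocab_size, dim)        # POI "word" vectors

    def forward(self, neigh_ids, pos_words, neg_words):
        n = self.neigh(neigh_ids)                              # (B, D)
        pos_score = (n * self.word(pos_words)).sum(-1)         # (B,)
        neg_score = torch.bmm(self.word(neg_words),            # (B, K, D)
                              n.unsqueeze(-1)).squeeze(-1)     # (B, K)
        # negative-sampling objective, as in word2vec/doc2vec
        return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())

# Illustrative sizes only: 500 neighborhoods, a 20k-term POI vocabulary.
model = NeighborhoodPOIEmbedding(n_neighborhoods=500, vocab_size=20_000)
neigh_ids = torch.randint(0, 500, (32,))
pos_words = torch.randint(0, 20_000, (32,))    # words observed in each neighborhood's POIs
neg_words = torch.randint(0, 20_000, (32, 5))  # 5 random negatives per example
loss = model(neigh_ids, pos_words, neg_words)
loss.backward()
```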
Related papers
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Urban Scene Diffusion through Semantic Occupancy Map [49.20779809250597]
UrbanDiffusion is a 3D diffusion model conditioned on a Bird's-Eye View (BEV) map.
Our model learns the data distribution of scene-level structures within a latent space.
After training on real-world driving datasets, our model can generate a wide range of diverse urban scenes.
arXiv Detail & Related papers (2024-03-18T11:54:35Z)
- Urban Region Embedding via Multi-View Contrastive Prediction [22.164358462563996]
We form a new pipeline to learn consistent representations across varying views.
Our model outperforms state-of-the-art baseline methods significantly in urban region representation learning.
arXiv Detail & Related papers (2023-12-15T10:53:09Z)
- Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, SAR) for the study purpose of the cross-city semantic segmentation task.
Beyond a single city, we propose a high-resolution domain adaptation network, HighDAN, to improve the AI model's generalization ability across multi-city environments.
HighDAN retains the spatial topology of the studied urban scene well through a parallel high-to-low resolution fusion scheme.
arXiv Detail & Related papers (2023-09-26T23:55:39Z)
- Street-View Image Generation from a Bird's-Eye View Layout [95.36869800896335]
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Effective Urban Region Representation Learning Using Heterogeneous Urban Graph Attention Network (HUGAT) [0.0]
We propose a heterogeneous urban graph attention network (HUGAT) for learning the representations of urban regions.
In our experiments on NYC data, HUGAT outperformed all the state-of-the-art models.
arXiv Detail & Related papers (2022-02-18T04:59:20Z)
- Hex2vec -- Context-Aware Embedding H3 Hexagons with OpenStreetMap Tags [9.743315439284407]
We propose the first approach to learning vector representations of regions with respect to urban functions and land-use in a micro-region grid.
We identify a subset of OpenStreetMap tags related to major characteristics of land use, building and urban region functions, and types of water, green, or other natural areas.
The resulting vector representations showcase semantic structures of the map characteristics, similar to those found in vector-based language models; a generic sketch of this style of tag-based region embedding appears after this list.
arXiv Detail & Related papers (2021-11-01T14:22:53Z)
- Learning Neighborhood Representation from Multi-Modal Multi-Graph: Image, Text, Mobility Graph and Beyond [20.014906526266795]
We propose a novel approach to integrate multi-modal geotagged inputs as either node or edge features of a multi-graph.
Specifically, we use street view images and POI features to characterize neighborhoods (nodes) and human mobility to characterize the relationships between neighborhoods (directed edges).
The embedding we trained outperforms those trained using only unimodal data as regional inputs.
arXiv Detail & Related papers (2021-05-06T07:44:05Z)
- Region Similarity Representation Learning [94.88055458257081]
Region Similarity Representation Learning (ReSim) is a new approach to self-supervised representation learning for localization-based tasks.
ReSim learns both regional representations for localization as well as semantic image-level representations.
We show how ReSim learns representations which significantly improve the localization and classification performance compared to a competitive MoCo-v2 baseline.
arXiv Detail & Related papers (2021-03-24T00:42:37Z)
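As referenced in the Hex2vec entry above, here is a generic sketch of tag-based region embedding: regions summarized by multi-hot OpenStreetMap-style tag vectors are encoded so that spatially adjacent regions end up with similar vectors, via a negative-sampling-style loss. The encoder, tag vocabulary size, and neighbor/negative sampling below are assumptions; this is not the Hex2vec implementation.
```python
# Generic sketch (not the Hex2vec implementation): encode regions described by
# multi-hot OpenStreetMap-style tag vectors so that spatially adjacent regions
# receive similar embeddings, using a negative-sampling-style loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionTagEncoder(nn.Module):
    def __init__(self, n_tags: int, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_tags, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, tags: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(tags), dim=-1)

n_tags = 300                                          # assumed tag vocabulary size
encoder = RegionTagEncoder(n_tags)

# Multi-hot tag vectors for a batch of regions, their spatial neighbors, and
# randomly drawn distant regions as negatives (random data for illustration).
region   = torch.randint(0, 2, (16, n_tags)).float()
neighbor = torch.randint(0, 2, (16, n_tags)).float()
negative = torch.randint(0, 2, (16, n_tags)).float()

z, z_pos, z_neg = encoder(region), encoder(neighbor), encoder(negative)
loss = -(F.logsigmoid((z * z_pos).sum(-1)).mean()
         + F.logsigmoid(-(z * z_neg).sum(-1)).mean())
loss.backward()
```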
This list is automatically generated from the titles and abstracts of the papers on this site.