Cross-Modal Learning of Housing Quality in Amsterdam
- URL: http://arxiv.org/abs/2403.08915v1
- Date: Wed, 13 Mar 2024 19:11:58 GMT
- Title: Cross-Modal Learning of Housing Quality in Amsterdam
- Authors: Alex Levering, Diego Marcos, Devis Tuia
- Abstract summary: We test data and models for the recognition of housing quality in Amsterdam from ground-level and aerial imagery.
For ground-level images we compare Google StreetView (GSV) to Flickr images.
Our results show that GSV predicts the most accurate building quality scores, approximately 30% better than using only aerial images.
- Score: 7.316396716394438
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In our research we test data and models for the recognition of housing quality in the city of Amsterdam from ground-level and aerial imagery. For ground-level images we compare Google StreetView (GSV) to Flickr images. Our results show that GSV predicts the most accurate building quality scores, approximately 30% better than using only aerial images. However, we find that through careful filtering and by using the right pre-trained model, Flickr image features combined with aerial image features are able to halve the performance gap to GSV features from 30% to 15%. Our results indicate that there are viable alternatives to GSV for liveability factor prediction, which is encouraging as GSV images are more difficult to acquire and not always available.
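As a concrete reading of the fusion setup the abstract describes, here is a minimal late-fusion sketch: one encoder per modality, concatenated features, and a regression head producing the quality score. The ResNet-18 backbones, feature sizes, and scalar output are illustrative assumptions, not the paper's reported architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class LateFusionQualityModel(nn.Module):
    """Late fusion of ground-level and aerial image features into one
    building-quality score. Backbones and head sizes are illustrative."""

    def __init__(self):
        super().__init__()
        self.ground_encoder = resnet18(weights=None)
        self.aerial_encoder = resnet18(weights=None)
        self.ground_encoder.fc = nn.Identity()  # expose 512-d features
        self.aerial_encoder.fc = nn.Identity()
        self.head = nn.Sequential(
            nn.Linear(512 + 512, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # scalar quality score
        )

    def forward(self, ground_img, aerial_img):
        g = self.ground_encoder(ground_img)          # (B, 512)
        a = self.aerial_encoder(aerial_img)          # (B, 512)
        return self.head(torch.cat([g, a], dim=1))   # (B, 1)

model = LateFusionQualityModel()
score = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(score.shape)  # torch.Size([2, 1])
```

Feeding only one encoder's stream (and zeroing or dropping the other) would reproduce the single-modality baselines the abstract compares against.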
Related papers
- Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images.
However, the ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering.
We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
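A plausible sketch of the retrieval stage such a visual-RAG framework needs: embed every candidate image and the question in a shared space and keep only the top-k images for the downstream multimodal model. The CLIP checkpoint and k are assumptions; MIRAGE's actual retriever may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_top_k(question: str, images: list, k: int = 5) -> list:
    """Return the k images most similar to the question in CLIP space."""
    with torch.no_grad():
        img_emb = model.get_image_features(
            **processor(images=images, return_tensors="pt"))
        txt_emb = model.get_text_features(
            **processor(text=[question], return_tensors="pt", padding=True))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(-1)        # cosine similarities
    top = scores.topk(min(k, len(images))).indices
    return [images[i] for i in top]
```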
- PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z)
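A minimal sketch of what a parameter-free image-to-video extension can look like: run the frozen image encoder on each frame and pool the resulting visual tokens over time, adding no new weights. The mean pooling and token shapes here are assumptions rather than PLLaVA's exact design.

```python
import torch

def pool_frame_tokens(frame_tokens: torch.Tensor) -> torch.Tensor:
    """frame_tokens: (num_frames, num_tokens, dim) from a frozen image
    encoder run on each frame. Mean over time adds no new parameters."""
    return frame_tokens.mean(dim=0)  # (num_tokens, dim)

tokens = torch.randn(16, 576, 1024)      # 16 frames of LLaVA-style tokens
print(pool_frame_tokens(tokens).shape)   # torch.Size([576, 1024])
```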
- Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp.
arXiv Detail & Related papers (2024-03-04T18:55:30Z)
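A hedged sketch of training-free contrastive guidance in the spirit of the entry above: compare the VLM's next-token logits on the original image against a copy with the prompted region blacked out, and amplify the difference, classifier-free-guidance style. The exact guidance formula and masking are assumptions, not CRG's published form.

```python
import torch

def black_out(image: torch.Tensor, box) -> torch.Tensor:
    """image: (C, H, W); box: (x1, y1, x2, y2) region annotation."""
    x1, y1, x2, y2 = box
    masked = image.clone()
    masked[:, y1:y2, x1:x2] = 0.0  # hide the prompted region
    return masked

def guided_logits(logits_with_region: torch.Tensor,
                  logits_region_masked: torch.Tensor,
                  alpha: float = 1.0) -> torch.Tensor:
    """Boost tokens whose probability depends on the visible region."""
    return logits_with_region + alpha * (logits_with_region - logits_region_masked)
```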
- A citizen science toolkit to collect human perceptions of urban environments using open street view images [0.20999222360659603]
Street View Imagery (SVI) is a valuable data source for studies such as environmental assessments, green space identification, and land cover classification.
Open SVI datasets are readily available from less restrictive sources, such as Mapillary.
We present an efficient method for automated downloading, processing, cropping, and filtering open SVI.
arXiv Detail & Related papers (2024-02-29T22:58:13Z)
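For the download step, a sketch against Mapillary's Graph API as I understand it; the endpoint and field names should be checked against the current docs, and the toolkit's own processing, cropping, and filtering stages are not shown.

```python
import requests

def fetch_image_urls(bbox, token, limit=100):
    """bbox: (min_lon, min_lat, max_lon, max_lat); token: API access token."""
    resp = requests.get(
        "https://graph.mapillary.com/images",  # assumed v4 endpoint
        params={
            "access_token": token,
            "fields": "id,thumb_1024_url,captured_at",
            "bbox": ",".join(map(str, bbox)),
            "limit": limit,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [img["thumb_1024_url"] for img in resp.json().get("data", [])]
```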
- Raising the Bar of AI-generated Image Detection with CLIP [50.345365081177555]
The aim of this work is to explore the potential of pre-trained vision-language models (VLMs) for universal detection of AI-generated images.
We develop a lightweight detection strategy based on CLIP features and study its performance in a wide variety of challenging scenarios.
arXiv Detail & Related papers (2023-11-30T21:11:20Z)
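One lightweight strategy consistent with the entry above: freeze CLIP, extract image features, and fit a linear probe to separate real from generated images. The checkpoint and logistic-regression classifier are illustrative assumptions, not necessarily the paper's exact setup.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_features(images):
    """L2-normalized CLIP image embeddings as a NumPy array."""
    with torch.no_grad():
        feats = model.get_image_features(
            **processor(images=images, return_tensors="pt"))
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

# train_images: list of PIL images; labels: 1 = AI-generated, 0 = real
# detector = LogisticRegression(max_iter=1000).fit(clip_features(train_images), labels)
# p_fake = detector.predict_proba(clip_features(test_images))[:, 1]
```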
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
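A sketch of the spherical projection that turns a LiDAR point cloud into the 2D range image a projection-based method like RangeViT consumes. The image size and vertical field of view are illustrative values, not RangeViT's settings.

```python
import numpy as np

def to_range_image(points, H=32, W=1024, fov_up=10.0, fov_down=-30.0):
    """points: (N, 3) xyz in meters. Returns an (H, W) range image."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                                   # azimuth, [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))
    up, down = np.radians(fov_up), np.radians(fov_down)
    u = (((yaw + np.pi) / (2 * np.pi)) * W).astype(int) % W  # column index
    v = np.clip(((up - pitch) / (up - down) * H).astype(int), 0, H - 1)
    img = np.full((H, W), -1.0)   # -1 marks pixels with no return
    img[v, u] = r                 # later points overwrite earlier ones
    return img
```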
- Danish Airs and Grounds: A Dataset for Aerial-to-Street-Level Place Recognition and Localization [9.834635805575584]
We contribute the Danish Airs and Grounds dataset, a large collection of street-level and aerial images targeting such cases.
The dataset is larger and more diverse than current publicly available data, including more than 50 km of road in urban, suburban and rural areas.
We propose a map-to-image re-localization pipeline that first estimates a dense 3D reconstruction from the aerial images and then matches query street-level images to street-level renderings of the 3D model.
arXiv Detail & Related papers (2022-02-03T19:58:09Z)
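A sketch of the matching step such a map-to-image pipeline needs: rank street-level renderings of the aerial 3D model by global-descriptor similarity to the query street image. The frozen ImageNet ResNet descriptor is an assumption, not the paper's choice.

```python
import torch
from torchvision.models import resnet18

encoder = resnet18(weights="IMAGENET1K_V1")
encoder.fc = torch.nn.Identity()  # use the 512-d global feature
encoder.eval()

def rank_renderings(query: torch.Tensor, renderings: torch.Tensor):
    """query: (3, H, W); renderings: (M, 3, H, W). Returns indices, best first."""
    with torch.no_grad():
        q = encoder(query.unsqueeze(0))
        r = encoder(renderings)
    q = q / q.norm(dim=-1, keepdim=True)
    r = r / r.norm(dim=-1, keepdim=True)
    return torch.argsort((r @ q.T).squeeze(-1), descending=True)
```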
- With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations [87.72779294717267]
Using the nearest neighbor as a positive in contrastive losses significantly improves performance on ImageNet classification.
We demonstrate empirically that our method is less reliant on complex data augmentations.
arXiv Detail & Related papers (2021-04-29T17:56:08Z)
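A compact sketch of the nearest-neighbor trick the entry above describes: swap each anchor's positive for its nearest neighbor in a support queue before the InfoNCE loss. Queue maintenance and loss symmetrization are omitted, and the temperature is an assumed value.

```python
import torch
import torch.nn.functional as F

def nnclr_loss(z1, z2, support, temperature=0.1):
    """z1, z2: (B, D) embeddings of two views; support: (Q, D) queue."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    support = F.normalize(support, dim=1)
    nn_idx = (z1 @ support.T).argmax(dim=1)   # nearest neighbor per anchor
    positives = support[nn_idx]               # NN replaces the usual positive
    logits = positives @ z2.T / temperature   # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)
```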
- Aerial Imagery Pixel-level Segmentation [0.4079265319364249]
We bridge the performance gap between popular datasets and aerial imagery data.
Our work, using the state-of-the-art DeepLabv3+ architecture with an Xception65 backbone, achieves a mean IoU of 70% on the DroneDeploy validation set.
arXiv Detail & Related papers (2020-12-03T16:09:09Z)
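For reference, a minimal implementation of the mean-IoU metric the entry above reports, assuming integer-labeled prediction and ground-truth masks of equal shape:

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """pred, target: integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:               # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```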
- Bounding Boxes Are All We Need: Street View Image Classification via Context Encoding of Detected Buildings [7.1235778791928634]
"Detector-Encoder-Classifier" framework is proposed.
"BEAUTY" dataset can be used not only for street view image classification, but also for multi-class building detection.
arXiv Detail & Related papers (2020-10-03T08:49:51Z)
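A hedged sketch of a detector-encoder-classifier flow like the one named above: detect buildings, encode the set of detections (class plus box geometry) into a permutation-invariant context vector, and classify the street view image from that context. The mean-pooling encoder and feature layout are assumptions; the paper's encoder may be more elaborate.

```python
import torch
import torch.nn as nn

class BoxContextClassifier(nn.Module):
    """Encode detected-building boxes into one context vector, then
    classify the whole street view image from that context."""

    def __init__(self, num_building_types=10, num_scene_classes=4, dim=64):
        super().__init__()
        # per-box input: one-hot building type + normalized (x1, y1, x2, y2)
        self.embed = nn.Linear(num_building_types + 4, dim)
        self.classify = nn.Linear(dim, num_scene_classes)

    def forward(self, box_feats: torch.Tensor) -> torch.Tensor:
        """box_feats: (num_boxes, num_building_types + 4) from a detector."""
        context = self.embed(box_feats).mean(dim=0)  # permutation-invariant
        return self.classify(context)
```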
This list is automatically generated from the titles and abstracts of the papers on this site.