Towards Large-scale Building Attribute Mapping using Crowdsourced
Images: Scene Text Recognition on Flickr and Problems to be Solved
- URL: http://arxiv.org/abs/2309.08042v1
- Date: Thu, 14 Sep 2023 22:02:14 GMT
- Title: Towards Large-scale Building Attribute Mapping using Crowdsourced
Images: Scene Text Recognition on Flickr and Problems to be Solved
- Authors: Yao Sun, Anna Kruspe, Liqiu Meng, Yifan Tian, Eike J Hoffmann, Stefan
Auer, Xiao Xiang Zhu
- Abstract summary: This work addresses the challenges in applying Scene Text Recognition in crowdsourced street-view images for building attribute mapping.
A Berlin Flickr dataset is created, and pre-trained STR models are used for text detection and recognition.
We examined the correlation between STR results and building functions, and analysed instances where texts were recognized on residential buildings but not on commercial ones.
- Score: 16.272425120319095
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Crowdsourced platforms provide huge amounts of street-view images that
contain valuable building information. This work addresses the challenges in
applying Scene Text Recognition (STR) in crowdsourced street-view images for
building attribute mapping. We use Flickr images, particularly examining texts
on building facades. A Berlin Flickr dataset is created, and pre-trained STR
models are used for text detection and recognition. Manual checking on a subset
of STR-recognized images demonstrates high accuracy. We examined the
correlation between STR results and building functions, and analysed instances
where texts were recognized on residential buildings but not on commercial
ones. Further investigation revealed significant challenges associated with
this task, including small text regions in street-view images, the absence of
ground truth labels, and mismatches between buildings in Flickr images and building
footprints in OpenStreetMap (OSM). To develop city-wide mapping beyond urban
hotspot locations, we suggest differentiating the scenarios where STR proves
effective while developing appropriate algorithms or bringing in additional
data for handling other cases. Furthermore, interdisciplinary collaboration
should be undertaken to understand the motivation behind building photography
and labeling. The STR-on-Flickr results are publicly available at
https://github.com/ya0-sun/STR-Berlin.
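The comparison between STR results and building functions described in the abstract can be illustrated with a minimal sketch. It assumes each image has already been matched to a building function and run through an STR model. The function name `text_rate_by_function`, the record layout, and the toy values are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict


def text_rate_by_function(records):
    """Return {building_function: share of images with recognized text}.

    `records` is an iterable of (building_function, recognized_texts) pairs,
    where recognized_texts is the list of strings an STR model returned.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    with_text = defaultdict(int)
    for function, texts in records:
        totals[function] += 1
        if texts:  # at least one string was recognized in this image
            with_text[function] += 1
    return {f: with_text[f] / totals[f] for f in totals}


# Hypothetical toy records, not real data from the Berlin Flickr dataset.
records = [
    ("commercial", ["APOTHEKE"]),
    ("commercial", []),
    ("residential", ["HAUS 12"]),
    ("residential", []),
    ("residential", []),
    ("residential", []),
]

rates = text_rate_by_function(records)
for function, rate in sorted(rates.items()):
    print(f"{function}: {rate:.2f}")
```

A per-function rate like this only shows whether text is recognized more often for some building functions than for others. It does not address the mismatch between photographed buildings and OSM footprints that the abstract lists as an open problem.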
Related papers
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
ARMADA is a novel multimodal data generation framework that extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- RSBuilding: Towards General Remote Sensing Image Building Extraction and Change Detection with Foundation Model [22.56227565913003]
We propose a comprehensive remote sensing image building model, termed RSBuilding, developed from the perspective of the foundation model.
RSBuilding is designed to enhance cross-scene generalization and task understanding.
Our model was trained on a dataset comprising up to 245,000 images and validated on multiple building extraction and change detection datasets.
arXiv Detail & Related papers (2024-03-12T11:51:59Z)
- There is a Time and Place for Reasoning Beyond the Image [63.96498435923328]
Images are often more significant than only the pixels to human eyes, as we can infer, associate, and reason with contextual information from other sources to establish a more complete picture.
We introduce TARA: a dataset with 16k images with their associated news, time and location automatically extracted from New York Times (NYT), and an additional 61k examples as distant supervision from WIT.
We show that there exists a 70% gap between a state-of-the-art joint model and human performance, which is slightly filled by our proposed model that uses segment-wise reasoning, motivating higher-level vision-language joint models that
arXiv Detail & Related papers (2022-03-01T21:52:08Z)
- Using Social Media Images for Building Function Classification [12.99941371793082]
This study proposes a filtering pipeline to yield high quality, ground level imagery from large social media image datasets.
We analyze our method on a culturally diverse social media dataset from Flickr with more than 28 million images from 42 cities around the world.
Fine-tuned state-of-the-art architectures yield F1-scores of up to 0.51 on the filtered images.
arXiv Detail & Related papers (2022-02-15T11:05:10Z)
- Learning Co-segmentation by Segment Swapping for Retrieval and Discovery [67.6609943904996]
The goal of this work is to efficiently identify visually similar patterns from a pair of images.
We generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image.
We show our approach provides clear improvements for artwork details retrieval on the Brueghel dataset.
arXiv Detail & Related papers (2021-10-29T16:51:16Z)
- TMBuD: A dataset for urban scene building detection [0.0]
This paper introduces a dataset, TMBuD, that is better suited to image processing of human-made structures in urban scenes.
The proposed dataset will allow proper evaluation of salient edges and semantic segmentation of images focusing on the street view perspective of buildings.
The dataset features 160 images of buildings from Timisoara, Romania, with a resolution of 768 x 1024 pixels each.
arXiv Detail & Related papers (2021-10-27T17:08:11Z)
- Mapping Vulnerable Populations with AI [23.732584273099054]
Building functions shall be retrieved by parsing social media data, such as tweets, as well as ground-based imagery.
Building maps augmented with those additional attributes make it possible to derive more accurate population density maps.
arXiv Detail & Related papers (2021-07-29T15:52:11Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
- Bounding Boxes Are All We Need: Street View Image Classification via Context Encoding of Detected Buildings [7.1235778791928634]
A "Detector-Encoder-Classifier" framework is proposed.
The "BEAUTY" dataset can be used not only for street view image classification, but also for multi-class building detection.
arXiv Detail & Related papers (2020-10-03T08:49:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.