Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper)
- URL: http://arxiv.org/abs/2408.00932v1
- Date: Thu, 1 Aug 2024 21:50:23 GMT
- Title: Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models (Vision Paper)
- Authors: Bin Han, Yiwei Yang, Anat Caspi, Bill Howe
- Abstract summary: Equitable urban transportation applications require high-fidelity digital representations of the built environment.
We consider vision language models as a mechanism for annotating diverse urban features from satellite images.
We demonstrate proof-of-concept combining a state-of-the-art vision language model and variants of a prompting strategy.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Equitable urban transportation applications require high-fidelity digital representations of the built environment: not just streets and sidewalks, but bike lanes, marked and unmarked crossings, curb ramps and cuts, obstructions, traffic signals, signage, street markings, potholes, and more. Direct inspections and manual annotations are prohibitively expensive at scale. Conventional machine learning methods require substantial annotated training data for adequate performance. In this paper, we consider vision language models as a mechanism for annotating diverse urban features from satellite images, reducing the dependence on human annotation to produce large training sets. While these models have achieved impressive results in describing common objects in images captured from a human perspective, their training sets are less likely to include strong signals for esoteric features in the built environment, and their performance in these settings is therefore unclear. We demonstrate proof-of-concept combining a state-of-the-art vision language model and variants of a prompting strategy that asks the model to consider segmented elements independently of the original image. Experiments on two urban features -- stop lines and raised tables -- show that while direct zero-shot prompting correctly annotates nearly zero images, the pre-segmentation strategies can annotate images with near 40% intersection-over-union accuracy. We describe how these results inform a new research agenda in automatic annotation of the built environment to improve equity, accessibility, and safety at broad scale and in diverse environments.
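To make the pre-segmentation strategy concrete, here is a minimal Python sketch: query the VLM about each segmented element in isolation, take the union of segments labeled positive as the predicted mask, and score it against ground truth with intersection-over-union. The `ask_vlm` wrapper and the blank-out masking step are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch of pre-segmentation prompting: ask the VLM about each segment
# independently, then evaluate the union of positive segments with IoU.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks of the same shape."""
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum()) / union if union else 0.0

def annotate_tile(tile: np.ndarray, segments: list[np.ndarray],
                  ask_vlm, feature: str = "stop line") -> np.ndarray:
    """Return the union of segments the VLM labels as the target feature.

    `ask_vlm(image, prompt) -> str` is a hypothetical wrapper around any
    chat-capable vision-language model; `segments` are boolean masks from
    any off-the-shelf segmenter (e.g., a SAM-style model).
    """
    pred = np.zeros(tile.shape[:2], dtype=bool)
    for mask in segments:
        crop = np.where(mask[..., None], tile, 0)  # blank out the rest of the tile
        reply = ask_vlm(crop, f"Does this element show a {feature}? Answer yes or no.")
        if reply.strip().lower().startswith("yes"):
            pred |= mask
    return pred
```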
Related papers
- Exploiting Contextual Uncertainty of Visual Data for Efficient Training of Deep Models
We introduce the notion of contextual diversity for active learning (CDAL).
We propose a data repair algorithm to curate contextually fair data to reduce model bias.
We are working on developing an image retrieval system for wildlife camera trap images and a reliable warning system for poor-quality rural roads.
arXiv Detail & Related papers (2024-11-04T09:43:33Z)
- AIDOVECL: AI-generated Dataset of Outpainted Vehicles for Eye-level Classification and Localization
This work introduces a novel approach that leverages outpainting to address the problem of annotated data scarcity.
We apply this technique to vehicle detection, where the scarcity of annotated data is a particularly acute challenge in autonomous driving, urban planning, and environmental monitoring.
Augmentation with outpainted vehicles improves overall performance metrics by up to 8% and enhances prediction of underrepresented classes by up to 20%.
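A hedged sketch of how such outpainting augmentation can be implemented with an off-the-shelf inpainting model from the `diffusers` library: keep a labeled vehicle crop fixed and let the model synthesize the street scene around it. The checkpoint, prompt, and canvas size are assumptions, not the AIDOVECL configuration.

```python
# Illustrative outpainting augmentation: the vehicle pixels are preserved and
# the masked surroundings are synthesized by an inpainting diffusion model.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def outpaint_vehicle(vehicle: Image.Image, size: int = 512) -> Image.Image:
    """Place the vehicle at the canvas center and outpaint its surroundings."""
    canvas = Image.new("RGB", (size, size))
    mask = Image.new("L", (size, size), 255)  # white = regions to synthesize
    x, y = (size - vehicle.width) // 2, (size - vehicle.height) // 2
    canvas.paste(vehicle, (x, y))
    mask.paste(0, (x, y, x + vehicle.width, y + vehicle.height))  # keep the vehicle
    return pipe(prompt="an eye-level photo of an urban street",
                image=canvas, mask_image=mask).images[0]
```

Note that the paste coordinates double as a free localization label, which is what makes this kind of augmentation useful for detection as well as classification.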
arXiv Detail & Related papers (2024-10-31T16:46:23Z)
- Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction
We present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z)
- StreetSurfaceVis: a dataset of crowdsourced street-level imagery annotated by road surface type and quality
We introduce StreetSurfaceVis, a novel dataset comprising 9,122 street-level images mostly from Germany.
We aim to enable robust models that maintain high accuracy across diverse image sources.
arXiv Detail & Related papers (2024-07-31T08:59:33Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization
We propose an end-to-end framework named AddressCLIP to solve the Image Address Localization (IAL) problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
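The name suggests a CLIP-style recipe: contrastively align image embeddings with address-text embeddings. Below is a generic symmetric InfoNCE loss for matched (image, address) pairs; it sketches the general technique, not AddressCLIP's exact objective.

```python
# Generic CLIP-style contrastive objective over a batch of matched
# (image embedding, address-text embedding) pairs.
import torch
import torch.nn.functional as F

def image_address_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature               # (B, B) similarities
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2  # both directions
```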
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Information Theoretic Text-to-Image Alignment
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate mutual information (MI).
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Street-View Image Generation from a Bird's-Eye View Layout
Bird's-Eye View (BEV) Perception has received increasing attention in recent years.
Data-driven simulation for autonomous driving has been a focal point of recent research.
We propose BEVGen, a conditional generative model that synthesizes realistic and spatially consistent surrounding images.
arXiv Detail & Related papers (2023-01-11T18:39:34Z)
- Mitigating Urban-Rural Disparities in Contrastive Representation Learning with Satellite Imagery
We consider the risk of urban-rural disparities in identification of land-cover features.
We propose fair dense representation with contrastive learning (FairDCL) as a method for de-biasing the multi-level latent space of convolutional neural network models.
The obtained image representation mitigates downstream urban-rural prediction disparities and outperforms state-of-the-art baselines on real-world satellite images.
arXiv Detail & Related papers (2022-11-16T04:59:46Z)
- Imposing Consistency for Optical Flow Estimation
Imposing consistency through proxy tasks has been shown to enhance data-driven learning.
This paper introduces novel and effective consistency strategies for optical flow estimation.
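One widely used consistency constraint in optical flow is forward-backward consistency: a pixel's forward flow, plus the backward flow sampled where it lands, should roughly cancel. The sketch below computes a consistency mask this way; it illustrates the general idea, not this paper's specific strategies.

```python
# Forward-backward consistency check: round-trip displacement should be ~0
# for non-occluded, well-estimated pixels.
import numpy as np

def fb_consistency_mask(fwd: np.ndarray, bwd: np.ndarray, tol: float = 1.5) -> np.ndarray:
    """fwd, bwd: (H, W, 2) flow fields in (dx, dy). Returns a boolean mask."""
    h, w = fwd.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor lookup of where each pixel lands under the forward flow.
    xt = np.clip(np.round(xs + fwd[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + fwd[..., 1]).astype(int), 0, h - 1)
    round_trip = fwd + bwd[yt, xt]  # ~0 where the two flows are consistent
    return np.linalg.norm(round_trip, axis=-1) < tol
```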
arXiv Detail & Related papers (2022-04-14T22:58:30Z)
- Road images augmentation with synthetic traffic signs using neural networks
We consider the task of rare traffic sign detection and classification.
We aim to solve that problem by using synthetic training data.
We propose three methods for making synthetic signs consistent with a scene in appearance.
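For appearance consistency, one classical baseline is Poisson blending, which adapts the pasted sign's colors and gradients to the surrounding scene. A minimal OpenCV sketch follows; it is illustrative only, not one of the paper's proposed (learned) methods.

```python
# Poisson blending via OpenCV's seamlessClone: a simple way to make a pasted
# synthetic sign look photometrically consistent with the scene.
import cv2
import numpy as np

def insert_sign(scene: np.ndarray, sign: np.ndarray, center: tuple[int, int]) -> np.ndarray:
    """Blend a BGR sign patch into a BGR scene image at the (x, y) `center`."""
    mask = 255 * np.ones(sign.shape[:2], dtype=np.uint8)  # blend the whole patch
    return cv2.seamlessClone(sign, scene, mask, center, cv2.NORMAL_CLONE)
```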
arXiv Detail & Related papers (2021-01-13T08:10:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.