ELEV-VISION-SAM: Integrated Vision Language and Foundation Model for Automated Estimation of Building Lowest Floor Elevation
- URL: http://arxiv.org/abs/2404.12606v1
- Date: Fri, 19 Apr 2024 03:16:08 GMT
- Title: ELEV-VISION-SAM: Integrated Vision Language and Foundation Model for Automated Estimation of Building Lowest Floor Elevation
- Authors: Yu-Hsuan Ho, Longxiang Li, Ali Mostafavi
- Abstract summary: This study integrates the Segment Anything model, a segmentation foundation model, with vision language models to conduct text-prompt image segmentation on street view images for LFE estimation.
Our proposed method significantly enhances the availability of LFE estimation to almost all properties in which the front door is visible in the street view image.
- Score: 1.2070884166650049
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Street view imagery, aided by advancements in image quality and accessibility, has emerged as a valuable resource for urban analytics research. Recent studies have explored its potential for estimating lowest floor elevation (LFE), offering a scalable alternative to traditional on-site measurements, which are crucial for assessing properties' flood risk and damage extent. While existing methods rely on object detection, the introduction of image segmentation has broadened street view images' utility for LFE estimation, although challenges remain in segmentation quality and in the capability to distinguish front doors from other doors. To address these challenges, this study integrates the Segment Anything Model, a segmentation foundation model, with vision language models to conduct text-prompt image segmentation on street view images for LFE estimation. By evaluating various vision language models, integration methods, and text prompts, we identify the most suitable model for street view image analytics and LFE estimation tasks, thereby improving the availability of the current segmentation-based LFE estimation model from 33% to 56% of properties. Remarkably, our proposed method extends the availability of LFE estimation to almost all properties in which the front door is visible in the street view image. The findings also present the first baseline and comparison of various vision models for street view image-based LFE estimation. The model and findings not only advance street view image segmentation for urban analytics but also offer a novel approach to image segmentation for other civil engineering and infrastructure analytics tasks.
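The pipeline described in the abstract pairs a text-prompted vision language detector with SAM's promptable mask decoder. The following is a minimal sketch of that integration pattern, assuming OWL-ViT as the open-vocabulary detector and the official segment-anything package; since the paper evaluates several vision language models and integration methods, the model choices, checkpoint path, and threshold below are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch: text-prompt front-door segmentation on a street view image.
# OWL-ViT stands in for the VLM; the paper compares several such models.
import numpy as np
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection
from segment_anything import sam_model_registry, SamPredictor

image = Image.open("street_view.jpg").convert("RGB")  # hypothetical input

# 1. Text-prompt detection: locate the front door with a vision language model.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
inputs = processor(text=[["a front door"]], images=image, return_tensors="pt")
with torch.no_grad():
    outputs = detector(**inputs)
target_size = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_size)[0]
best_box = detections["boxes"][detections["scores"].argmax()].numpy()

# 2. Promptable segmentation: pass the detected box to SAM for a clean mask.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # local weights
predictor = SamPredictor(sam)
predictor.set_image(np.array(image))
masks, _, _ = predictor.predict(box=best_box, multimask_output=False)
door_mask = masks[0]  # boolean HxW mask of the front door
```

In such a pipeline, the bottom edge of the resulting door mask is the geometric cue that downstream steps would convert into an elevation estimate.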
Related papers
- Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models [27.848962405476108]
A new pipeline constructs a reasoning-oriented geo-localization dataset, MP16-Reason, using diverse social media images.
We introduce GLOBE, Group-relative policy optimization for Locatability assessment and optimized visual-clue reasoning.
Results demonstrate that GLOBE outperforms state-of-the-art open-source LVLMs on geo-localization tasks.
arXiv Detail & Related papers (2025-06-17T16:07:58Z)
- AetherVision-Bench: An Open-Vocabulary RGB-Infrared Benchmark for Multi-Angle Segmentation across Aerial and Ground Perspectives [2.0293118701268154]
Embodied AI systems are transforming autonomous navigation for ground vehicles and drones by enhancing their perception abilities.
We present AetherVision-Bench, a benchmark for multi-angle segmentation across aerial and ground perspectives.
We assess state-of-the-art OVSS models on the proposed benchmark and investigate the key factors that impact the performance of zero-shot transfer models.
arXiv Detail & Related papers (2025-06-04T08:41:19Z)
- Image-based Visibility Analysis Replacing Line-of-Sight Simulation: An Urban Landmark Perspective [2.3315115235829342]
The study challenges traditional LoS-based approaches by introducing a new, image-based visibility analysis method.
In the first case study, the method proves its reliability in detecting the visibility of six tall landmark constructions in global cities, with an overall accuracy of 87%.
In the second case, the proposed visibility graph uncovers the form and strength of connections for multiple landmarks along the River Thames in London.
arXiv Detail & Related papers (2025-05-17T03:41:45Z)
- V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations [1.7971686967440696]
V$^2$R-Bench is a benchmark framework for evaluating Visual Variation Robustness of LVLMs.
We show that advanced models that excel at complex vision-language tasks significantly underperform on simple tasks such as object recognition.
These vulnerabilities stem from error accumulation in the pipeline architecture and inadequate multimodal alignment.
arXiv Detail & Related papers (2025-04-23T14:01:32Z)
- Vision-Language Model for Object Detection and Segmentation: A Review and Evaluation [38.20492321295552]
Vision-Language Models (VLMs) have gained widespread adoption in Open-Vocabulary (OV) object detection and segmentation tasks.
Although they have shown promise on OV-related tasks, their effectiveness in conventional vision tasks has thus far been unevaluated.
arXiv Detail & Related papers (2025-04-13T08:28:13Z)
- OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery [4.33299613844962]
Building properties play a crucial role in spatial data infrastructures, supporting applications such as energy simulation, risk assessment, and environmental modeling.
Recent advances have enabled the extraction and tagging of objective building attributes using remote sensing and street-level imagery.
This study bridges these gaps by introducing OpenFACADES, an open framework that leverages crowdsourced data to enrich building profiles.
arXiv Detail & Related papers (2025-04-01T08:20:13Z)
- VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation [3.695317701129061]
We introduce Vision and Language Embeddings for Explainable WSI Representation (VLEER), a novel method designed to leverage vision features for WSI representation.
VLEER offers the unique advantage of interpretability, enabling direct human-readable insights into the results.
arXiv Detail & Related papers (2025-02-28T08:49:03Z)
- Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models [58.936893810674896]
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems.
We introduce a multimodal large language model framework for FAS, termed Interpretable Face Anti-Spoofing (I-FAS).
We propose a Spoof-aware Captioning and Filtering (SCF) strategy to generate high-quality captions for FAS images.
arXiv Detail & Related papers (2025-01-03T09:25:04Z)
- Evaluation and Comparison of Visual Language Models for Transportation Engineering Problems [16.49637074299509]
We explored state-of-the-art vision language models (VLMs) for vision-based transportation engineering tasks.
The image classification tasks involve congestion detection and crack identification, whereas the object detection task targets helmet violations.
We applied open-source models such as CLIP, BLIP, OWL-ViT, and Llava-Next, as well as the closed-source GPT-4o, to evaluate the performance of these VLMs.
arXiv Detail & Related papers (2024-09-03T20:24:37Z)
- Examining the Commitments and Difficulties Inherent in Multimodal Foundation Models for Street View Imagery [25.420749355424377]
This paper investigates the capabilities of ChatGPT-4V and Gemini Pro for Street View Imagery, Built Environment, and Interior.
The results reveal proficiency in length measurement, style analysis, question answering, and basic image understanding.
Overall, the findings demonstrate foundational multimodal intelligence, emphasizing the potential of FMs to drive forward interdisciplinary applications.
arXiv Detail & Related papers (2024-08-23T03:45:31Z)
- Towards Latent Masked Image Modeling for Self-Supervised Visual Representation Learning [18.424840375721303]
Masked Image Modeling (MIM) has emerged as a promising method for deriving visual representations from unlabeled image data by predicting missing pixels from masked portions of images.
A promising yet unrealized framework is learning representations through masked reconstruction in latent space, combining the locality of MIM with high-level targets.
This study is among the first to thoroughly analyze and address the challenges of such a framework, which we refer to as Latent MIM.
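As a rough illustration of what masked reconstruction in latent space involves, here is a minimal PyTorch sketch: token positions are masked, and a predictor regresses their detached latent targets. The sizes, modules, and loss are illustrative assumptions; practical systems typically use a separate or EMA target encoder to avoid the representation collapse that this line of work has to contend with.

```python
# Minimal sketch of the Latent MIM idea: reconstruct masked *latent* tokens
# rather than pixels. All sizes and module choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMIM(nn.Module):
    def __init__(self, dim=768, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.predictor = nn.Linear(dim, dim)  # lightweight latent predictor

    def forward(self, tokens):                       # tokens: (B, N, dim)
        B, N, D = tokens.shape
        with torch.no_grad():                        # detached latent targets
            targets = self.encoder(tokens)
        # Randomly mask a fixed ratio of token positions.
        num_masked = int(N * self.mask_ratio)
        idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :num_masked]
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        mask.scatter_(1, idx, True)
        corrupted = torch.where(mask.unsqueeze(-1),
                                self.mask_token.expand(B, N, D), tokens)
        pred = self.predictor(self.encoder(corrupted))
        # Loss is computed in latent space, on masked positions only.
        return F.smooth_l1_loss(pred[mask], targets[mask])
```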
arXiv Detail & Related papers (2024-07-22T17:54:41Z)
- Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models [24.579822095003685]
We conduct an empirical study on representation learning for downstream Visual Question Answering (VQA).
We thoroughly investigate the benefits and trade-offs of OC models and alternative approaches.
arXiv Detail & Related papers (2024-07-22T12:26:08Z)
- Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation [87.50120181861362]
VisionPrefer is a high-quality and fine-grained preference dataset that captures multiple preference aspects.
We train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; its preference prediction accuracy is comparable to that of human annotators.
arXiv Detail & Related papers (2024-04-23T14:53:15Z)
- Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification [0.8232137862012223]
This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information.
To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors.
Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements.
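A minimal sketch of one plausible fusion scheme follows: an LLM-derived, domain-specific embedding is concatenated with a general-purpose pre-trained embedding, and object states are classified zero-shot by cosine similarity to class prototypes. The weighting and all function names here are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: fuse LLM-derived, domain-specific embeddings with general-purpose
# pre-trained embeddings for zero-shot object state classification.
# The weighted-concatenation scheme is an illustrative assumption.
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / (np.linalg.norm(v) + 1e-12)

def fuse(general: np.ndarray, llm: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Concatenate the two normalized embedding spaces with relative weight alpha.
    return np.concatenate([alpha * normalize(general),
                           (1.0 - alpha) * normalize(llm)])

def classify(query: np.ndarray, class_protos: dict[str, np.ndarray]) -> str:
    # Zero-shot: pick the class prototype with the highest cosine similarity.
    q = normalize(query)
    return max(class_protos, key=lambda c: float(q @ normalize(class_protos[c])))
```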
arXiv Detail & Related papers (2024-03-18T18:08:44Z)
- Explore In-Context Segmentation via Latent Diffusion Models [132.26274147026854]
We show that a latent diffusion model (LDM) can serve as an effective minimalist framework for in-context segmentation.
We build a new and fair in-context segmentation benchmark that includes both image and video datasets.
arXiv Detail & Related papers (2024-03-14T17:52:31Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
arXiv Detail & Related papers (2023-11-29T16:23:06Z)
- PANet: Perspective-Aware Network with Dynamic Receptive Fields and Self-Distilling Supervision for Crowd Counting [63.84828478688975]
We propose a novel perspective-aware approach called PANet to address the perspective problem.
Based on the observation that the size of the objects varies greatly in one image due to the perspective effect, we propose the dynamic receptive fields (DRF) framework.
The framework adjusts its receptive field through dilated convolution parameters conditioned on the input image, helping the model extract more discriminative features for each local region.
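A minimal sketch of such an input-conditioned receptive field follows: a small gating network predicts soft weights over parallel 3x3 branches with different dilation rates, so the effective dilation adapts to each image. This is an illustrative reading of the DRF idea, not PANet's published implementation.

```python
# Sketch of input-conditioned dilated convolutions in the spirit of DRF:
# a gate predicts per-image soft weights over branches with different dilations.
import torch
import torch.nn as nn

class DynamicDilatedConv(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)
        self.gate = nn.Sequential(           # image-dependent branch weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(dilations), 1),
            nn.Softmax(dim=1))

    def forward(self, x):                    # x: (B, C, H, W)
        w = self.gate(x)                     # (B, K, 1, 1)
        outs = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, H, W)
        return (w.unsqueeze(2) * outs).sum(dim=1)                 # (B, C, H, W)
```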
arXiv Detail & Related papers (2021-10-31T04:43:05Z)
- VinVL: Revisiting Visual Representations in Vision-Language Models [96.39332942534368]
We develop an improved object detection model to provide object-centric representations of images.
The new visual features significantly improve performance across all vision-language (VL) tasks.
We will release the new object detection model to the public.
arXiv Detail & Related papers (2021-01-02T23:35:27Z)