Automated Hazard Detection in Construction Sites Using Large Language and Vision-Language Models
- URL: http://arxiv.org/abs/2511.15720v1
- Date: Thu, 13 Nov 2025 02:23:45 GMT
- Title: Automated Hazard Detection in Construction Sites Using Large Language and Vision-Language Models
- Authors: Islem Sahraoui
- Abstract summary: This thesis explores a multimodal AI framework for enhancing construction safety through the combined analysis of textual and visual data. Two case studies were conducted to evaluate the capabilities of large language models (LLMs) and vision-language models (VLMs) for automated hazard identification.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This thesis explores a multimodal AI framework for enhancing construction safety through the combined analysis of textual and visual data. In safety-critical environments such as construction sites, accident data often exists in multiple formats, such as written reports, inspection records, and site imagery, making it challenging to synthesize hazards using traditional approaches. To address this, the thesis proposes a multimodal AI framework that combines text and image analysis to assist in identifying safety hazards on construction sites. Two case studies were conducted to evaluate the capabilities of large language models (LLMs) and vision-language models (VLMs) for automated hazard identification. The first case study introduces a hybrid pipeline that utilizes GPT 4o and GPT 4o mini to extract structured insights from a dataset of 28,000 OSHA accident reports (2000-2025). The second case study extends this investigation using Molmo 7B and Qwen2 VL 2B, two lightweight, open-source VLMs. Using the public ConstructionSite10k dataset, the performance of the two models was evaluated on rule-level safety violation detection using natural language prompts. This experiment served as a cost-aware benchmark against proprietary models and allowed testing at scale with ground-truth labels. Despite their smaller size, Molmo 7B and Qwen2 VL 2B showed competitive performance in certain prompt configurations, reinforcing the feasibility of low-resource multimodal systems for rule-aware safety monitoring.
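As a rough illustration of what rule-level evaluation against ground-truth labels can look like, the sketch below scores a model's predicted rule violations per image. It is not the thesis's evaluation code: the function name `score_by_rule`, the integer rule IDs, and the toy image IDs are all hypothetical, and the choice of precision, recall, and F1 as metrics is an assumption.

```python
from collections import defaultdict


def score_by_rule(predictions, ground_truth):
    """Return {rule_id: (precision, recall, f1)} over all labeled images.

    Both arguments map an image ID to the set of rule IDs flagged as
    violated. The data layout is an assumption made for illustration.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for image_id, true_rules in ground_truth.items():
        predicted_rules = predictions.get(image_id, set())
        for rule in predicted_rules & true_rules:
            counts[rule]["tp"] += 1  # violation correctly flagged
        for rule in predicted_rules - true_rules:
            counts[rule]["fp"] += 1  # flagged, but not in the labels
        for rule in true_rules - predicted_rules:
            counts[rule]["fn"] += 1  # labeled violation that was missed

    scores = {}
    for rule, c in counts.items():
        flagged = c["tp"] + c["fp"]
        labeled = c["tp"] + c["fn"]
        precision = c["tp"] / flagged if flagged else 0.0
        recall = c["tp"] / labeled if labeled else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[rule] = (precision, recall, f1)
    return scores


# Hypothetical toy data: three images, rule IDs 1 to 3.
ground_truth = {"img1": {1, 2}, "img2": {1}, "img3": set()}
predictions = {"img1": {1}, "img2": {1, 3}, "img3": {2}}

for rule, (p, r, f1) in sorted(score_by_rule(predictions, ground_truth).items()):
    print(f"rule {rule}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Scoring each rule separately keeps a frequently violated rule from masking poor performance on a rare one, which matters when comparing prompt configurations.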
Related papers
- Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models [5.987458168544856]
Safe UAV emergency landing requires understanding complex semantic risks invisible to traditional geometric sensors.
We propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for context-aware landing site assessment.
arXiv Detail & Related papers (2026-02-01T11:30:03Z)
- Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure [26.434430112145137]
Laboratories are prone to severe injuries from minor unsafe actions.
Continuous safety monitoring is limited by human availability.
Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring.
arXiv Detail & Related papers (2026-01-31T00:08:41Z)
- Cross-modal Retrieval Models for Stripped Binary Analysis [62.89251403093734]
BinSeek is the first two-stage cross-modal retrieval framework for stripped binary code analysis.
It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance of the binary code.
BinSeek-Reranker learns to carefully judge the relevance of the candidate code to the description with context augmentation.
arXiv Detail & Related papers (2025-12-11T07:58:10Z)
- Automating construction safety inspections using a multi-modal vision-language RAG framework [1.737994603273206]
This study introduces SiteShield, a framework for automating construction safety inspection reports by integrating visual and audio inputs.
Using real-world data, SiteShield outperformed unimodal LLMs with an F1 score of 0.82, Hamming loss of 0.04, precision of 0.76, and recall of 0.96.
arXiv Detail & Related papers (2025-10-05T10:48:54Z)
- Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models [7.916129615051081]
We introduce a dataset comprising over 34,000 synthetic images generated by diffusion models.
The dataset includes 214 human-annotated images that serve as a gold-standard reference for validation.
arXiv Detail & Related papers (2025-06-25T07:06:29Z)
- OVERT: A Benchmark for Over-Refusal Evaluation on Text-to-Image Models [91.55634905861827]
Over-refusal is a phenomenon that reduces the practical utility of T2I models.
We present OVERT (OVEr-Refusal evaluation on Text-to-image models), the first large-scale benchmark for assessing over-refusal behaviors.
arXiv Detail & Related papers (2025-05-27T15:42:46Z)
- Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively.
The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors.
We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z)
- BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models [50.17907898478795]
We introduce BinMetric, a benchmark designed to evaluate the performance of large language models on binary analysis tasks.
BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks.
Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field.
arXiv Detail & Related papers (2025-05-12T08:54:07Z)
- Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation [52.626086874715284]
We introduce a novel problem formulation called Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs.
By leveraging abstract interpretation and reasoning about output reachable sets, our approach enables assessing multiple safety levels during the formal verification process.
Our contributions include a theoretical exploration of the relationship between our novel abstract safety formulation and existing approaches.
arXiv Detail & Related papers (2025-05-08T13:29:46Z)
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
- Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety [0.0]
We propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection.
We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations.
Our findings highlight the strengths and limitations of current vision-language-based approaches.
arXiv Detail & Related papers (2025-04-18T01:25:02Z)
- Using Vision Language Models for Safety Hazard Identification in Construction [1.2343292905447238]
We propose and experimentally validate a Vision Language Model (VLM)-based framework for the identification of construction hazards.
We evaluate state-of-the-art VLMs, including GPT-4o, Gemini, Llama 3.2, and InternVL2, using a custom dataset of 1100 construction site images.
arXiv Detail & Related papers (2025-04-12T05:11:23Z)
- Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events [5.233512464561313]
Multimodal Large Language Models (MLLMs) offer a novel approach by integrating textual, visual, and audio modalities.
Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts.
Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis.
arXiv Detail & Related papers (2024-06-19T23:50:41Z)
- Learning Traffic Crashes as Language: Datasets, Benchmarks, and What-if Causal Analyses [76.59021017301127]
We propose a large-scale traffic crash language dataset, named CrashEvent, summarizing 19,340 real-world crash reports.
We formulate crash event feature learning as a novel text reasoning problem and fine-tune various large language models (LLMs) to predict detailed accident outcomes.
Our experimental results show that our LLM-based approach not only predicts the severity of accidents but also classifies different types of accidents and predicts injury outcomes.
arXiv Detail & Related papers (2024-06-16T03:10:16Z)
- Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.