Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?
- URL: http://arxiv.org/abs/2508.11011v1
- Date: Thu, 14 Aug 2025 18:23:09 GMT
- Title: Are Large Pre-trained Vision Language Models Effective Construction Safety Inspectors?
- Authors: Xuezheng Chen, Zhengbo Zou
- Abstract summary: Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Construction safety inspections typically involve a human inspector identifying safety concerns on-site. With the rise of powerful Vision Language Models (VLMs), researchers are exploring their use for tasks such as detecting safety rule violations from on-site images. However, there is a lack of open datasets to comprehensively evaluate and further fine-tune VLMs in construction safety inspection. Current applications of VLMs use small, supervised datasets, limiting their applicability in tasks they are not directly trained for. In this paper, we propose the ConstructionSite 10k, featuring 10,000 construction site images with annotations for three inter-connected tasks, including image captioning, safety rule violation visual question answering (VQA), and construction element visual grounding. Our subsequent evaluation of current state-of-the-art large pre-trained VLMs shows notable generalization abilities in zero-shot and few-shot settings, while additional training is needed to make them applicable to actual construction sites. This dataset allows researchers to train and evaluate their own VLMs with new architectures and techniques, providing a valuable benchmark for construction safety inspection.
Related papers
- Toward Autonomous Laboratory Safety Monitoring with Vision Language Models: Learning to See Hazards Through Scene Structure [26.434430112145137]
Laboratories are prone to severe injuries from minor unsafe actions. Continuous safety monitoring is limited by human availability. Vision language models (VLMs) offer promise for autonomous laboratory safety monitoring.
arXiv Detail & Related papers (2026-01-31T00:08:41Z)
- Toward Automatic Safe Driving Instruction: A Large-Scale Vision Language Model Approach [45.45569862912077]
Large-scale Vision Language Models (LVLMs) exhibit advanced capabilities in tasks that require visual information, including object detection. In this study, we develop models and investigate the capabilities of LVLMs by constructing a dataset and evaluating their performance on this dataset. Our experimental results demonstrate that while pre-trained LVLMs have limited effectiveness, fine-tuned LVLMs can generate accurate and safety-aware driving instructions.
arXiv Detail & Related papers (2025-11-28T16:09:36Z)
- AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios [64.51320327698231]
We introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios. We develop an innovative semi-automated collaborative agent-based labeling assistant framework. We also propose HawkEyeTrack, a novel method that collaboratively enhances vision-language representation learning.
arXiv Detail & Related papers (2025-11-26T04:44:27Z)
- HomeSafeBench: A Benchmark for Embodied Vision-Language Models in Free-Exploration Home Safety Inspection [45.2338049870908]
Embodied agents can identify and report safety hazards in the home environments. Existing benchmarks suffer from two key limitations. HomeSafeBench is a benchmark with 12,900 data points covering five common home safety hazards.
arXiv Detail & Related papers (2025-09-28T07:01:27Z)
- Safety Assessment of Scaffolding on Construction Site using AI [0.0]
This paper explores the use of Artificial Intelligence (AI) and digitization to enhance the accuracy of scaffolding inspection. A cloud-based AI platform is developed to process and analyse the point cloud data of scaffolding structure.
arXiv Detail & Related papers (2025-09-22T14:43:20Z)
- Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities [54.94982467313341]
Vision-language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems. We set out to understand the limitations of SoTA VLMs on fundamental visual tasks by constructing a series of tests that probe which components of design, specifically, may be lacking.
arXiv Detail & Related papers (2025-07-10T15:26:41Z)
- AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions [76.74726258534142]
We propose AGENTSAFE, the first benchmark for evaluating the safety of embodied VLM agents under hazardous instructions. AGENTSAFE simulates realistic agent-environment interactions within a simulation sandbox. The benchmark includes 45 adversarial scenarios, 1,350 hazardous tasks, and 8,100 hazardous instructions.
arXiv Detail & Related papers (2025-06-17T16:37:35Z)
- SafeCOMM: What about Safety Alignment in Fine-Tuned Telecom Large Language Models? [74.5407418382515]
Fine-tuning large language models (LLMs) for telecom tasks and datasets is a common practice to adapt general-purpose models to the telecom domain. Recent research has shown that even benign fine-tuning can degrade the safety alignment of LLMs, causing them to respond to harmful or unethical user queries.
arXiv Detail & Related papers (2025-05-29T13:31:51Z)
- More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV [58.89234732689013]
CODrone is a comprehensive oriented object detection dataset for UAVs that accurately reflects real-world conditions. It also serves as a new benchmark designed to align with downstream task requirements. We conduct a series of experiments based on 22 classical or SOTA methods to rigorously evaluate CODrone.
arXiv Detail & Related papers (2025-04-28T17:56:02Z)
- Using Vision Language Models for Safety Hazard Identification in Construction [1.2343292905447238]
We propose and experimentally validate a Vision Language Model (VLM)-based framework for the identification of construction hazards. We evaluate state-of-the-art VLMs, including GPT-4o, Gemini, Llama 3.2, and InternVL2, using a custom dataset of 1100 construction site images.
arXiv Detail & Related papers (2025-04-12T05:11:23Z)
- Construction Site Scaffolding Completeness Detection Based on Mask R-CNN and Hough Transform [2.7309692684728617]
This paper proposes a deep learning-based approach to detect the scaffolding and its cross braces using computer vision. A scaffold image dataset with annotated labels is used to train a convolutional neural network (CNN) model.
arXiv Detail & Related papers (2025-03-18T20:27:22Z)
- AR-Facilitated Safety Inspection and Fall Hazard Detection on Construction Sites [17.943278018516416]
We are exploring the potential of head-mounted augmented reality to facilitate safety inspections on high-rise construction sites. A particular concern in the industry is inspecting perimeter safety screens on higher levels of construction sites, intended to prevent falls of people and objects. We aim to support workers performing this inspection task by tracking which parts of the safety screens have been inspected. We use machine learning to automatically detect gaps in the perimeter screens that require closer inspection and remediation and to automate reporting.
arXiv Detail & Related papers (2024-12-02T08:38:43Z)
- A Deep Learning Approach to Detect Complete Safety Equipment For Construction Workers Based On YOLOv7 [0.0]
In this study, a deep learning-based technique is presented for identifying safety gear worn by construction workers.
The recommended approach uses the YOLO v7 object detection algorithm to precisely locate these safety items.
Our trained model performed admirably well, with good precision, recall, and F1-score for safety equipment recognition.
arXiv Detail & Related papers (2024-06-11T20:38:41Z)
- Uncovering the Inner Workings of STEGO for Safe Unsupervised Semantic Segmentation [68.8204255655161]
Self-supervised pre-training strategies have recently shown impressive results for training general-purpose feature extraction backbones in computer vision.
The DINO self-distillation technique has interesting emerging properties, such as unsupervised clustering in the latent space and semantic correspondences of the produced features without using explicit human-annotated labels.
The STEGO method for unsupervised semantic segmentation contrast distills feature correspondences of a DINO-pre-trained Vision Transformer and recently set a new state of the art.
arXiv Detail & Related papers (2023-04-14T15:30:26Z)