Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces
- URL: http://arxiv.org/abs/2408.07146v1
- Date: Tue, 13 Aug 2024 18:32:06 GMT
- Title: Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces
- Authors: Zhiling Chen, Hanning Chen, Mohsen Imani, Ruimin Chen, Farhad Imani
- Abstract summary: We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance.
The framework comprises four main modules: scene recognition, visual prompt, safety items detection, and fine-grained verification.
The results show that Clip2Safety not only improves accuracy over state-of-the-art question-answering-based VLMs but also runs inference two hundred times faster.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to the limitations of traditional object detection in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, which require them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, visual prompt, safety items detection, and fine-grained verification. The scene recognition module identifies the current scenario to determine the necessary safety gear. The visual prompt module formulates the specific visual prompts needed for the detection process. The safety items detection module identifies whether the required safety gear is being worn according to the specified scenario. Lastly, the fine-grained verification module assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only improves accuracy over state-of-the-art question-answering-based VLMs but also runs inference two hundred times faster.
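The abstract gives only a prose description of the pipeline, so the following is a minimal sketch of how a four-stage, CLIP-based compliance check could be wired together. The scene list, gear mapping, attribute prompts, and all names (`SCENE_GEAR`, `GEAR_ATTRIBUTES`, `clip_pick`, `check_compliance`) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a Clip2Safety-style, four-stage compliance check built
# on zero-shot CLIP scoring. The scene list, gear mapping, and attribute
# prompts are illustrative placeholders, not the paper's configuration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical scene -> required-gear mapping (stages 1-2: scene recognition
# and visual-prompt formulation).
SCENE_GEAR = {
    "construction site": ["hard hat", "high-visibility vest"],
    "chemical laboratory": ["safety goggles", "lab coat"],
}
# Hypothetical fine-grained attribute prompts (stage 4).
GEAR_ATTRIBUTES = {
    "hard hat": ["a worker wearing a properly fastened hard hat",
                 "a worker wearing a damaged or loose hard hat"],
}

def clip_pick(image: Image.Image, prompts: list[str]) -> tuple[str, float]:
    """Return the best-matching prompt and its softmax probability."""
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    best = int(probs.argmax())
    return prompts[best], float(probs[best])

def check_compliance(image: Image.Image) -> dict:
    scene, _ = clip_pick(image, list(SCENE_GEAR))                # stage 1
    report = {"scene": scene, "gear": {}}
    for gear in SCENE_GEAR[scene]:                               # stages 2-3
        worn, p = clip_pick(image, [f"a worker wearing a {gear}",
                                    f"a worker not wearing a {gear}"])
        entry = {"worn": worn.startswith("a worker wearing a"), "p": p}
        if entry["worn"] and gear in GEAR_ATTRIBUTES:            # stage 4
            entry["attribute"] = clip_pick(image, GEAR_ATTRIBUTES[gear])
        report["gear"][gear] = entry
    return report
```

Because every stage reduces to a single CLIP forward pass over a handful of prompts, the whole check avoids autoregressive decoding, which is one plausible reason for the reported speed advantage over question-answering VLMs.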
Related papers
- Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models (arXiv, 2025-01-30)
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance.
Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS, the proposed dataset, significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
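The abstract does not show the dataset format; below is a hedged sketch of what a multi-image, CoT-labeled safety record could look like. All field names (`SafetyCoTRecord`, `cot`, `verdict`) are invented for illustration and are not the MIS schema.

```python
# Illustrative shape of a multi-image safety fine-tuning record with a
# Chain-of-Thought label. Field names are hypothetical, not the MIS schema.
from dataclasses import dataclass

@dataclass
class SafetyCoTRecord:
    image_paths: list[str]   # the multi-image input
    question: str            # a safety query spanning all images
    cot: list[str]           # step-by-step safety reasoning (the CoT label)
    verdict: str             # the final supervised answer

record = SafetyCoTRecord(
    image_paths=["img_0.jpg", "img_1.jpg"],
    question="Is it safe to follow the procedure shown across these images?",
    cot=[
        "Image 0 shows an unlabeled chemical container.",
        "Image 1 shows instructions to mix it with a household cleaner.",
        "Mixing unknown chemicals with cleaners can release toxic gas.",
    ],
    verdict="Unsafe: refuse and explain the hazard.",
)
# Serialized as (prompt, target) pairs, such records supervise the model's
# intermediate reasoning, not only its final safe/unsafe answer.
```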
- MLLM-as-a-Judge for Image Safety without Human Labeling (arXiv, 2024-12-31)
In the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content.
It is crucial to identify such unsafe images based on established safety rules.
Existing approaches typically fine-tune MLLMs with human-labeled datasets; this work instead judges images directly against the established rules, without human labeling.
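As a rough illustration of label-free, rule-conditioned judging, the sketch below asks a multimodal model one yes/no question per written rule. `query_mllm`, the rule texts, and the YES/NO parsing convention are placeholders, not the paper's protocol.

```python
# Sketch of judging image safety against written rules with no human labels.
# The rules and parsing convention are illustrative, not the paper's exact
# protocol; `query_mllm` is a stand-in for any multimodal chat model API.
SAFETY_RULES = [
    "R1: The image must not depict graphic violence.",
    "R2: The image must not contain sexually explicit content.",
    "R3: The image must not show instructions for making weapons.",
]

def query_mllm(image_bytes: bytes, prompt: str) -> str:
    """Stand-in for any multimodal chat model API."""
    raise NotImplementedError("plug in a real MLLM here")

def judge_image(image_bytes: bytes) -> dict[str, bool]:
    """Ask one yes/no question per rule; any YES marks the image unsafe."""
    verdicts = {}
    for rule in SAFETY_RULES:
        prompt = (f"Safety rule: {rule}\n"
                  "Does the image violate this rule? Answer YES or NO only.")
        answer = query_mllm(image_bytes, prompt).strip().upper()
        verdicts[rule.split(":")[0]] = answer.startswith("YES")
    return verdicts
```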
- On the Black-box Explainability of Object Detection Models for Safe and Trustworthy Industrial Applications (arXiv, 2024-10-28)
We focus on model-agnostic explainability methods for object detection models and propose D-MFPP, an extension of the Morphological Fragmental Perturbation Pyramid (MFPP) technique, to generate explanations.
We evaluate these methods on real-world industrial and robotic datasets, examining the influence of parameters such as the number of masks, model size, and image resolution on the quality of explanations.
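The sketch below shows the general shape of such mask-based, black-box explanation: random coarse masks perturb the image, and each pixel is credited by the detection scores of the masked inputs that kept it. It is a generic single-scale variant under assumed names (`detector_score`, `perturbation_saliency`), not the paper's D-MFPP implementation.

```python
# Generic black-box, perturbation-based explanation for a detector, in the
# spirit of mask-pyramid methods (not the paper's D-MFPP implementation).
import numpy as np

def detector_score(image: np.ndarray) -> float:
    """Stand-in for the black-box detector: confidence for the target object."""
    raise NotImplementedError("plug in a real detector here")

def perturbation_saliency(image: np.ndarray, n_masks: int = 500,
                          cells: int = 8, keep: float = 0.5,
                          seed: int = 0) -> np.ndarray:
    """Credit each pixel by the detection scores of masked inputs that kept it."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    saliency = np.zeros((h, w))
    for _ in range(n_masks):
        coarse = rng.random((cells, cells)) < keep          # coarse on/off grid
        up = np.kron(coarse, np.ones((h // cells + 1, w // cells + 1)))
        mask = up[:h, :w]                                   # upsample and crop
        score = detector_score(image * mask[..., None])     # perturbed input
        saliency += score * mask
    return saliency / n_masks
```

The number of masks and the grid granularity are exactly the kinds of parameters whose influence on explanation quality the paper studies.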
arXiv Detail & Related papers (2024-10-28T13:28:05Z) - Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
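A minimal sketch of what a situational-safety item and metric could look like follows; the field names and scoring rule are assumptions for illustration, not MSSBench's actual schema.

```python
# Illustrative shape of a situational-safety evaluation item: the same
# language query is safe in one visual context and unsafe in another.
# Field names and scoring are invented; MSSBench's schema may differ.
from dataclasses import dataclass

@dataclass
class SituationalItem:
    query: str
    image_path: str       # the visual context the model must account for
    safe_in_context: bool

items = [
    SituationalItem("How can I go faster?", "cycling_on_empty_trail.jpg", True),
    SituationalItem("How can I go faster?", "cycling_in_heavy_traffic.jpg", False),
]

def situational_accuracy(responds_with_caution, items) -> float:
    """Credit the model when it adds caution exactly in the contexts where
    the same query is unsafe."""
    hits = sum(responds_with_caution(i.image_path, i.query) != i.safe_in_context
               for i in items)
    return hits / len(items)
```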
- SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models (arXiv, 2024-02-06)
We introduce a new benchmark, namely SHIELD, to evaluate the ability of MLLMs on face spoofing and forgery detection.
We design true/false and multiple-choice questions to evaluate multimodal face data in these two face security tasks.
The results indicate that MLLMs hold substantial potential in the face security domain.
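As a hedged illustration of the two question formats, the snippet below builds true/false and multiple-choice prompts and scores exact-match answers; the wording and scoring are placeholders rather than SHIELD's released question set.

```python
# Sketch of the two question formats used to probe MLLMs on face security
# tasks. Question wording and scoring are illustrative, not SHIELD's set.
def tf_question(task: str) -> str:
    return (f"Look at the face image. True or false: this image contains a "
            f"{task} attack. Answer TRUE or FALSE.")

def mc_question(task: str, options: list[str]) -> str:
    lines = [f"Which option best describes the face image for {task}?"]
    lines += [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def accuracy(answers: list[str], gold: list[str]) -> float:
    """Exact-match accuracy over normalized answers."""
    pairs = zip(answers, gold)
    return sum(a.strip().upper().startswith(g.upper()) for a, g in pairs) / len(gold)

print(mc_question("forgery detection",
                  ["genuine face", "face swap", "attribute edit", "fully synthetic"]))
```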
- Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection (arXiv, 2023-11-02)
We introduce a novel approach to utilize the strengths of large language models in understanding contextual appearance variations.
We propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection.
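One plausible reading of this idea is late fusion: score each detection crop against language-derived appearance phrases and blend that evidence with the detector score. The sketch below does this with off-the-shelf CLIP; the phrase list, support labels, and the `alpha` blend are assumptions, not the paper's formulation.

```python
# Sketch of fusing language-derived appearance elements with a detector's
# visual score via CLIP similarity. Phrases and fusion rule are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Appearance elements one might elicit from an LLM for "pedestrian", paired
# with whether each phrase supports the pedestrian hypothesis.
APPEARANCE = ["a person walking", "a person carrying a bag",
              "a person in a coat seen from behind", "a mannequin in a window"]
SUPPORTS_PEDESTRIAN = [True, True, True, False]

def rescore(crop: Image.Image, detector_score: float, alpha: float = 0.5) -> float:
    """Blend a detector's box score with text-similarity evidence for the crop."""
    inputs = processor(text=APPEARANCE, images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    text_evidence = float(sum(p for p, ok in zip(probs, SUPPORTS_PEDESTRIAN) if ok))
    return alpha * detector_score + (1 - alpha) * text_evidence
```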
- Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes (arXiv, 2021-08-19)
Vision systems that deploy Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples.
Recent research has shown that checking the intrinsic consistencies in the input data is a promising way to detect adversarial attacks.
We develop a novel approach to perform context consistency checks using language models.
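A toy version of such a context-consistency check is sketched below, with a hand-written co-occurrence table standing in for the language model: scenes whose detected object sets are jointly implausible get flagged. The table values and threshold are invented for illustration.

```python
# Sketch of a context-consistency check over a scene's detected objects.
# A small co-occurrence table stands in for the language model; values and
# threshold are invented for illustration.
from itertools import combinations

# Hypothetical plausibility of object pairs appearing together (0..1).
COOCCUR = {
    frozenset({"car", "stop sign"}): 0.9,
    frozenset({"car", "person"}): 0.8,
    frozenset({"stop sign", "toaster"}): 0.05,
}

def consistency(objects: list[str], default: float = 0.5) -> float:
    pairs = list(combinations(set(objects), 2))
    if not pairs:
        return 1.0
    scores = [COOCCUR.get(frozenset(p), default) for p in pairs]
    return sum(scores) / len(scores)

def looks_adversarial(objects: list[str], threshold: float = 0.3) -> bool:
    """Low joint plausibility suggests a label was flipped by an attack."""
    return consistency(objects) < threshold

print(looks_adversarial(["car", "stop sign"]))      # False: consistent scene
print(looks_adversarial(["stop sign", "toaster"]))  # True: implausible pair
```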
- I-ViSE: Interactive Video Surveillance as an Edge Service using Unsupervised Feature Queries (arXiv, 2020-03-09)
This paper proposes Interactive Video Surveillance as an Edge service (I-ViSE), based on unsupervised feature queries.
An I-ViSE prototype is built following the edge-fog computing paradigm, and experimental results verify that the scheme meets its design goal of scene recognition in under two seconds.
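To make "unsupervised feature queries" concrete, the sketch below matches frames against a query clip using plain color-histogram descriptors, which need no labels or training. It illustrates the idea only and is not I-ViSE's implementation.

```python
# Sketch of an unsupervised feature query over surveillance frames: rank
# frames by color-histogram similarity to a query frame, with no labels or
# training. Illustrative only; not I-ViSE's code.
import numpy as np

def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalized joint RGB histogram as an unsupervised frame descriptor."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins,) * 3, range=[(0, 256)] * 3)
    return (hist / hist.sum()).ravel()

def query_frames(query: np.ndarray, frames: list[np.ndarray], top_k: int = 5):
    """Return indices of the top_k frames most similar to the query."""
    q = color_histogram(query)
    # Histogram intersection: higher means more similar.
    sims = [np.minimum(q, color_histogram(f)).sum() for f in frames]
    return sorted(range(len(frames)), key=lambda i: -sims[i])[:top_k]
```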
- DEEVA: A Deep Learning and IoT Based Computer Vision System to Address Safety and Security of Production Sites in Energy Industry (arXiv, 2020-03-02)
This paper tackles computer vision problems such as scene classification, object detection, semantic segmentation, and scene captioning.
We developed Deep ExxonMobil Eye for Video Analysis (DEEVA) package to handle scene classification, object detection, semantic segmentation and captioning of scenes.
The results reveal that transfer learning with the RetinaNet object detector detects the presence of workers, different types of vehicles and construction equipment, and safety-related objects with high accuracy (above 90%).
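A minimal version of the detection side can be reproduced with torchvision's pretrained RetinaNet, as sketched below; the COCO class subset and confidence threshold are illustrative, whereas DEEVA fine-tunes on its own site imagery.

```python
# Sketch of RetinaNet-based detection of site-safety-relevant objects using
# torchvision's COCO-pretrained model. Class subset and threshold are
# illustrative; DEEVA applies transfer learning on its own site data.
import torch
from torchvision.io import read_image
from torchvision.models.detection import (RetinaNet_ResNet50_FPN_Weights,
                                          retinanet_resnet50_fpn)

weights = RetinaNet_ResNet50_FPN_Weights.DEFAULT
model = retinanet_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]
RELEVANT = {"person", "truck", "car"}  # workers and vehicles, in COCO terms

image = read_image("site_frame.jpg").float() / 255.0  # CHW float in [0, 1]
with torch.no_grad():
    detections = model([image])[0]

for label, score, box in zip(detections["labels"], detections["scores"],
                             detections["boxes"]):
    name = categories[int(label)]
    if name in RELEVANT and float(score) > 0.5:
        print(f"{name}: {float(score):.2f} at {box.tolist()}")
```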