Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces
- URL: http://arxiv.org/abs/2408.07146v1
- Date: Tue, 13 Aug 2024 18:32:06 GMT
- Title: Vision Language Model for Interpretable and Fine-grained Detection of Safety Compliance in Diverse Workplaces
- Authors: Zhiling Chen, Hanning Chen, Mohsen Imani, Ruimin Chen, Farhad Imani
- Abstract summary: We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance.
The framework comprises four main modules: scene recognition, visual prompt, safety items detection, and fine-grained verification.
The results show that Clip2Safety not only improves accuracy over state-of-the-art question-answering-based VLMs but also runs inference two hundred times faster.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Workplace accidents due to personal protective equipment (PPE) non-compliance raise serious safety concerns and lead to legal liabilities, financial penalties, and reputational damage. While object detection models have shown the capability to address this issue by identifying safety items, most existing models, such as YOLO, Faster R-CNN, and SSD, are limited in verifying the fine-grained attributes of PPE across diverse workplace scenarios. Vision language models (VLMs) are gaining traction for detection tasks by leveraging the synergy between visual and textual information, offering a promising solution to the limitations of traditional object detection in PPE recognition. Nonetheless, VLMs face challenges in consistently verifying PPE attributes due to the complexity and variability of workplace environments, which require them to interpret context-specific language and visual cues simultaneously. We introduce Clip2Safety, an interpretable detection framework for diverse workplace safety compliance, which comprises four main modules: scene recognition, visual prompt, safety items detection, and fine-grained verification. The scene recognition module identifies the current scenario to determine the necessary safety gear. The visual prompt module formulates the specific visual prompts needed for the detection process. The safety items detection module identifies whether the required safety gear is being worn according to the specified scenario. Lastly, the fine-grained verification module assesses whether the worn safety equipment meets the fine-grained attribute requirements. We conduct real-world case studies across six different scenarios. The results show that Clip2Safety not only improves accuracy over state-of-the-art question-answering-based VLMs but also runs inference two hundred times faster.
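The abstract gives only a prose description of the pipeline, so the following is a minimal sketch of how a four-stage, CLIP-based compliance check could be wired together. The scene list, gear mapping, attribute prompts, and all names (`SCENE_GEAR`, `GEAR_ATTRIBUTES`, `clip_pick`, `check_compliance`) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a Clip2Safety-style, four-stage compliance check built
# on zero-shot CLIP scoring. The scene list, gear mapping, and attribute
# prompts are illustrative placeholders, not the paper's configuration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical scene -> required-gear mapping (stages 1-2: scene recognition
# and visual-prompt formulation).
SCENE_GEAR = {
    "construction site": ["hard hat", "high-visibility vest"],
    "chemical laboratory": ["safety goggles", "lab coat"],
}
# Hypothetical fine-grained attribute prompts (stage 4).
GEAR_ATTRIBUTES = {
    "hard hat": ["a worker wearing a properly fastened hard hat",
                 "a worker wearing a damaged or loose hard hat"],
}

def clip_pick(image: Image.Image, prompts: list[str]) -> tuple[str, float]:
    """Return the best-matching prompt and its softmax probability."""
    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    best = int(probs.argmax())
    return prompts[best], float(probs[best])

def check_compliance(image: Image.Image) -> dict:
    scene, _ = clip_pick(image, list(SCENE_GEAR))                # stage 1
    report = {"scene": scene, "gear": {}}
    for gear in SCENE_GEAR[scene]:                               # stages 2-3
        worn, p = clip_pick(image, [f"a worker wearing a {gear}",
                                    f"a worker not wearing a {gear}"])
        entry = {"worn": worn.startswith("a worker wearing a"), "p": p}
        if entry["worn"] and gear in GEAR_ATTRIBUTES:            # stage 4
            entry["attribute"] = clip_pick(image, GEAR_ATTRIBUTES[gear])
        report["gear"][gear] = entry
    return report
```

Because every stage reduces to a single CLIP forward pass over a handful of prompts, the whole check avoids autoregressive decoding, which is one plausible reason for the reported speed advantage over question-answering VLMs.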
Related papers
- Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models (arXiv, 2025-01-30)
We propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance.
Our experiments demonstrate that fine-tuning InternVL2.5-8B with MIS, the proposed dataset, significantly outperforms both powerful open-source models and API-based models in challenging multi-image tasks.
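The abstract does not show the dataset format; below is a hedged sketch of what a multi-image, CoT-labeled safety record could look like. All field names (`SafetyCoTRecord`, `cot`, `verdict`) are invented for illustration and are not the MIS schema.

```python
# Illustrative shape of a multi-image safety fine-tuning record with a
# Chain-of-Thought label. Field names are hypothetical, not the MIS schema.
from dataclasses import dataclass

@dataclass
class SafetyCoTRecord:
    image_paths: list[str]   # the multi-image input
    question: str            # a safety query spanning all images
    cot: list[str]           # step-by-step safety reasoning (the CoT label)
    verdict: str             # the final supervised answer

record = SafetyCoTRecord(
    image_paths=["img_0.jpg", "img_1.jpg"],
    question="Is it safe to follow the procedure shown across these images?",
    cot=[
        "Image 0 shows an unlabeled chemical container.",
        "Image 1 shows instructions to mix it with a household cleaner.",
        "Mixing unknown chemicals with cleaners can release toxic gas.",
    ],
    verdict="Unsafe: refuse and explain the hazard.",
)
# Serialized as (prompt, target) pairs, such records supervise the model's
# intermediate reasoning, not only its final safe/unsafe answer.
```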
- MLLM-as-a-Judge for Image Safety without Human Labeling (arXiv, 2024-12-31)
In the age of AI-generated content (AIGC), many image generation models are capable of producing harmful content.
It is crucial to identify such unsafe images based on established safety rules.
Existing approaches typically fine-tune MLLMs with human-labeled datasets; this work instead judges images directly against the established rules, without human labeling.
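As a rough illustration of label-free, rule-conditioned judging, the sketch below asks a multimodal model one yes/no question per written rule. `query_mllm`, the rule texts, and the YES/NO parsing convention are placeholders, not the paper's protocol.

```python
# Sketch of judging image safety against written rules with no human labels.
# The rules and parsing convention are illustrative, not the paper's exact
# protocol; `query_mllm` is a stand-in for any multimodal chat model API.
SAFETY_RULES = [
    "R1: The image must not depict graphic violence.",
    "R2: The image must not contain sexually explicit content.",
    "R3: The image must not show instructions for making weapons.",
]

def query_mllm(image_bytes: bytes, prompt: str) -> str:
    """Stand-in for any multimodal chat model API."""
    raise NotImplementedError("plug in a real MLLM here")

def judge_image(image_bytes: bytes) -> dict[str, bool]:
    """Ask one yes/no question per rule; any YES marks the image unsafe."""
    verdicts = {}
    for rule in SAFETY_RULES:
        prompt = (f"Safety rule: {rule}\n"
                  "Does the image violate this rule? Answer YES or NO only.")
        answer = query_mllm(image_bytes, prompt).strip().upper()
        verdicts[rule.split(":")[0]] = answer.startswith("YES")
    return verdicts
```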
- On the Black-box Explainability of Object Detection Models for Safe and Trustworthy Industrial Applications (arXiv, 2024-10-28)
We focus on model-agnostic explainability methods for object detection models and propose D-MFPP, an extension of the Morphological Fragmental Perturbation Pyramid (MFPP) technique, to generate explanations.
We evaluate these methods on real-world industrial and robotic datasets, examining the influence of parameters such as the number of masks, model size, and image resolution on the quality of explanations.
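The sketch below shows the general shape of such mask-based, black-box explanation: random coarse masks perturb the image, and each pixel is credited by the detection scores of the masked inputs that kept it. It is a generic single-scale variant under assumed names (`detector_score`, `perturbation_saliency`), not the paper's D-MFPP implementation.

```python
# Generic black-box, perturbation-based explanation for a detector, in the
# spirit of mask-pyramid methods (not the paper's D-MFPP implementation).
import numpy as np

def detector_score(image: np.ndarray) -> float:
    """Stand-in for the black-box detector: confidence for the target object."""
    raise NotImplementedError("plug in a real detector here")

def perturbation_saliency(image: np.ndarray, n_masks: int = 500,
                          cells: int = 8, keep: float = 0.5,
                          seed: int = 0) -> np.ndarray:
    """Credit each pixel by the detection scores of masked inputs that kept it."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    saliency = np.zeros((h, w))
    for _ in range(n_masks):
        coarse = rng.random((cells, cells)) < keep          # coarse on/off grid
        up = np.kron(coarse, np.ones((h // cells + 1, w // cells + 1)))
        mask = up[:h, :w]                                   # upsample and crop
        score = detector_score(image * mask[..., None])     # perturbed input
        saliency += score * mask
    return saliency / n_masks
```

The number of masks and the grid granularity are exactly the kinds of parameters whose influence on explanation quality the paper studies.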
arXiv Detail & Related papers (2024-10-28T13:28:05Z) - Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
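A minimal sketch of what a situational-safety item and metric could look like follows; the field names and scoring rule are assumptions for illustration, not MSSBench's actual schema.

```python
# Illustrative shape of a situational-safety evaluation item: the same
# language query is safe in one visual context and unsafe in another.
# Field names and scoring are invented; MSSBench's schema may differ.
from dataclasses import dataclass

@dataclass
class SituationalItem:
    query: str
    image_path: str       # the visual context the model must account for
    safe_in_context: bool

items = [
    SituationalItem("How can I go faster?", "cycling_on_empty_trail.jpg", True),
    SituationalItem("How can I go faster?", "cycling_in_heavy_traffic.jpg", False),
]

def situational_accuracy(responds_with_caution, items) -> float:
    """Credit the model when it adds caution exactly in the contexts where
    the same query is unsafe."""
    hits = sum(responds_with_caution(i.image_path, i.query) != i.safe_in_context
               for i in items)
    return hits / len(items)
```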
- SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models (arXiv, 2024-02-06)
We introduce a new benchmark, namely SHIELD, to evaluate the ability of MLLMs on face spoofing and forgery detection.
We design true/false and multiple-choice questions to evaluate multimodal face data in these two face security tasks.
The results indicate that MLLMs hold substantial potential in the face security domain.
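As a hedged illustration of the two question formats, the snippet below builds true/false and multiple-choice prompts and scores exact-match answers; the wording and scoring are placeholders rather than SHIELD's released question set.

```python
# Sketch of the two question formats used to probe MLLMs on face security
# tasks. Question wording and scoring are illustrative, not SHIELD's set.
def tf_question(task: str) -> str:
    return (f"Look at the face image. True or false: this image contains a "
            f"{task} attack. Answer TRUE or FALSE.")

def mc_question(task: str, options: list[str]) -> str:
    lines = [f"Which option best describes the face image for {task}?"]
    lines += [f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def accuracy(answers: list[str], gold: list[str]) -> float:
    """Exact-match accuracy over normalized answers."""
    pairs = zip(answers, gold)
    return sum(a.strip().upper().startswith(g.upper()) for a, g in pairs) / len(gold)

print(mc_question("forgery detection",
                  ["genuine face", "face swap", "attribute edit", "fully synthetic"]))
```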
- Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection (arXiv, 2023-11-02)
We introduce a novel approach to utilize the strengths of large language models in understanding contextual appearance variations.
We propose to formulate language-derived appearance elements and incorporate them with visual cues in pedestrian detection.
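One plausible reading of this idea is late fusion: score each detection crop against language-derived appearance phrases and blend that evidence with the detector score. The sketch below does this with off-the-shelf CLIP; the phrase list, support labels, and the `alpha` blend are assumptions, not the paper's formulation.

```python
# Sketch of fusing language-derived appearance elements with a detector's
# visual score via CLIP similarity. Phrases and fusion rule are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Appearance elements one might elicit from an LLM for "pedestrian", paired
# with whether each phrase supports the pedestrian hypothesis.
APPEARANCE = ["a person walking", "a person carrying a bag",
              "a person in a coat seen from behind", "a mannequin in a window"]
SUPPORTS_PEDESTRIAN = [True, True, True, False]

def rescore(crop: Image.Image, detector_score: float, alpha: float = 0.5) -> float:
    """Blend a detector's box score with text-similarity evidence for the crop."""
    inputs = processor(text=APPEARANCE, images=crop,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    text_evidence = float(sum(p for p, ok in zip(probs, SUPPORTS_PEDESTRIAN) if ok))
    return alpha * detector_score + (1 - alpha) * text_evidence
```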
- Exploiting Multi-Object Relationships for Detecting Adversarial Attacks in Complex Scenes (arXiv, 2021-08-19)
Vision systems that deploy Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples.
Recent research has shown that checking the intrinsic consistencies in the input data is a promising way to detect adversarial attacks.
We develop a novel approach to perform context consistency checks using language models.
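A toy version of such a context-consistency check is sketched below, with a hand-written co-occurrence table standing in for the language model: scenes whose detected object sets are jointly implausible get flagged. The table values and threshold are invented for illustration.

```python
# Sketch of a context-consistency check over a scene's detected objects.
# A small co-occurrence table stands in for the language model; values and
# threshold are invented for illustration.
from itertools import combinations

# Hypothetical plausibility of object pairs appearing together (0..1).
COOCCUR = {
    frozenset({"car", "stop sign"}): 0.9,
    frozenset({"car", "person"}): 0.8,
    frozenset({"stop sign", "toaster"}): 0.05,
}

def consistency(objects: list[str], default: float = 0.5) -> float:
    pairs = list(combinations(set(objects), 2))
    if not pairs:
        return 1.0
    scores = [COOCCUR.get(frozenset(p), default) for p in pairs]
    return sum(scores) / len(scores)

def looks_adversarial(objects: list[str], threshold: float = 0.3) -> bool:
    """Low joint plausibility suggests a label was flipped by an attack."""
    return consistency(objects) < threshold

print(looks_adversarial(["car", "stop sign"]))      # False: consistent scene
print(looks_adversarial(["stop sign", "toaster"]))  # True: implausible pair
```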
- I-ViSE: Interactive Video Surveillance as an Edge Service using Unsupervised Feature Queries (arXiv, 2020-03-09)
This paper proposes Interactive Video Surveillance as an Edge service (I-ViSE), based on unsupervised feature queries.
An I-ViSE prototype is built following the edge-fog computing paradigm, and experimental results verify that the scheme meets its design goal of scene recognition in under two seconds.
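To make "unsupervised feature queries" concrete, the sketch below matches frames against a query clip using plain color-histogram descriptors, which need no labels or training. It illustrates the idea only and is not I-ViSE's implementation.

```python
# Sketch of an unsupervised feature query over surveillance frames: rank
# frames by color-histogram similarity to a query frame, with no labels or
# training. Illustrative only; not I-ViSE's code.
import numpy as np

def color_histogram(frame: np.ndarray, bins: int = 8) -> np.ndarray:
    """Normalized joint RGB histogram as an unsupervised frame descriptor."""
    hist, _ = np.histogramdd(frame.reshape(-1, 3),
                             bins=(bins,) * 3, range=[(0, 256)] * 3)
    return (hist / hist.sum()).ravel()

def query_frames(query: np.ndarray, frames: list[np.ndarray], top_k: int = 5):
    """Return indices of the top_k frames most similar to the query."""
    q = color_histogram(query)
    # Histogram intersection: higher means more similar.
    sims = [np.minimum(q, color_histogram(f)).sum() for f in frames]
    return sorted(range(len(frames)), key=lambda i: -sims[i])[:top_k]
```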
- DEEVA: A Deep Learning and IoT Based Computer Vision System to Address Safety and Security of Production Sites in Energy Industry (arXiv, 2020-03-02)
This paper tackles computer vision problems such as scene classification, object detection, semantic segmentation, and scene captioning.
We developed Deep ExxonMobil Eye for Video Analysis (DEEVA) package to handle scene classification, object detection, semantic segmentation and captioning of scenes.
The results reveal that transfer learning with the RetinaNet object detector detects the presence of workers, different types of vehicles and construction equipment, and safety-related objects with high accuracy (above 90%).
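A minimal version of the detection side can be reproduced with torchvision's pretrained RetinaNet, as sketched below; the COCO class subset and confidence threshold are illustrative, whereas DEEVA fine-tunes on its own site imagery.

```python
# Sketch of RetinaNet-based detection of site-safety-relevant objects using
# torchvision's COCO-pretrained model. Class subset and threshold are
# illustrative; DEEVA applies transfer learning on its own site data.
import torch
from torchvision.io import read_image
from torchvision.models.detection import (RetinaNet_ResNet50_FPN_Weights,
                                          retinanet_resnet50_fpn)

weights = RetinaNet_ResNet50_FPN_Weights.DEFAULT
model = retinanet_resnet50_fpn(weights=weights).eval()
categories = weights.meta["categories"]
RELEVANT = {"person", "truck", "car"}  # workers and vehicles, in COCO terms

image = read_image("site_frame.jpg").float() / 255.0  # CHW float in [0, 1]
with torch.no_grad():
    detections = model([image])[0]

for label, score, box in zip(detections["labels"], detections["scores"],
                             detections["boxes"]):
    name = categories[int(label)]
    if name in RELEVANT and float(score) > 0.5:
        print(f"{name}: {float(score):.2f} at {box.tolist()}")
```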