Related papers: T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition

URL: http://arxiv.org/abs/2409.19734v2
Date: Wed, 2 Oct 2024 08:44:40 GMT
Title: T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition
Authors: Chen Yeh, You-Ming Chang, Wei-Chen Chiu, Ning Yu
Abstract summary: Existing harmful datasets are curated by the presence of a narrow range of harmful objects. This hinders the generalizability of methods based on such datasets, potentially leading to misjudgments. We propose a comprehensive harmful dataset, consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models.
Score: 24.78672820633581
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: To address the risks of encountering inappropriate or harmful content, researchers have combined several harmful-content datasets with machine learning methods to detect harmful concepts. However, existing harmful datasets are curated by the presence of a narrow range of harmful objects and cover only real harmful content sources. This hinders the generalizability of methods based on such datasets, potentially leading to misjudgments. Therefore, we propose a comprehensive harmful dataset, Visual Harmful Dataset 11K (VHD11K), consisting of 10,000 images and 1,000 videos, crawled from the Internet and generated by 4 generative models, across a total of 10 harmful categories covering a full spectrum of harmful concepts with nontrivial definitions. We also propose a novel annotation framework that formulates the annotation process as a multi-agent Visual Question Answering (VQA) task: 3 different VLMs "debate" whether the given image/video is harmful, and the in-context learning strategy is incorporated into the debating process. In this way, we can ensure that the VLMs consider the context of the given image/video and both sides of the argument thoroughly before making decisions, further reducing the likelihood of misjudgments in edge cases. Evaluation and experimental results demonstrate that (1) the annotations from our novel annotation framework align closely with those from humans, ensuring the reliability of VHD11K; (2) our full-spectrum harmful dataset successfully identifies the inability of existing harmful content detection methods to detect extensive harmful contents and improves the performance of existing harmfulness recognition methods; (3) VHD11K outperforms the baseline dataset, SMID, as evidenced by the larger improvement in harmfulness recognition methods. The complete dataset and code can be found at https://github.com/nctu-eva-lab/VHD11K.
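Illustrative sketch: the debate-style annotation described in the abstract could be organized roughly as follows. This is not the authors' implementation. The split of the 3 VLMs into an affirmative debater, a negative debater, and a judge is an assumption, as are the names `debate_label`, `Agent`, the number of rounds, and the stub agents, which stand in for real VLM calls.

```python
from typing import Callable, List, Tuple

# Hypothetical agent signature: (item description, transcript so far) -> reply text.
# A real agent would wrap a VLM call; none is made here.
Agent = Callable[[str, List[str]], str]


def debate_label(
    item: str,
    affirmative: Agent,
    negative: Agent,
    judge: Agent,
    in_context_examples: List[str],
    rounds: int = 2,
) -> Tuple[bool, List[str]]:
    """Return (is_harmful, transcript) after a fixed number of debate rounds."""
    # Seed the transcript with in-context examples so every agent sees them.
    transcript = list(in_context_examples)
    for _ in range(rounds):
        transcript.append("AFF: " + affirmative(item, transcript))
        transcript.append("NEG: " + negative(item, transcript))
    # The judge reads both sides of the argument before deciding.
    verdict = judge(item, transcript)
    return verdict.strip().lower() == "harmful", transcript


# Stub agents for illustration only (assumed behavior, not real models).
def aff(item: str, transcript: List[str]) -> str:
    return f"'{item}' depicts a weapon aimed at a person."


def neg(item: str, transcript: List[str]) -> str:
    return f"'{item}' could be a film still."


def stub_judge(item: str, transcript: List[str]) -> str:
    # Toy rule: call the item harmful if any argument mentions a weapon.
    return "harmful" if any("weapon" in t for t in transcript) else "harmless"


is_harmful, transcript = debate_label(
    "img_001", aff, neg, stub_judge,
    in_context_examples=["EXAMPLE: knife in kitchen -> harmless"],
    rounds=1,
)
```

Keeping the agents as plain callables means the loop can be exercised with stubs before any model is attached.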

Related papers

Leveraging Pre-Trained Visual Models for AI-Generated Video Detection [54.88903878778194]
The field of video generation has advanced beyond DeepFakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. We propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. Our method achieves high detection accuracy, above 90% on average, underscoring its effectiveness.
arXiv Detail & Related papers (2025-07-17T15:36:39Z)
Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models [7.916129615051081]
We introduce a dataset comprising over 34,000 synthetic images generated by diffusion models. The dataset includes 214 human-annotated images that serve as a gold-standard reference for validation.
arXiv Detail & Related papers (2025-06-25T07:06:29Z)
A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation [93.28532038721816]
Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. We propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples.
arXiv Detail & Related papers (2025-04-11T10:18:13Z)
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data. We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router [42.222681564769076]
We introduce HiddenGuard, a novel framework for fine-grained, safe generation in Large Language Models. HiddenGuard incorporates Prism, which operates alongside the LLM to enable real-time, token-level detection and redaction of harmful content. Our experiments demonstrate that HiddenGuard achieves over 90% in F1 score for detecting and redacting harmful content.
arXiv Detail & Related papers (2024-10-03T17:10:41Z)
Evidential Deep Partial Multi-View Classification With Discount Fusion [24.139495744683128]
We propose a novel framework called Evidential Deep Partial Multi-View Classification (EDP-MVC). We use K-means imputation to address missing views, creating a complete set of multi-view data. The potential conflicts and uncertainties within this imputed data can affect the reliability of downstream inferences.
arXiv Detail & Related papers (2024-08-23T14:50:49Z)
Regularized Contrastive Partial Multi-view Outlier Detection [76.77036536484114]
We propose a novel method named Regularized Contrastive Partial Multi-view Outlier Detection (RCPMOD). In this framework, we utilize contrastive learning to learn view-consistent information and distinguish outliers by the degree of consistency. Experimental results on four benchmark datasets demonstrate that our proposed approach could outperform state-of-the-art competitors.
arXiv Detail & Related papers (2024-08-02T14:34:27Z)
Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models [53.50543146583101]
Fine-tuning large language models on small datasets can enhance their performance on specific downstream tasks. Malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors. We propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data.
arXiv Detail & Related papers (2024-06-12T18:33:11Z)
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
HOD: A Benchmark Dataset for Harmful Object Detection [3.755082744150185]
We present a new benchmark dataset for harmful object detection. Our proposed dataset contains more than 10,000 images across 6 categories that might be harmful. We have conducted extensive experiments to evaluate the effectiveness of our proposed dataset.
arXiv Detail & Related papers (2023-10-08T15:00:38Z)
Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model. We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
Credible Remote Sensing Scene Classification Using Evidential Fusion on Aerial-Ground Dual-view Images [6.817740582240199]
Multi-view (multi-source, multi-modal, multi-perspective, etc.) data are being used more frequently in remote sensing tasks. The issue of data quality becomes more apparent, limiting the potential benefits of multi-view data. Deep learning is introduced to the task of aerial-ground dual-view remote sensing scene classification.
arXiv Detail & Related papers (2023-01-02T12:27:55Z)
Learning to Imagine: Diversify Memory for Incremental Learning using Unlabeled Data [69.30452751012568]
We develop a learnable feature generator to diversify exemplars by adaptively generating diverse counterparts of exemplars. We introduce semantic contrastive learning to enforce the generated samples to be semantically consistent with exemplars. Our method does not bring any extra inference cost and outperforms state-of-the-art methods on two benchmarks.
arXiv Detail & Related papers (2022-04-19T15:15:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.