From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation
- URL: http://arxiv.org/abs/2505.18685v1
- Date: Sat, 24 May 2025 13:04:23 GMT
- Title: From Generation to Detection: A Multimodal Multi-Task Dataset for Benchmarking Health Misinformation
- Authors: Zhihao Zhang, Yiran Zhang, Xiyue Zhou, Liting Huang, Imran Razzak, Preslav Nakov, Usman Naseem
- Abstract summary: We present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models.
- Score: 40.226443705818404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Infodemics and health misinformation have a significant negative impact on individuals and society, exacerbating confusion and increasing hesitancy in adopting recommended health measures. Recent advancements in generative AI, capable of producing realistic, human-like text and images, have significantly accelerated the spread and expanded the reach of health misinformation, resulting in an alarming surge in its dissemination. To combat infodemics, most existing work has focused on developing misinformation datasets from social media and fact-checking platforms, but has faced limitations in topical coverage, inclusion of AI generation, and accessibility of raw content. To address these issues, we present MM Health, a large-scale multimodal misinformation dataset in the health domain consisting of 34,746 news articles encompassing both textual and visual information. MM Health includes human-generated multimodal information (5,776 articles) and AI-generated multimodal information (28,880 articles) from various SOTA generative AI models. Additionally, we benchmarked our dataset on three tasks (reliability checks, originality checks, and fine-grained AI detection), demonstrating that existing SOTA models struggle to accurately distinguish the reliability and origin of information. Our dataset aims to support the development of misinformation detection across various health scenarios, facilitating the detection of human- and machine-generated content at multimodal levels.
Related papers
- Information Retrieval in the Age of Generative AI: The RGB Model [77.96475639967431]
This paper presents a novel quantitative approach to shed light on the complex information dynamics arising from the growing use of generative AI tools. We propose a model to characterize the generation, indexing, and dissemination of information in response to new topics. Our findings suggest that the rapid pace of generative AI adoption, combined with increasing user reliance, can outpace human verification, escalating the risk of inaccurate information proliferation.
arXiv Detail & Related papers (2025-04-29T10:21:40Z)
- A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI [70.06771291117965]
We introduce Biomedica, an open-source dataset derived from the PubMed Central Open Access subset. Biomedica contains over 6 million scientific articles and 24 million image-text pairs. We provide scalable streaming and search APIs through a web server, facilitating seamless integration with AI systems.
arXiv Detail & Related papers (2025-03-26T05:56:46Z)
- MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine [53.01393667775077]
This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine. It covers over 25 million images across 10 modalities with multigranular annotations for more than 65 diseases. Unlike the existing multimodal datasets, which are limited by the availability of image-text pairs, we have developed the first automated pipeline.
arXiv Detail & Related papers (2024-08-06T02:09:35Z)
- RU-AI: A Large Multimodal Dataset for Machine-Generated Content Detection [11.265512559447986]
We introduce RU-AI, a new large-scale multimodal dataset for robust and effective detection of machine-generated content in text, image and voice. Our dataset is constructed on the basis of three large publicly available datasets: Flickr8K, COCO and Places205. The results reveal that existing models still struggle to achieve accurate and robust detection on our dataset.
arXiv Detail & Related papers (2024-06-07T12:58:14Z)
- Detecting Multimedia Generated by Large AI Models: A Survey [25.97663040910416]
The aim of this survey is to fill an academic gap and contribute to global AI security efforts. We introduce a novel taxonomy for detection methods, categorized by media modality. We offer a focused analysis from a social media perspective to highlight their broader societal impact.
arXiv Detail & Related papers (2024-01-22T15:08:19Z)
- Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain [14.837495995122598]
Med-MMHL is a novel multi-modal misinformation detection dataset in a general medical domain encompassing multiple diseases.
Our dataset aims to facilitate comprehensive research and development of methodologies for detecting misinformation across diverse diseases and various scenarios.
arXiv Detail & Related papers (2023-06-15T05:59:11Z)
- DeepfakeArt Challenge: A Benchmark Dataset for Generative AI Art Forgery and Data Poisoning Detection [57.51313366337142]
There has been growing concern over the use of generative AI for malicious purposes.
In the realm of visual content synthesis using generative AI, key areas of significant concern have been image forgery and data poisoning.
We introduce the DeepfakeArt Challenge, a large-scale challenge benchmark dataset designed specifically to aid in the building of machine learning algorithms for generative AI art forgery and data poisoning detection.
arXiv Detail & Related papers (2023-06-02T05:11:27Z)
- BAND: Biomedical Alert News Dataset [34.277782189514134]
We introduce the Biomedical Alert News dataset (BAND), which includes 1,508 samples from existing reported news articles, open emails, and alerts, as well as 30 epidemiology-related questions.
The BAND dataset brings new challenges to the NLP world, requiring better disguise capability of the content and the ability to infer important information.
To the best of our knowledge, the BAND corpus is the largest corpus of well-annotated biomedical outbreak alert news with elaborately designed questions.
arXiv Detail & Related papers (2023-05-23T19:21:00Z)
- Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation [62.68385635551825]
Multimodal disinformation, from 'deepfakes' to simple edits that deceive, is an important societal problem. The difference between this example, and harmful edits that spread disinformation, is one of intent. Recognizing and describing this intent is a major challenge for today's AI systems.
arXiv Detail & Related papers (2020-12-08T20:30:43Z)
- Disinformation in the Online Information Ecosystem: Detection, Mitigation and Challenges [35.0667998623823]
A large fraction of the common public turn to social media platforms for news and even information regarding highly concerning issues such as COVID-19 symptoms.
There is a significant amount of ongoing research in the directions of disinformation detection and mitigation.
We discuss the online disinformation problem, focusing on the recent 'infodemic' in the wake of the coronavirus pandemic.
arXiv Detail & Related papers (2020-10-18T21:44:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.