Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset
- URL: http://arxiv.org/abs/2601.07985v1
- Date: Mon, 12 Jan 2026 20:33:46 GMT
- Title: Multilingual, Multimodal Pipeline for Creating Authentic and Structured Fact-Checked Claim Dataset
- Authors: Z. Melce Hüsünbeyi, Virginie Mouilleron, Leonie Uhling, Daniel Foppe, Tatjana Scheffler, Djamé Seddah
- Abstract summary: This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts.
- Score: 3.1256048031872425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid proliferation of misinformation across online platforms underscores the urgent need for robust, up-to-date, explainable, and multilingual fact-checking resources. However, existing datasets are limited in scope, often lacking multimodal evidence, structured annotations, and detailed links between claims, evidence, and verdicts. This paper introduces a comprehensive data collection and processing pipeline that constructs multimodal fact-checking datasets in French and German by aggregating ClaimReview feeds, scraping full debunking articles, normalizing heterogeneous claim verdicts, and enriching them with structured metadata and aligned visual content. We used state-of-the-art large language models (LLMs) and multimodal LLMs for (i) evidence extraction under predefined evidence categories and (ii) justification generation that links evidence to verdicts. Evaluation with G-Eval and human assessment demonstrates that our pipeline enables fine-grained comparison of fact-checking practices across different organizations or media markets, facilitates the development of more interpretable and evidence-grounded fact-checking models, and lays the groundwork for future research on multilingual, multimodal misinformation verification.
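The verdict-normalization step described in the abstract can be pictured as mapping the free-form rating strings found in schema.org ClaimReview records onto a small canonical label set. The sketch below is purely illustrative: the canonical labels, the mapping table, and the example record are assumptions for demonstration, not the paper's actual normalization scheme.

```python
# Illustrative sketch of verdict normalization for ClaimReview records.
# The label set and mapping table below are hypothetical examples,
# not the scheme used in the paper.

CANONICAL_LABELS = {"false", "misleading", "true", "unverifiable"}

# Example raw verdict strings as they might appear in French and German
# fact-checking feeds, mapped to hypothetical canonical labels.
VERDICT_MAP = {
    "faux": "false",
    "falsch": "false",
    "trompeur": "misleading",
    "irreführend": "misleading",
    "vrai": "true",
    "wahr": "true",
    "non vérifiable": "unverifiable",
}

def normalize_verdict(raw: str) -> str:
    """Lower-case and strip a raw verdict string, then map it to a
    canonical label; unknown verdicts fall back to 'unverifiable'."""
    key = raw.strip().lower()
    return VERDICT_MAP.get(key, "unverifiable")

# A minimal ClaimReview-shaped record (fields per schema.org/ClaimReview;
# the content is invented for this example).
claim_review = {
    "@type": "ClaimReview",
    "claimReviewed": "Example claim text",
    "reviewRating": {"alternateName": "Faux"},
}
label = normalize_verdict(claim_review["reviewRating"]["alternateName"])
```

In practice, a pipeline like the one described would need a much larger mapping (or a classifier) to cover the long tail of organization-specific verdict phrasings; the dictionary here only shows the shape of the operation.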
Related papers
- MultiCaption: Detecting disinformation using multilingual visual claims [10.69065586825833]
We present MultiCaption, a dataset specifically designed for detecting contradictions in visual claims. The resulting dataset comprises 11,088 visual claims in 64 languages. The gains from multilingual training and testing highlight the dataset's potential for building effective multilingual fact-checking pipelines.
arXiv Detail & Related papers (2026-01-16T11:57:07Z)
- Multimodal Fact-Checking: An Agent-based Approach [9.55806677152407]
We introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow.
arXiv Detail & Related papers (2025-12-28T13:58:33Z)
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- Docopilot: Improving Multimodal Models for Document-Level Understanding [87.60020625241178]
We present a high-quality document-level dataset, Doc-750K, designed to support in-depth understanding of multimodal documents. This dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop a native multimodal model, Docopilot, which can accurately handle document-level dependencies without relying on RAG.
arXiv Detail & Related papers (2025-07-19T16:03:34Z)
- Towards Explainable Bilingual Multimodal Misinformation Detection and Localization [64.37162720126194]
BiMi is a framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. BiMiBench is a benchmark constructed by systematically editing real news images and subtitles. BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore.
arXiv Detail & Related papers (2025-06-28T15:43:06Z)
- Unstructured Evidence Attribution for Long Context Query Focused Summarization [53.08341620504465]
We propose to extract unstructured (i.e., spans of any length) evidence in order to acquire more relevant and consistent evidence than in the fixed-granularity case. We show how existing systems struggle to copy and properly cite unstructured evidence, which also tends to be "lost in the middle".
arXiv Detail & Related papers (2025-02-20T09:57:42Z)
- Multimodal Misinformation Detection using Large Vision-Language Models [7.505532091249881]
Large language models (LLMs) have shown remarkable performance in various tasks.
Few approaches consider evidence retrieval as part of misinformation detection.
We propose a novel re-ranking approach for multimodal evidence retrieval.
arXiv Detail & Related papers (2024-07-19T13:57:11Z)
- Give Me More Details: Improving Fact-Checking with Latent Retrieval [58.706972228039604]
Evidence plays a crucial role in automated fact-checking.
Existing fact-checking systems either assume the evidence sentences are given or use the search snippets returned by the search engine.
We propose to incorporate full text from source documents as evidence and introduce two enriched datasets.
arXiv Detail & Related papers (2023-05-25T15:01:19Z)
- CHEF: A Pilot Chinese Dataset for Evidence-Based Fact-Checking [55.75590135151682]
CHEF is the first CHinese Evidence-based Fact-checking dataset of 10K real-world claims.
The dataset covers multiple domains, ranging from politics to public health, and provides annotated evidence retrieved from the Internet.
arXiv Detail & Related papers (2022-06-06T09:11:03Z)
- End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models [0.0]
We propose end-to-end multimodal fact-checking and explanation generation.
The goal is to assess the truthfulness of a claim by retrieving relevant evidence and predicting a truthfulness label.
To support this research, we construct Mocheg, a large-scale dataset consisting of 15,601 claims.
arXiv Detail & Related papers (2022-05-25T04:36:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.