Related papers: MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models

URL: http://arxiv.org/abs/2602.16298v1
Date: Wed, 18 Feb 2026 09:28:53 GMT
Title: MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models
Authors: Martin Hyben, Sebastian Kula, Jan Cegin, Jakub Simko, Ivan Srba, Robert Moro
Abstract summary: The Multi-Check-Worthy (MultiCW) dataset spans 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages.
Score: 6.382707047064603
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and generalize well out of distribution across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.

Related papers

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification [14.187153195380668]
Large language models have remarkable capabilities across many NLP tasks, but their effectiveness for multilingual claim verification with nuanced classification schemes remains understudied. We evaluate five state-of-the-art language models on the X-Fact dataset, which spans 25 languages with seven distinct veracity categories. Surprisingly, we find that XLM-R substantially outperforms all tested LLMs, achieving 57.7% macro-F1 compared to the best LLM performance of 16.9%.
arXiv Detail & Related papers (2025-07-28T10:49:04Z)
PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation [0.0]
We introduce PolyPrompt, a novel, parameter-efficient framework for enhancing the multilingual capabilities of large language models (LLMs). Our method learns a set of trigger tokens for each language through a gradient-based search, identifying the input query's language and selecting the corresponding trigger tokens which are prepended to the prompt during inference. We perform experiments on two 1 billion parameter models, with evaluations on the global MMLU benchmark across fifteen typologically and resource diverse languages, demonstrating accuracy gains of 3.7%-19.9% compared to naive and translation-pipeline baselines.
arXiv Detail & Related papers (2025-02-27T04:41:22Z)
MMTEB: Massive Multilingual Text Embedding Benchmark [85.18187649328792]
We introduce the Massive Multilingual Text Embedding Benchmark (MMTEB). MMTEB covers over 500 quality-controlled evaluation tasks across 250+ languages. We develop several highly multilingual benchmarks, which we use to evaluate a representative set of models.
arXiv Detail & Related papers (2025-02-19T10:13:43Z)
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. Languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset [13.041053110012246]
We introduce a statistical test, the Preference Proportion Test, for identifying such unreliable subsets. We find that filtering this low-quality data out when training models for the downstream task of phonetic transcription brings substantial benefits.
arXiv Detail & Related papers (2024-10-05T21:41:49Z)
On the Calibration of Multilingual Question Answering LLMs [57.296161186129545]
We benchmark the calibration of several multilingual Large Language Models (MLLMs) on a variety of Question Answering tasks. We study different dimensions of calibration in in-distribution, out-of-distribution, and cross-lingual transfer settings. For decoder-only LLMs such as LlaMa2, we additionally find that in-context learning improves confidence calibration on multilingual data.
arXiv Detail & Related papers (2023-11-15T03:29:02Z)
Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection [1.4779899760345434]
This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection. We composed a multilingual and multi-topical dataset comprising texts of various sources and styles.
arXiv Detail & Related papers (2023-11-10T15:36:35Z)
Self-Checker: Plug-and-Play Modules for Fact-Checking with Large Language Models [75.75038268227554]
Self-Checker is a framework comprising a set of plug-and-play modules that facilitate fact-checking. This framework provides a fast and efficient way to construct fact-checking systems in low-resource environments.
arXiv Detail & Related papers (2023-05-24T01:46:07Z)
Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models [55.35106713257871]
We introduce DecoMT, a novel approach of few-shot prompting that decomposes the translation process into a sequence of word chunk translations. We show that DecoMT outperforms the strong few-shot prompting BLOOM model with an average improvement of 8 chrF++ scores across the examined languages.
arXiv Detail & Related papers (2023-05-22T14:52:47Z)
Interpretable Unified Language Checking [42.816372695828306]
We present an interpretable, unified, language checking (UniLC) method for both human and machine-generated language. We find that LLMs can achieve high performance on a combination of fact-checking, stereotype detection, and hate speech detection tasks.
arXiv Detail & Related papers (2023-04-07T16:47:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.