UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking
- URL: http://arxiv.org/abs/2505.15063v1
- Date: Wed, 21 May 2025 03:31:44 GMT
- Title: UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking
- Authors: Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Numaan Naeem, Muhammad Ahsan Riaz Khan, Arham Riaz, Muhammad Arslan Manzoor, Yuxia Wang, Preslav Nakov
- Abstract summary: Existing automated fact-checking solutions overwhelmingly focus on English, leaving a significant gap for the 200+ million Urdu speakers worldwide. We introduce UrduFactCheck, the first comprehensive, modular fact-checking framework specifically tailored for Urdu. Our system features a dynamic, multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches.
- Score: 23.83465391929839
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid adoption of large language models (LLMs) has raised critical concerns regarding the factual reliability of their outputs, especially in low-resource languages such as Urdu. Existing automated fact-checking solutions overwhelmingly focus on English, leaving a significant gap for the 200+ million Urdu speakers worldwide. In this work, we introduce UrduFactCheck, the first comprehensive, modular fact-checking framework specifically tailored for Urdu. Our system features a dynamic, multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches to address the scarcity of high-quality Urdu evidence. We curate and release two new hand-annotated benchmarks: UrduFactBench for claim verification and UrduFactQA for evaluating LLM factuality. Extensive experiments demonstrate that UrduFactCheck, particularly its translation-augmented variants, consistently outperforms baselines and open-source alternatives on multiple metrics. We further benchmark twelve state-of-the-art (SOTA) LLMs on factual question answering in Urdu, highlighting persistent gaps between proprietary and open-source models. UrduFactCheck's code and datasets are open-sourced and publicly available at https://github.com/mbzuai-nlp/UrduFactCheck.
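A minimal sketch of the dual-path evidence retrieval idea described in the abstract: query the web in Urdu directly, and in parallel translate the claim to English, search, and translate the hits back. The helper functions are hypothetical stubs, not the UrduFactCheck API.

```python
# Sketch of a dual-path (monolingual + translation-based) retrieval step.
# `search_web` and `translate` are hypothetical stubs, not UrduFactCheck's API.
from typing import List

def search_web(query: str, lang: str) -> List[str]:
    """Stub: replace with a real search backend."""
    return []

def translate(text: str, src: str, tgt: str) -> str:
    """Stub: replace with a real MT system."""
    return text

def retrieve_evidence(claim_ur: str, top_k: int = 5) -> List[str]:
    # Path 1: monolingual retrieval, querying directly in Urdu.
    urdu_hits = search_web(claim_ur, lang="ur")[:top_k]
    # Path 2: translation-based retrieval via English, then back-translation.
    claim_en = translate(claim_ur, src="ur", tgt="en")
    english_hits = search_web(claim_en, lang="en")[:top_k]
    back_translated = [translate(h, src="en", tgt="ur") for h in english_hits]
    # A real system would deduplicate and rerank the merged pool
    # by relevance to the claim before verification.
    return urdu_hits + back_translated
```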
Related papers
- UrBLiMP: A Benchmark for Evaluating the Linguistic Competence of Large Language Models in Urdu [12.952822154200497]
We present the Urdu Benchmark of Linguistic Minimal Pairs (UrBLiMP). UrBLiMP comprises 5,696 minimal pairs targeting ten core syntactic phenomena. A human evaluation of UrBLiMP annotations yielded a 96.10% inter-annotator agreement. A standard way to score such pairs is sketched after this entry.
arXiv Detail & Related papers (2025-08-01T18:16:37Z)
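Minimal-pair benchmarks of this kind are usually scored by checking whether the model assigns higher probability to the grammatical member of each pair. A minimal sketch with Hugging Face transformers; the model and the English example pair are placeholders, not UrBLiMP data.

```python
# Score a minimal pair by comparing summed token log-probabilities.
# Model and example pair are placeholders, not from UrBLiMP itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # `loss` is the mean negative log-likelihood per predicted token;
    # multiply by the number of predictions to recover the sum.
    return -out.loss.item() * (ids.shape[1] - 1)

grammatical, ungrammatical = "The dogs bark.", "The dogs barks."
print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))
```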
- Unified Large Language Models for Misinformation Detection in Low-Resource Linguistic Settings [1.5811829698567754]
There is a noticeable gap in resources and strategies to detect fake news in regional languages such as Urdu. Current Urdu fake news datasets are often domain-specific and inaccessible to the public. This paper presents the first large benchmark fake news detection (FND) dataset for Urdu news, which is publicly available for validation and deep analysis.
arXiv Detail & Related papers (2025-06-02T12:19:28Z)
- Poly-FEVER: A Multilingual Fact Verification Benchmark for Hallucination Detection in Large Language Models [10.663446796160567]
Hallucinations in generative AI, particularly in Large Language Models (LLMs), pose a significant challenge to the reliability of multilingual applications. Existing benchmarks for hallucination detection focus primarily on English and a few widely spoken languages. We introduce Poly-FEVER, a large-scale multilingual fact verification benchmark.
arXiv Detail & Related papers (2025-03-19T01:46:09Z)
- UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings [0.7874708385247353]
This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture. We leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs (a LoRA setup sketch follows this entry).
arXiv Detail & Related papers (2025-02-24T08:38:21Z)
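A LoRA fine-tune of this kind is typically configured with the peft library; a minimal sketch, where the rank, target modules, and dropout are illustrative defaults rather than UrduLLaMA's actual settings.

```python
# Sketch of a LoRA setup with the `peft` library; hyperparameters are
# illustrative defaults, not UrduLLaMA's settings. The base checkpoint
# is gated and requires Hugging Face access approval.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
    r=16,                                 # low-rank dimension
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the adapter weights train
```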
- INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [25.402797722575805]
Indic QA Benchmark is a dataset for context-grounded question answering in 11 major Indian languages. Evaluations revealed weak performance in low-resource languages due to a strong English-language bias in the models' training data. We also investigated the translate-test paradigm, where inputs are translated to English for processing and the results are translated back into the source language for output (sketched after this entry).
arXiv Detail & Related papers (2024-07-18T13:57:16Z)
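A minimal sketch of the translate-test loop described in the entry above; `translate` and `ask_llm` are hypothetical stubs for an MT system and an LLM call.

```python
# Sketch of the translate-test paradigm: translate the input to English,
# run the model, translate the answer back to the source language.

def translate(text: str, src: str, tgt: str) -> str:
    """Stub: replace with a real MT system."""
    return text

def ask_llm(prompt: str) -> str:
    """Stub: replace with a real LLM call."""
    return ""

def translate_test(question: str, src_lang: str = "ur") -> str:
    question_en = translate(question, src=src_lang, tgt="en")  # 1. to English
    answer_en = ask_llm(question_en)                           # 2. process in English
    return translate(answer_en, src="en", tgt=src_lang)        # 3. back to source
```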
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion, and that it can be partially mitigated via few-shot prompting, multilingual SFT, and preference tuning (a prompting sketch follows this entry).
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
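Few-shot mitigation of language confusion can be as simple as prepending demonstrations that answer in the target language; a sketch with illustrative Urdu examples, not prompts from the paper.

```python
# Sketch of few-shot prompting against language confusion: in-context
# examples demonstrate answering in the same language as the question.
# The demonstrations below are illustrative, not from the paper.
FEW_SHOT = """\
Q (Urdu): پاکستان کا دارالحکومت کیا ہے؟
A (Urdu): پاکستان کا دارالحکومت اسلام آباد ہے۔

Q (Urdu): دنیا کا سب سے بڑا سمندر کون سا ہے؟
A (Urdu): دنیا کا سب سے بڑا سمندر بحرالکاہل ہے۔

"""

def build_prompt(question_ur: str) -> str:
    # The demonstrations anchor the response language to Urdu.
    return FEW_SHOT + f"Q (Urdu): {question_ur}\nA (Urdu):"
```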
- Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [15.619421104102516]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese.
We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems.
We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on the ScienceQA and A-OKVQA benchmarks show the effectiveness of the proposed approach (a two-stage sketch follows this entry).
arXiv Detail & Related papers (2023-02-02T07:51:19Z)
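The two stages of Multimodal-CoT are rationale generation followed by answer inference conditioned on that rationale; a minimal sketch with hypothetical encoder/generator stubs, not the paper's implementation.

```python
# Sketch of a two-stage multimodal CoT flow: stage 1 generates a rationale
# from text + image features, stage 2 infers the answer conditioned on it.
# `vision_encode` and `generate` are hypothetical stand-ins.

def vision_encode(image_path: str) -> list[float]:
    """Stub: replace with a real vision encoder (e.g., ViT features)."""
    return []

def generate(text: str, image_feats: list[float]) -> str:
    """Stub: replace with a real multimodal seq2seq model."""
    return ""

def multimodal_cot(question: str, image_path: str) -> str:
    feats = vision_encode(image_path)
    rationale = generate(f"{question}\nRationale:", feats)         # stage 1
    answer = generate(f"{question}\n{rationale}\nAnswer:", feats)  # stage 2
    return answer
```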
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- An Extensible Plug-and-Play Method for Multi-Aspect Controllable Text Generation [70.77243918587321]
Multi-aspect controllable text generation, which steers several attributes of the generated text at once, has attracted increasing attention.
We provide a theoretical lower bound for the interference between prefixes and empirically find that it grows with the number of layers into which prefixes are inserted.
We propose trainable gates that normalize the intervention of the prefixes and restrain this growing interference (a gating sketch follows this entry).
arXiv Detail & Related papers (2022-12-19T11:53:59Z)
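A minimal sketch of a trainable gate on a prefix at a single layer, assuming a sigmoid-squashed scalar gate that scales the prefix before it joins the layer's keys/values; shapes and wiring are illustrative, not the paper's exact formulation.

```python
# Sketch of a gated prefix: a learnable scalar, squashed to (0, 1),
# scales the prefix contribution and limits how strongly it intervenes.
import torch
import torch.nn as nn

class GatedPrefix(nn.Module):
    def __init__(self, prefix_len: int, hidden_dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden_dim) * 0.02)
        self.gate = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self) -> torch.Tensor:
        # Scale the whole prefix by the learned gate before it is
        # prepended to the layer's keys/values.
        return torch.sigmoid(self.gate) * self.prefix

gated = GatedPrefix(prefix_len=10, hidden_dim=768)
print(gated().shape)  # torch.Size([10, 768])
```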
- X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages (an example probe is sketched after this entry).
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
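A cloze-style factual probe can be run directly with a multilingual masked LM; in this sketch the model and the probe sentence are illustrative, not X-FACTR items.

```python
# Sketch of a cloze-style factual probe with a multilingual masked LM.
# The model and the example probe are illustrative, not X-FACTR data.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
probe = "Paris is the capital of [MASK]."
for cand in fill(probe, top_k=3):
    print(cand["token_str"], round(cand["score"], 3))
```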