CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation
- URL: http://arxiv.org/abs/2508.07295v1
- Date: Sun, 10 Aug 2025 11:09:41 GMT
- Title: CCFQA: A Benchmark for Cross-Lingual and Cross-Modal Speech and Text Factuality Evaluation
- Authors: Yexing Du, Kaiyuan Liu, Youcheng Pan, Zheng Chu, Bo Yang, Xiaocheng Feng, Yang Xiang, Ming Liu,
- Abstract summary: The CCFQA benchmark contains parallel speech-text factual questions across 8 languages. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. We propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks.
- Score: 26.054199546779696
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: As Large Language Models (LLMs) are increasingly adopted across the multilingual world, ensuring hallucination-free factuality becomes critically important. However, existing benchmarks for evaluating the reliability of Multimodal Large Language Models (MLLMs) predominantly focus on textual or visual modalities with a primary emphasis on English, which creates a gap in evaluation when processing multilingual input, especially speech. To bridge this gap, we propose a novel Cross-lingual and Cross-modal Factuality benchmark (CCFQA). Specifically, the CCFQA benchmark contains parallel speech-text factual questions across 8 languages, designed to systematically evaluate MLLMs' cross-lingual and cross-modal factuality capabilities. Our experimental results demonstrate that current MLLMs still face substantial challenges on the CCFQA benchmark. Furthermore, we propose a few-shot transfer learning strategy that effectively transfers the Question Answering (QA) capabilities of LLMs in English to multilingual Spoken Question Answering (SQA) tasks, achieving performance competitive with GPT-4o-mini-Audio using just 5-shot training. We release CCFQA as a foundational research resource to promote the development of MLLMs with more robust and reliable speech understanding capabilities. Our code and dataset are available at https://github.com/yxduir/ccfqa.
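As an illustration of how such a parallel speech-text benchmark might be consumed, below is a minimal evaluation-harness sketch. The JSONL layout, the field names (`question`, `answer`, `lang`), and the exact-match metric are assumptions made for this example, not the paper's method; the official loader and evaluation code are in the repository linked above.

```python
# Hypothetical harness for a CCFQA-style parallel speech-text benchmark.
# The JSONL schema and field names are assumptions for illustration;
# see https://github.com/yxduir/ccfqa for the actual dataset and code.
import json
from collections import defaultdict

def load_examples(path):
    """Read a JSONL file of parallel speech-text factual questions."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def exact_match(prediction, reference):
    """Case- and whitespace-insensitive exact match (assumed metric)."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(model_fn, examples, modality="text"):
    """Per-language accuracy; `model_fn(example, modality)` wraps any
    speech-capable MLLM and returns its answer as a string."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        pred = model_fn(ex, modality)
        hits[ex["lang"]] += int(exact_match(pred, ex["answer"]))
        totals[ex["lang"]] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}
```

Running `evaluate(...)` once with `modality="text"` and once with `modality="speech"` yields the per-language accuracies needed to measure the cross-lingual and cross-modal gaps the benchmark is designed to expose.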
Related papers
- JobResQA: A Benchmark for LLM Machine Reading Comprehension on Multilingual Résumés and JDs [3.83467384247581]
JobResQA is a benchmark for evaluating Machine Reading Comprehension (MRC) capabilities on HR-specific tasks. The dataset comprises 581 QA pairs across 105 résumé-job description pairs in five languages.
arXiv Detail & Related papers (2026-01-30T17:06:59Z) - Multimodal In-context Learning for ASR of Low-resource Languages [16.078416187950207]
In-context learning (ICL) with large language models (LLMs) addresses the problem of low-resource ASR. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL). Cross-lingual transfer learning improves MICL efficiency on target languages without training on them.
arXiv Detail & Related papers (2026-01-09T10:52:23Z) - MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks [25.75895667904485]
We introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks. MCIF spans three core modalities -- speech, vision, and text -- and four diverse languages (English, German, Italian, and Chinese). It enables a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information.
arXiv Detail & Related papers (2025-07-25T19:00:51Z) - mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks [11.996399504336624]
We introduce mSTEB, a new benchmark to evaluate the performance of large language models (LLMs) on a wide range of speech and text tasks. We evaluate leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B.
arXiv Detail & Related papers (2025-06-10T03:15:08Z) - SpokenNativQA: Multilingual Everyday Spoken Queries for LLMs [12.60449414234283]
SpokenNativQA is the first multilingual and culturally aligned spoken question-answering dataset. The dataset comprises approximately 33,000 naturally spoken questions and answers in multiple languages.
arXiv Detail & Related papers (2025-05-25T14:22:18Z) - Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z) - Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From [61.63091726904068]
We evaluate the cross-lingual context retrieval ability of over 40 large language models (LLMs) across 12 languages. Several small, post-trained open LLMs show strong cross-lingual context retrieval ability. Our results also indicate that larger-scale pretraining cannot improve cross-lingual machine reading comprehension (xMRC) performance.
arXiv Detail & Related papers (2025-04-15T06:35:27Z) - On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation [12.848952248427977]
Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering tasks. In multilingual RAG, retrieved passages can be written in languages other than that of the query entered by the user.
arXiv Detail & Related papers (2025-04-01T09:55:23Z) - MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on the English MMLU-Pro benchmark. Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons. To support efficient evaluation, a lite version containing 658 questions per language is also provided.
arXiv Detail & Related papers (2025-03-13T15:59:20Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. P-MMEval delivers consistent language coverage across the various datasets and provides parallel samples. We conduct extensive experiments on representative multilingual model series to compare performance across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and African American Vernacular English (AAVE). We evaluate widely used models, including the GPT, Claude, Llama, Mistral, and Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z) - MELA: Multilingual Evaluation of Linguistic Acceptability [7.524375463656369]
We present the largest benchmark to date on linguistic acceptability: Multilingual Evaluation of Linguistic Acceptability -- MELA, with 46K samples covering 10 languages.
In pursuit of multilingual interpretability, we conduct probing experiments with fine-tuned XLM-R.
Cross-lingual transfer experiments show that transfer in acceptability judgment is non-trivial.
arXiv Detail & Related papers (2023-11-15T15:25:28Z) - Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel, large-scale multilingual conversation dataset for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)