XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering
- URL: http://arxiv.org/abs/2508.16139v1
- Date: Fri, 22 Aug 2025 07:00:13 GMT
- Title: XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering
- Authors: Keon-Woo Roh, Yeong-Joon Ju, Seong-Whan Lee
- Abstract summary: Large Language Models (LLMs) have shown significant progress in open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. We introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA.
- Score: 48.913480244527925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown significant progress in open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answers, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.
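As a hedged illustration of the locale-sensitive evaluation idea, scoring can compare a model's answer against the gold answers for the question's locale rather than a single language-invariant answer. The function name, normalization, and example gold answers below are assumptions for the sketch, not XLQA's actual protocol:

```python
def locale_aware_exact_match(prediction: str, gold_by_locale: dict, locale: str) -> bool:
    """Check a prediction against the gold answer set for one locale.

    A locale-sensitive question may have different correct answers per
    locale; a locale-invariant question would list the same answers for
    every locale. Normalization here is simple lowercasing and whitespace
    collapse (an assumption, not the benchmark's exact scheme).
    """
    def normalize(s: str) -> str:
        return " ".join(s.lower().split())

    golds = gold_by_locale.get(locale, [])
    return normalize(prediction) in {normalize(a) for a in golds}


# Hypothetical locale-sensitive question: "What is the major autumn harvest holiday?"
golds = {
    "en-US": ["Thanksgiving"],
    "ko-KR": ["추석", "Chuseok"],
}
print(locale_aware_exact_match("chuseok", golds, "ko-KR"))   # True
print(locale_aware_exact_match("Chuseok", golds, "en-US"))   # False: wrong locale
```

The point of the sketch is that a single English gold answer would mark the Korean response wrong, which is exactly the bias locale-aware annotation is meant to remove.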
Related papers
- MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages [33.450081592217074]
We introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities. We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage.
arXiv Detail & Related papers (2025-06-24T09:53:00Z)
- High-Dimensional Interlingual Representations of Large Language Models [65.77317753001954]
Large language models (LLMs) trained on massive multilingual datasets hint at the formation of interlingual constructs. We explore 31 diverse languages varying in their resource levels, typologies, and geographical regions. We find that multilingual LLMs exhibit inconsistent cross-lingual alignments.
arXiv Detail & Related papers (2025-03-14T10:39:27Z)
- XIFBench: Evaluating Large Language Models on Multilingual Instruction Following [34.21958956053967]
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. XIFBench is a constraint-based benchmark for assessing the multilingual instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-03-10T17:07:52Z)
- CALM: Unleashing the Cross-Lingual Self-Aligning Ability of Language Model Question Answering [42.92810049636768]
Large Language Models (LLMs) are pretrained on extensive multilingual corpora to acquire both language-specific cultural knowledge and general knowledge. We explore the Cross-Lingual Self-Aligning ability of Language Models (CALM), employing direct preference optimization (DPO) to align the model's knowledge across different languages.
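The DPO objective the abstract mentions can be sketched per preference pair. This is a minimal, generic DPO loss; the variable names and the beta value are illustrative and not taken from the CALM paper:

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) answer pair.

    margin = (log pi(y_w|x) - log pi_ref(y_w|x))
           - (log pi(y_l|x) - log pi_ref(y_l|x))
    loss   = -log sigmoid(beta * margin)

    In a cross-lingual alignment setting, the chosen/rejected answers
    could come from different languages for the same question (an
    assumption about how such training data might be paired).
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))


# When the policy has not moved from the reference, margin = 0 and
# loss = -log(0.5) = ln 2; increasing the chosen log-prob lowers the loss.
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))  # ≈ 0.6931
```

Training then minimizes this loss over many pairs, pushing the policy to prefer the chosen answers relative to the frozen reference model.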
arXiv Detail & Related papers (2025-01-30T16:15:38Z)
- INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge [36.234295907476515]
The development of functional large language models (LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts.
arXiv Detail & Related papers (2024-11-29T16:03:14Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work lays the foundation for furthering the field of dialectal NLP by documenting evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z)
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)