CaLMQA: Exploring culturally specific long-form question answering across 23 languages
- URL: http://arxiv.org/abs/2406.17761v3
- Date: Wed, 11 Jun 2025 16:56:58 GMT
- Title: CaLMQA: Exploring culturally specific long-form question answering across 23 languages
- Authors: Shane Arora, Marzena Karpinska, Hung-Ting Chen, Ipsita Bhattacharjee, Mohit Iyyer, Eunsol Choi
- Abstract summary: CaLMQA is a dataset of 51.7K culturally specific questions across 23 different languages. We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers.
- Score: 58.18984409715615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rising global usage of large language models (LLMs), their ability to generate long-form answers to culturally specific questions remains unexplored in many languages. To fill this gap, we perform the first study of textual multilingual long-form QA by creating CaLMQA, a dataset of 51.7K culturally specific questions across 23 different languages. We define culturally specific questions as those that refer to concepts unique to one or a few cultures, or have different answers depending on the cultural or regional context. We obtain these questions by crawling naturally-occurring questions from community web forums in high-resource languages, and by hiring native speakers to write questions in under-resourced, rarely-studied languages such as Fijian and Kirundi. Our data collection methodologies are translation-free, enabling the collection of culturally unique questions like "Kuber iki umwami wa mbere w'uburundi yitwa Ntare?" (Kirundi; English translation: "Why was the first king of Burundi called Ntare (Lion)?"). We evaluate factuality, relevance and surface-level quality of LLM-generated long-form answers, finding that (1) for many languages, even the best models make critical surface-level errors (e.g., answering in the wrong language, repetition), especially for low-resource languages; and (2) answers to culturally specific questions contain more factual errors than answers to culturally agnostic questions -- questions that have consistent meaning and answer across many cultures. We release CaLMQA to facilitate future research in cultural and multilingual long-form QA.
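One of the surface-level errors noted in the abstract is models answering in the wrong language. As a rough illustration of how such an error might be flagged automatically, here is a minimal sketch using the open-source langdetect library. This is our own assumption, not the paper's evaluation tooling (which the abstract does not specify), and off-the-shelf detectors typically do not cover low-resource languages such as Fijian or Kirundi, so a check like this is only reliable for higher-resource languages.

```python
# Minimal sketch: flag LLM answers written in a language other than the question's.
# Assumes the open-source `langdetect` package; NOT the paper's evaluation pipeline.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make language detection deterministic


def wrong_language(answer: str, expected_lang: str) -> bool:
    """Return True if `answer` appears to be in a language other than `expected_lang`
    (ISO 639-1 code, e.g. 'de' for German, 'ja' for Japanese)."""
    try:
        return detect(answer) != expected_lang
    except Exception:
        # Very short or non-linguistic strings cannot be detected reliably.
        return False


# Hypothetical usage: a German question answered in English should be flagged.
print(wrong_language("The first king of Burundi was called Ntare.", "de"))  # True
```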
Related papers
- MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs [56.87573414161703]
We introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark to assess Large Language Models (LLMs).
MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance.
For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions by manual translation from native speakers fluent in English.
arXiv Detail & Related papers (2025-07-23T12:56:31Z)
- CALM: Unleashing the Cross-Lingual Self-Aligning Ability of Language Model Question Answering [42.92810049636768]
Large Language Models (LLMs) are pretrained on extensive multilingual corpora to acquire both language-specific cultural knowledge and general knowledge.
We explore the Cross-Lingual Self-Aligning ability of Language Models (CALM) to align knowledge across languages.
We employ direct preference optimization (DPO) to align the model's knowledge across different languages (a sketch of the standard DPO objective is given after this entry).
arXiv Detail & Related papers (2025-01-30T16:15:38Z)
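For readers unfamiliar with the technique named in the CALM summary above: direct preference optimization (Rafailov et al., 2023) fine-tunes a policy directly on preference pairs without training a separate reward model. The standard objective is sketched below; how CALM constructs its cross-lingual preference pairs is not specified in this summary.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

Here $y_w$ and $y_l$ are the preferred and dispreferred answers to prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference model, $\sigma$ is the logistic function, and $\beta$ controls how far the policy $\pi_\theta$ may drift from the reference.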
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English.
We introduce World Cuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding.
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z)
- OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context [4.39796591456426]
We present OMoS-QA, a dataset of German and English questions paired with trustworthy documents and manually annotated answers.
Questions are automatically generated with an open-source large language model (LLM) and answer sentences are selected by crowd workers.
We find high precision and low-to-mid recall in selecting answer sentences, which is a favorable trade-off to avoid misleading users (see the note after this entry).
arXiv Detail & Related papers (2024-07-22T15:40:17Z)
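For context on the trade-off reported in the OMoS-QA summary above (this gloss is ours, not the paper's): in sentence selection,

$$
\mathrm{precision} = \frac{\text{correctly selected answer sentences}}{\text{all selected sentences}}, \qquad \mathrm{recall} = \frac{\text{correctly selected answer sentences}}{\text{all true answer sentences}},
$$

so high precision with low-to-mid recall means users are rarely shown irrelevant or misleading sentences, at the cost of sometimes missing relevant ones.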
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), for extractive question answering (EQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice formats.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- ELQA: A Corpus of Metalinguistic Questions and Answers about English [24.006858451437534]
Collected from two online forums, the >70k questions cover wide-ranging topics including grammar, meaning, fluency, and etymology.
Unlike most NLP datasets, this corpus is metalinguistic -- it consists of language about language.
arXiv Detail & Related papers (2022-05-01T04:29:50Z)
- Cross-Lingual GenQA: A Language-Agnostic Generative Question Answering Approach for Open-Domain Question Answering [76.99585451345702]
Open-Retrieval Generative Question Answering (GenQA) has been shown to deliver high-quality, natural-sounding answers in English.
We present the first generalization of the GenQA approach for the multilingual environment.
arXiv Detail & Related papers (2021-10-14T04:36:29Z)
- XOR QA: Cross-lingual Open-Retrieval Question Answering [75.20578121267411]
This work extends open-retrieval question answering to a cross-lingual setting.
We construct a large-scale dataset built on questions lacking same-language answers.
arXiv Detail & Related papers (2020-10-22T16:47:17Z)
- TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages [27.588857710802113]
TyDi QA is a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs.
We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena.
arXiv Detail & Related papers (2020-03-10T21:11:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.