BEnQA: A Question Answering and Reasoning Benchmark for Bengali and English
- URL: http://arxiv.org/abs/2403.10900v1
- Date: Sat, 16 Mar 2024 11:27:42 GMT
- Title: BEnQA: A Question Answering and Reasoning Benchmark for Bengali and English
- Authors: Sheikh Shafayat, H M Quamran Hasan, Minhajur Rahman Chowdhury Mahim, Rifki Afina Putri, James Thorne, Alice Oh
- Abstract summary: We introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh.
Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions.
We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English.
- Score: 18.217122567176585
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this study, we introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh. Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions. We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English. We also investigate some prompting methods, and find that Chain-of-Thought prompting is beneficial mostly on reasoning questions, but not so much on factual ones. We also find that appending English translation helps to answer questions in Bengali. Our findings point to promising future research directions for improving the performance of LLMs in Bengali and more generally in low-resource languages.
Related papers
- ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects [4.2155105586549535]
We present ParamBench, consisting of more than 17K questions in the Hindi language, comprising questionnaires from 21 diverse subjects. These questions are primarily derived from a nationwide graduate-level entrance examination covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc. We evaluate the performance of more than 16 open-source LLMs on this benchmark, observing that Gemma3-27B attains the highest overall accuracy of 56.4%.
arXiv Detail & Related papers (2025-08-22T07:59:37Z) - BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge [11.447710593895831]
BLUCK is a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs). We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3.
arXiv Detail & Related papers (2025-05-27T12:19:12Z) - BnMMLU: Measuring Massive Multitask Language Understanding in Bengali [0.0]
We introduce BnMMLU, a benchmark to evaluate the Bengali language understanding capabilities of language models. The dataset spans 23 domains, including science, humanities, mathematics, and general knowledge. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set.
arXiv Detail & Related papers (2025-05-25T02:54:31Z) - Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models [0.0]
We provide a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The over 5K questions in our collection span a variety of scientific topics and include factual, application, and reasoning-based questions. We use our parallel dataset to test a number of Large Language Models (LLMs) and find a significant performance difference between the models in English and Dzongkha.
arXiv Detail & Related papers (2025-05-24T11:01:05Z) - BanglaQuAD: A Bengali Open-domain Question Answering Dataset [6.228978072962629]
Bengali is the seventh most spoken language on earth, yet it is considered a low-resource language in the field of natural language processing (NLP).
This paper introduces BanglaQuAD, a Bengali question answering dataset, containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers.
arXiv Detail & Related papers (2024-10-14T07:39:59Z) - How to Engage Your Readers? Generating Guiding Questions to Promote Active Reading [60.19226384241482]
We introduce GuidingQ, a dataset of 10K in-text questions from textbooks and scientific articles.
We explore various approaches to generate such questions using language models.
We conduct a human study to understand the implication of such questions on reading comprehension.
arXiv Detail & Related papers (2024-07-19T13:42:56Z) - Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs [2.309018557701645]
We aim to explore the question of whether there is a need for English-oriented Large Language Models dedicated to a low-resource language.
We compare the performance of open-weight and closed-source LLMs against fine-tuned encoder-decoder models.
Our findings reveal that while LLMs generally excel in reasoning tasks, their performance in tasks requiring Bengali script generation is inconsistent.
arXiv Detail & Related papers (2024-06-29T11:50:16Z) - CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a collection of 1.5K culturally specific questions spanning 23 languages and 51 culturally translated questions from English into 22 other languages.
We collect naturally-occurring questions from community web forums and hire native speakers to write questions to cover under-studied languages such as Fijian and Kirundi.
Our dataset contains diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z) - Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
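The core repurposing idea, converting a multiple-choice item into an extractive-QA item, can be sketched as follows. This is a hedged illustration of the general MCQA-to-EQA conversion, not the paper's actual pipeline; the field names are assumptions and do not match Belebele's real schema.

```python
# Sketch: convert an MCQA item into extractive QA by locating the gold
# option as a character span in the passage, as EQA formats require.
# Field names are illustrative, not Belebele's actual schema.

def mcqa_to_eqa(item):
    """Return an extractive-QA item, or None when the gold option is not
    a verbatim span of the passage (such items need manual annotation)."""
    answer = item["options"][item["answer_idx"]]
    start = item["passage"].find(answer)
    if start == -1:
        return None
    return {"question": item["question"],
            "context": item["passage"],
            "answers": {"text": [answer], "answer_start": [start]}}

sample = {"passage": "The Nile flows north into the Mediterranean Sea.",
          "question": "Into which sea does the Nile flow?",
          "options": ["the Red Sea", "the Mediterranean Sea"],
          "answer_idx": 1}
print(mcqa_to_eqa(sample)["answers"])
```

Items whose gold option does not occur verbatim in the passage are the cases where annotation guidelines, like those the paper presents, become necessary.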
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - Which questions should I answer? Salience Prediction of Inquisitive Questions [118.097974193544]
We show that highly salient questions are empirically more likely to be answered in the same article.
We further validate our findings by showing that answering salient questions is an indicator of summarization quality in news.
arXiv Detail & Related papers (2024-04-16T21:33:05Z) - Loose LIPS Sink Ships: Asking Questions in Battleship with Language-Informed Program Sampling [80.64715784334936]
We study tradeoffs in a classic grounded question-asking task based on the board game Battleship.
Our model uses large language models (LLMs) to generate natural language questions, translate them into symbolic programs, and evaluate their expected information gain.
We find that with a surprisingly modest resource budget, this simple Monte Carlo optimization strategy yields informative questions that mirror human performance.
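The expected-information-gain scoring at the heart of this approach can be illustrated on a toy hypothesis space. This is a minimal sketch of generic EIG over a uniform prior, with the Battleship board and question programs simplified to predicates over worlds; it is not the paper's implementation.

```python
# Sketch: expected information gain (EIG) of a yes/no question, i.e. prior
# entropy minus the expected posterior entropy after hearing the answer.
# The Battleship setup is reduced to a toy uniform hypothesis space.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def expected_information_gain(hypotheses, question):
    """`hypotheses` is a list of equally likely worlds; `question` maps a
    world to the answer it would produce (e.g. True/False)."""
    n = len(hypotheses)
    prior = entropy([1 / n] * n)
    # Partition the worlds by the answer the question would receive.
    groups = {}
    for h in hypotheses:
        groups.setdefault(question(h), []).append(h)
    posterior = sum(len(g) / n * entropy([1 / len(g)] * len(g))
                    for g in groups.values())
    return prior - posterior

# Toy example: a ship occupies one of four cells; asking "is it in the
# left half?" splits the worlds evenly and gains exactly one bit.
print(expected_information_gain([0, 1, 2, 3], lambda cell: cell < 2))  # 1.0
```

Sampling candidate questions with an LLM and ranking them by this quantity is, in spirit, the Monte Carlo optimization the abstract describes.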
arXiv Detail & Related papers (2024-02-29T18:58:15Z) - Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training.
In this paper we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
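The question-alignment stage amounts to formatting X-English parallel questions as translation training pairs before multilingual finetuning. The sketch below shows one plausible pair format; the instruction template is an assumption, not the paper's exact one.

```python
# Sketch: format one X -> English question-translation pair for the
# question-alignment finetuning stage. Template wording is illustrative.

def alignment_example(question_x: str, question_en: str, lang: str) -> dict:
    """Build a single supervised pair: translate a reasoning question
    from language X into English."""
    return {
        "input": f"Translate the following {lang} question into English:\n{question_x}",
        "target": question_en,
    }

pair = alignment_example("একটি ট্রেন ৬০ কিমি/ঘণ্টা বেগে চলে...",
                         "A train travels at 60 km/h...",
                         "Bengali")
print(pair["input"])
```

Training on such pairs first, and on English reasoning data afterwards, is what distinguishes question alignment from plain translate-training.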
arXiv Detail & Related papers (2024-01-15T16:39:10Z) - BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal
Reference Annotations [0.0]
We introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains.
This relatively small dataset contains 5200 mention annotations forming 502 mention clusters within 48,569 tokens.
arXiv Detail & Related papers (2023-04-07T15:08:46Z) - ELQA: A Corpus of Metalinguistic Questions and Answers about English [24.006858451437534]
Collected from two online forums, the >70k questions cover wide-ranging topics including grammar, meaning, fluency, and etymology.
Unlike most NLP datasets, this corpus is metalinguistic -- it consists of language about language.
arXiv Detail & Related papers (2022-05-01T04:29:50Z) - Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New
Datasets for Bengali-English Machine Translation [6.2418269277908065]
Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in the machine translation literature owing to its scarcity of resources.
We build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups.
With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising 2.75 million sentence pairs.
arXiv Detail & Related papers (2020-09-20T06:06:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.