BanglaQuAD: A Bengali Open-domain Question Answering Dataset
- URL: http://arxiv.org/abs/2410.10229v1
- Date: Mon, 14 Oct 2024 07:39:59 GMT
- Title: BanglaQuAD: A Bengali Open-domain Question Answering Dataset
- Authors: Md Rashad Al Hasan Rony, Sudipto Kumar Shaha, Rakib Al Hasan, Sumon Kanti Dey, Amzad Hossain Rafi, Amzad Hossain Rafi, Ashraf Hasan Sirajee, Jens Lehmann,
- Abstract summary: Bengali is the seventh most spoken language on earth, yet considered a low-resource language in the field of natural language processing (NLP)
This paper introduces BanglaQuAD, a Bengali question answering dataset, containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers.
- Score: 6.228978072962629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bengali is the seventh most spoken language on earth, yet considered a low-resource language in the field of natural language processing (NLP). Question answering over unstructured text is a challenging NLP task as it requires understanding both question and passage. Very few researchers attempted to perform question answering over Bengali (natively pronounced as Bangla) text. Typically, existing approaches construct the dataset by directly translating them from English to Bengali, which produces noisy and improper sentence structures. Furthermore, they lack topics and terminologies related to the Bengali language and people. This paper introduces BanglaQuAD, a Bengali question answering dataset, containing 30,808 question-answer pairs constructed from Bengali Wikipedia articles by native speakers. Additionally, we propose an annotation tool that facilitates question-answering dataset construction on a local machine. A qualitative analysis demonstrates the quality of our proposed dataset.
Related papers
- BNLI: A Linguistically-Refined Bengali Dataset for Natural Language Inference [1.7688536690159165]
Existing Bengali NLI datasets exhibit several inconsistencies, including annotation errors, ambiguous sentence pairs, and inadequate linguistic diversity.<n>We introduce BNLI, a refined and linguistically curated Bengali NLI dataset designed to support robust language understanding and inference modeling.<n>We benchmarked BNLI using a suite of state-of-the-art transformer-based architectures, including multilingual and Bengali-specific models, to assess their ability to capture complex semantic relations.
arXiv Detail & Related papers (2025-11-11T22:29:14Z) - Bangla-Bayanno: A 52K-Pair Bengali Visual Question Answering Dataset with LLM-Assisted Translation Refinement [0.0]
We introduce Bangla-Bayanno, an open-ended Visual Question Answering (VQA) dataset in Bangla.<n>The dataset comprises 52,650 question-answer pairs across 4750+ images.
arXiv Detail & Related papers (2025-08-27T13:48:04Z) - CaLMQA: Exploring culturally specific long-form question answering across 23 languages [58.18984409715615]
CaLMQA is a collection of 1.5K culturally specific questions spanning 23 languages and 51 culturally translated questions from English into 22 other languages.
We collect naturally-occurring questions from community web forums and hire native speakers to write questions to cover under-studied languages such as Fijian and Kirundi.
Our dataset contains diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers.
arXiv Detail & Related papers (2024-06-25T17:45:26Z) - CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture COSTA that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
COSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z) - Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA)
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering [0.4194295877935868]
This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages.
We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples.
arXiv Detail & Related papers (2024-04-20T12:16:35Z) - BEnQA: A Question Answering and Reasoning Benchmark for Bengali and English [18.217122567176585]
We introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh.
Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions.
We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English.
arXiv Detail & Related papers (2024-03-16T11:27:42Z) - BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal
Reference Annotations [0.0]
We introduce a new dataset, BenCoref, comprising coreference annotations for Bengali texts gathered from four distinct domains.
This relatively small dataset contains 5200 mention annotations forming 502 mention clusters within 48,569 tokens.
arXiv Detail & Related papers (2023-04-07T15:08:46Z) - Challenge Dataset of Cognates and False Friend Pairs from Indian
Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z) - A Dataset of Information-Seeking Questions and Answers Anchored in
Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers.
Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z) - Simple or Complex? Learning to Predict Readability of Bengali Texts [6.860272388539321]
We present a readability analysis tool capable of analyzing text written in the Bengali language.
Despite being the 7th most spoken language in the world with 230 million native speakers, Bengali suffers from a lack of fundamental resources for natural language processing.
arXiv Detail & Related papers (2020-12-09T01:41:35Z) - XOR QA: Cross-lingual Open-Retrieval Question Answering [75.20578121267411]
This work extends open-retrieval question answering to a cross-lingual setting.
We construct a large-scale dataset built on questions lacking same-language answers.
arXiv Detail & Related papers (2020-10-22T16:47:17Z) - Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z) - Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New
Datasets for Bengali-English Machine Translation [6.2418269277908065]
Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in machine translation literature due to being low in resources.
We build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation on low-resource setups.
With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising of 2.75 million sentence pairs.
arXiv Detail & Related papers (2020-09-20T06:06:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.