Pirá: A Bilingual Portuguese-English Dataset for Question-Answering
about the Ocean
- URL: http://arxiv.org/abs/2202.02398v1
- Date: Fri, 4 Feb 2022 21:29:45 GMT
- Title: Pirá: A Bilingual Portuguese-English Dataset for Question-Answering
about the Ocean
- Authors: André F. A. Paschoal, Paulo Pirozelli, Valdinei Freire, Karina V.
Delgado, Sarajane M. Peres, Marcos M. José, Flávio Nakasato, André S.
Oliveira, Anarosa A. F. Brandão, Anna H. R. Costa, Fabio G. Cozman
- Abstract summary: This paper presents the Pirá dataset, a large set of questions and answers about the ocean and the Brazilian coast both in Portuguese and English.
The Pirá dataset consists of 2261 properly curated question/answer (QA) sets in both languages.
We discuss some of the advantages as well as limitations of Pirá, as this new resource can support a set of tasks in NLP such as question-answering, information retrieval, and machine translation.
- Score: 1.1837802026343334
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current research in natural language processing is highly dependent on
carefully produced corpora. Most existing resources focus on English; some
resources focus on languages such as Chinese and French; few resources deal
with more than one language. This paper presents the Pirá dataset, a large
set of questions and answers about the ocean and the Brazilian coast both in
Portuguese and English. Pirá is, to the best of our knowledge, the first QA
dataset with supporting texts in Portuguese, and, perhaps more importantly, the
first bilingual QA dataset that includes this language. The Pirá dataset
consists of 2261 properly curated question/answer (QA) sets in both languages.
The QA sets were manually created based on two corpora: abstracts related to
the Brazilian coast and excerpts of United Nations reports about the ocean. The
QA sets were validated in a peer-review process with the dataset contributors.
We discuss some of the advantages as well as limitations of Pirá, as this new
resource can support a set of tasks in NLP such as question-answering,
information retrieval, and machine translation.
Related papers
- INDIC QA BENCHMARK: A Multilingual Benchmark to Evaluate Question Answering capability of LLMs for Indic Languages [26.13077589552484]
Indic-QA is the largest publicly available context-grounded question-answering dataset for 11 major Indian languages from two language families.
We generate a synthetic dataset using the Gemini model to create question-answer pairs given a passage, which is then manually verified for quality assurance.
We evaluate various multilingual Large Language Models and their instruction-fine-tuned variants on the benchmark and observe that their performance is subpar, particularly for low-resource languages.
arXiv Detail & Related papers (2024-07-18T13:57:16Z) - Datasets for Multilingual Answer Sentence Selection [59.28492975191415]
We introduce new high-quality datasets for AS2 in five European languages (French, German, Italian, Portuguese, and Spanish).
Results indicate that our datasets are pivotal in producing robust and powerful multilingual AS2 models.
arXiv Detail & Related papers (2024-06-14T16:50:29Z) - Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA)
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z) - Benchmarks for Pirá 2.0, a Reading Comprehension Dataset about the
Ocean, the Brazilian Coast, and Climate Change [0.24091079613649843]
Pirá is a reading comprehension dataset focused on the ocean, the Brazilian coast, and climate change.
This dataset represents a versatile language resource, particularly useful for testing the ability of current machine learning models to acquire expert scientific knowledge.
arXiv Detail & Related papers (2023-09-19T21:56:45Z) - Evaluating and Modeling Attribution for Cross-Lingual Question Answering [80.4807682093432]
This work is the first to study attribution for cross-lingual question answering.
We collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system.
We find that a substantial portion of the answers is not attributable to any retrieved passages.
arXiv Detail & Related papers (2023-05-23T17:57:46Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - Cross-Lingual Question Answering over Knowledge Base as Reading
Comprehension [61.079852289005025]
Cross-lingual question answering over knowledge base (xKBQA) aims to answer questions in languages different from that of the provided knowledge base.
One of the major challenges facing xKBQA is the high cost of data annotation.
We propose a novel approach for xKBQA in a reading comprehension paradigm.
arXiv Detail & Related papers (2023-02-26T05:52:52Z) - A Chinese Multi-type Complex Questions Answering Dataset over Wikidata [45.31495982252219]
Complex Knowledge Base Question Answering has been a popular area of research over the past decade.
Recent public datasets have led to encouraging results in this field, but are mostly limited to English.
Few state-of-the-art KBQA models are trained on Wikidata, one of the most popular real-world knowledge bases.
We propose CLC-QuAD, the first large scale complex Chinese semantic parsing dataset over Wikidata to address these challenges.
arXiv Detail & Related papers (2021-11-11T07:39:16Z) - MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first publicly available multilingual FAQ dataset.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
arXiv Detail & Related papers (2021-09-27T08:43:25Z) - XOR QA: Cross-lingual Open-Retrieval Question Answering [75.20578121267411]
This work extends open-retrieval question answering to a cross-lingual setting.
We construct a large-scale dataset built on questions lacking same-language answers.
arXiv Detail & Related papers (2020-10-22T16:47:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.