Related papers: FQuAD: French Question Answering Dataset

URL: http://arxiv.org/abs/2002.06071v2
Date: Mon, 25 May 2020 17:09:17 GMT
Title: FQuAD: French Question Answering Dataset
Authors: Martin d'Hoffschmidt, Wacim Belblidia, Tom Brendlé, Quentin Heinrich, Maxime Vidal
Abstract summary: We introduce the French Question Answering Dataset (FQuAD). FQuAD is a French native reading comprehension dataset of questions and answers on a set of Wikipedia articles. We train a baseline model which achieves an F1 score of 92.2 and an exact match ratio of 82.1 on the test set.
Score: 0.4759823735082845
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in the field of language modeling have improved state-of-the-art results on many Natural Language Processing tasks. Among them, Reading Comprehension has made significant progress over the past few years. However, most results are reported in English since labeled resources available in other languages, such as French, remain scarce. In the present work, we introduce the French Question Answering Dataset (FQuAD). FQuAD is a French Native Reading Comprehension dataset of questions and answers on a set of Wikipedia articles that consists of 25,000+ samples for the 1.0 version and 60,000+ samples for the 1.1 version. We train a baseline model which achieves an F1 score of 92.2 and an exact match ratio of 82.1 on the test set. In order to track the progress of French Question Answering models we propose a leader-board and we have made the 1.0 version of our dataset freely available at https://illuin-tech.github.io/FQuAD-explorer/.
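The F1 and exact match figures quoted above are the usual metrics for extractive question answering. The following is a minimal sketch of how such per-answer scores are commonly computed in SQuAD-style evaluation. It is not the FQuAD authors' evaluation script: the normalization step (lowercasing, stripping ASCII punctuation, whitespace tokenization) and the function names `normalize`, `exact_match`, and `f1_score` are assumptions made for illustration, and any French-specific handling such as article removal or elision is left out.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split on whitespace.

    Assumption: this simplified normalization is illustrative only; an
    official evaluation script may also drop articles or handle elisions.
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Paris.", "paris"))
    print(f1_score("la tour Eiffel", "tour Eiffel"))
```

In practice, dataset-level scores are typically obtained by taking the maximum score over all reference answers for each question and averaging over questions, then reporting the result as a percentage.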

Related papers

Lugha-Llama: Adapting Large Language Models for African Languages [48.97516583523523]
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. We consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z)
WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval [0.8478469524684645]
WebFAQ is a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs.
arXiv Detail & Related papers (2025-02-28T10:46:52Z)
UQA: Corpus for Urdu Question Answering [3.979019316355144]
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset. The paper describes the process of selecting and evaluating the better translation model of two candidates: Google Translator and Seamless M4T.
arXiv Detail & Related papers (2024-05-02T16:44:31Z)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
Cross-Lingual Knowledge Distillation for Answer Sentence Selection in Low-Resource Languages [90.41827664700847]
We propose Cross-Lingual Knowledge Distillation (CLKD) from a strong English AS2 teacher as a method to train AS2 models for low-resource languages. To evaluate our method, we introduce 1) Xtr-WikiQA, a translation-based WikiQA dataset for 9 additional languages, and 2) TyDi-AS2, a multilingual AS2 dataset with over 70K questions spanning 8 typologically diverse languages.
arXiv Detail & Related papers (2023-05-25T17:56:04Z)
An Open Dataset and Model for Language Identification [84.15194457400253]
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages. We make both the model and the dataset available to the research community.
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages. We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts. We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data. We design a simple but effective ensemble-based framework that combines various transfer learning techniques. We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
TaTa: A Multilingual Table-to-Text Dataset for African Languages [32.348630887289524]
Table-to-Text in African languages (TaTa) is the first large multilingual table-to-text dataset with a focus on African languages. TaTa includes 8,700 examples in nine languages including four African languages (Hausa, Igbo, Swahili, and Yorùbá) and a zero-shot test language (Russian).
arXiv Detail & Related papers (2022-10-31T21:05:42Z)
UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension [0.0]
This work explores the semi-automated creation of the Urdu Question Answering dataset (UQuAD1.0). In UQuAD1.0, 45,000 QA pairs were generated by machine translation of the original SQuAD1.0 and approximately 4,000 pairs via crowdsourcing. Using XLMRoBERTa and multi-lingual BERT, we obtain F1 scores of 0.66 and 0.63, respectively.
arXiv Detail & Related papers (2021-11-02T12:25:04Z)
FQuAD2.0: French Question Answering and knowing that you know nothing [0.25782420501870296]
We introduce FQuAD2.0, which extends FQuAD with 17,000+ unanswerable questions. This dataset makes it possible to train French Question Answering models that can distinguish unanswerable questions from answerable ones.
arXiv Detail & Related papers (2021-09-27T17:30:46Z)
MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first publicly available multilingual FAQ dataset. We collected around 6M FAQ pairs from the web, in 21 different languages. We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
arXiv Detail & Related papers (2021-09-27T08:43:25Z)
CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English. It is diversified with over 11,000 speakers and over 60 accents. CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.