Alloprof: a new French question-answer education dataset and its use in
an information retrieval case study
- URL: http://arxiv.org/abs/2302.07738v2
- Date: Fri, 14 Apr 2023 13:20:07 GMT
- Title: Alloprof: a new French question-answer education dataset and its use in
an information retrieval case study
- Authors: Antoine Lefebvre-Brossard, Stephane Gazaille, Michel C. Desmarais
- Abstract summary: We introduce a new public French question-answering dataset from Alloprof, a Quebec-based help website.
This dataset contains 29 349 questions and their explanations in a variety of school subjects from 10 368 students.
To predict relevant documents, architectures using pre-trained BERT models were fine-tuned and evaluated.
- Score: 0.13750624267664155
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Teachers and students are increasingly relying on online learning resources
to supplement the ones provided in school. This increase in the breadth and
depth of available resources benefits students only if they are able to find
answers to their queries. Question-answering and
information retrieval systems have benefited from public datasets to train and
evaluate their algorithms, but most of these datasets consist of English text
written by and for adults. We introduce a new public French question-answering
dataset collected from Alloprof, a Quebec-based primary and high-school help
website, containing 29 349 questions and their explanations in a variety of
school subjects from 10 368 students, with more than half of the explanations
containing links to other questions or some of the 2 596 reference pages on the
website. We also present a case study of this dataset in an information
retrieval task. This dataset was collected on the Alloprof public forum, with
all questions verified for their appropriateness and the explanations verified
both for their appropriateness and their relevance to the question. To predict
relevant documents, architectures using pre-trained BERT models were fine-tuned
and evaluated. This dataset will allow researchers to develop
question-answering, information retrieval and other algorithms specifically for
the French-speaking education context. Furthermore, the range of language
proficiency, images, mathematical symbols and spelling mistakes will
necessitate algorithms based on multimodal comprehension. The case study we
present as a baseline shows that an approach relying on recent techniques
provides an acceptable performance level, but more work is necessary before it
can reliably be used and trusted in a production setting.
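The retrieval task described in the abstract (predicting which reference pages are relevant to a student question) can be illustrated with a minimal sketch. It assumes questions and pages have already been mapped to vectors by some encoder, for example a fine-tuned BERT model. The abstract does not specify the architecture, so the cosine-similarity ranking, the toy vectors, the page identifiers and the metrics (recall@k, reciprocal rank) are all illustrative assumptions, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs):
    """Return document ids sorted by decreasing similarity to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    # Sort by score (descending), breaking ties by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Hypothetical embeddings: in practice these would come from an encoder model.
doc_vecs = {
    "page_fractions": [0.9, 0.1, 0.0],
    "page_photosynthese": [0.0, 0.2, 0.9],
    "page_pythagore": [0.6, 0.7, 0.1],
}
query_vec = [0.8, 0.3, 0.0]      # hypothetical embedding of a student question
relevant = {"page_pythagore"}    # hypothetical page linked in the explanation

ranked = rank_documents(query_vec, doc_vecs)
print(ranked)
print(recall_at_k(ranked, relevant, 1), recall_at_k(ranked, relevant, 2))
print(reciprocal_rank(ranked, relevant))
```

In this toy setup the relevant page is ranked second, so recall@1 is 0 while recall@2 is 1. Gaps of this kind are one way to quantify the room for improvement the abstract mentions.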
Related papers
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain
Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model
for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion that identifies objects in images based on questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- Large Language Models Meet Knowledge Graphs to Answer Factoid Questions [57.47634017738877]
We propose a method for exploring pre-trained Text-to-Text Language Models enriched with additional information from Knowledge Graphs.
We procure easily interpreted information with Transformer-based models through the linearization of the extracted subgraphs.
Final re-ranking of the answer candidates with the extracted information boosts Hits@1 scores of the pre-trained text-to-text language models by 4-6%.
arXiv Detail & Related papers (2023-10-03T15:57:00Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain [20.801638768447948]
This dataset contains 3,397 samples of multiple choice questions, answers (including distractors), and their source documents from the educational domain.
Each question is phrased in two forms, normal and cloze. Correct answers are linked to source documents with sentence-level annotations.
All questions have been generated by educational experts rather than crowd workers to ensure they are maintaining educational and learning standards.
arXiv Detail & Related papers (2022-10-12T11:28:34Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- A Dataset of Information-Seeking Questions and Answers Anchored in
Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers.
Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z)
- English Machine Reading Comprehension Datasets: A Survey [13.767812547998735]
We categorize the datasets according to their question and answer form and compare them across various dimensions including size, vocabulary, data source, method of creation, human performance level, and first question word.
Our analysis reveals that Wikipedia is by far the most common data source and that there is a relative lack of why, when, and where questions across datasets.
arXiv Detail & Related papers (2021-01-25T21:15:06Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- Educational Question Mining At Scale: Prediction, Analysis and
Personalization [35.42197158180065]
We propose a framework for mining insights from educational questions at scale.
We utilize the state-of-the-art Bayesian deep learning method, in particular partial variational auto-encoders (p-VAE).
We apply our proposed framework to a real-world dataset with tens of thousands of questions and tens of millions of answers from an online education platform.
arXiv Detail & Related papers (2020-03-12T19:07:49Z)
- Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using
Zero-shot Learning [30.868309879441615]
We tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents.
Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and Spanish.
arXiv Detail & Related papers (2019-12-30T20:46:38Z)