MuLVE, A Multi-Language Vocabulary Evaluation Data Set
- URL: http://arxiv.org/abs/2201.06286v1
- Date: Mon, 17 Jan 2022 09:02:59 GMT
- Title: MuLVE, A Multi-Language Vocabulary Evaluation Data Set
- Authors: Anik Jacobsen, Salar Mohtaj, Sebastian Möller
- Abstract summary: This work introduces the Multi-Language Vocabulary Evaluation Data Set (MuLVE), a data set consisting of vocabulary cards and real-life user answers.
The data set contains vocabulary questions in German, with English, Spanish, and French as target languages.
We fine-tune pre-trained BERT language models on the downstream task of vocabulary evaluation using the proposed MuLVE data set.
- Score: 2.9005223064604078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vocabulary learning is vital to foreign language learning. Correct and
adequate feedback is essential to successful and satisfying vocabulary
training. However, many vocabulary and language evaluation systems rely on
simple rules and do not account for real-life user learning data. This work
introduces the Multi-Language Vocabulary Evaluation Data Set (MuLVE), a data
set consisting of vocabulary cards and real-life user answers, labeled to
indicate whether the user answer is correct or incorrect. The data source is
user learning data from the Phase6 vocabulary trainer. The data set contains
vocabulary questions in German, with English, Spanish, and French as target
languages, and is available in four variations that differ in pre-processing
and deduplication. We fine-tune pre-trained BERT language models on the
downstream task of vocabulary evaluation using the proposed MuLVE data set.
The fine-tuned models achieve outstanding results, with accuracy and F2-score
above 95.5.
The data set is available on the European Language Grid.
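The fine-tuning setup described above is a standard sentence-pair classification task: BERT receives the vocabulary question and the user answer as a pair and predicts correct or incorrect. Below is a minimal sketch of such a pipeline using Hugging Face transformers; the model checkpoint, column names, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: fine-tune a pre-trained BERT model to judge whether a
# user's answer to a vocabulary card is correct (binary classification).
# Assumptions (not from the paper): model checkpoint, column names,
# hyperparameters, and the toy data.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy stand-in for MuLVE rows: vocabulary question, user answer, correctness label.
data = Dataset.from_dict({
    "question": ["das Haus", "der Hund"],
    "answer":   ["the house", "the cat"],
    "label":    [1, 0],  # 1 = correct, 0 = incorrect
})

def tokenize(batch):
    # Encode question and answer together as a sentence pair, as BERT expects.
    return tokenizer(batch["question"], batch["answer"],
                     truncation=True, padding="max_length", max_length=64)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mulve-bert", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data,
)
trainer.train()
```

Accuracy and the F2-score quoted above could then be computed on held-out data, for example with sklearn.metrics.fbeta_score and beta=2.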
Related papers
- Are BabyLMs Second Language Learners? [48.85680614529188]
This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge.
Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective.
arXiv Detail & Related papers (2024-10-28T17:52:15Z)
- Is Child-Directed Speech Effective Training Data for Language Models? [34.46268640655943]
We train GPT-2 and RoBERTa models on 29M words of English child-directed speech.
We test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets.
These findings support the hypothesis that, rather than proceeding from better data, the child's learning algorithm is substantially more data-efficient than current language modeling techniques.
arXiv Detail & Related papers (2024-08-07T08:18:51Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- XFEVER: Exploring Fact Verification across Languages [40.1637899493061]
This paper introduces the Cross-lingual Fact Extraction and VERification dataset designed for benchmarking the fact verification models across different languages.
We constructed it by translating the claim and evidence texts of the Fact Extraction and VERification dataset into six languages.
The training and development sets were translated using machine translation, whereas the test set includes texts translated by professional translators and machine-translated texts.
arXiv Detail & Related papers (2023-10-25T01:20:17Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Leveraging Language Identification to Enhance Code-Mixed Text Classification [0.7340017786387767]
Existing deep-learning models do not take advantage of the implicit language information in code-mixed text.
Our study aims to improve the performance of BERT-based models on low-resource code-mixed Hindi-English datasets.
arXiv Detail & Related papers (2023-06-08T06:43:10Z)
- Pedagogical Word Recommendation: A novel task and dataset on personalized vocabulary acquisition for L2 learners [4.507860128918788]
We propose and release data for a novel task called Pedagogical Word Recommendation.
The main goal of PWR is to predict whether a given learner knows a given word based on other words the learner has already seen.
As a feature of this intelligent tutoring system (ITS), students can directly indicate words they do not know from the questions they have solved, creating wordbooks.
arXiv Detail & Related papers (2021-12-27T17:52:48Z)
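Read as a prediction problem, PWR (does learner u know word w, given the other words u has seen?) resembles collaborative filtering. The sketch below is one generic reading of that setup, logistic matrix factorization over a learner-word matrix; all sizes, names, and the training loop are illustrative assumptions, not the paper's model.

```python
# Illustrative sketch (not the paper's model): treat Pedagogical Word
# Recommendation as collaborative filtering, factorizing a learner-word
# "knows it" matrix with logistic matrix factorization.
import torch

n_learners, n_words, dim = 100, 500, 16  # assumed sizes
learner_emb = torch.nn.Embedding(n_learners, dim)
word_emb = torch.nn.Embedding(n_words, dim)
opt = torch.optim.Adam(
    list(learner_emb.parameters()) + list(word_emb.parameters()), lr=0.01)

# Toy observations: (learner_id, word_id, knows_word)
obs = torch.tensor([[0, 10, 1], [0, 42, 0], [1, 10, 1]])

for _ in range(200):
    u, w, y = obs[:, 0], obs[:, 1], obs[:, 2].float()
    logits = (learner_emb(u) * word_emb(w)).sum(dim=1)  # dot-product score
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()

# Predicted probability that learner 0 knows word 42.
p = torch.sigmoid((learner_emb.weight[0] * word_emb.weight[42]).sum())
```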
- VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z)
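The teacher-to-student transfer described in the VidLanKD entry is a knowledge-distillation pattern. The sketch below shows the generic soft-label distillation loss (KL divergence between temperature-scaled teacher and student logits); VidLanKD's actual objectives are more specialized, and the temperature and tensor shapes here are assumptions.

```python
# Generic knowledge-distillation loss (soft-label KL), illustrating the
# teacher-to-student transfer pattern that VidLanKD builds on. VidLanKD's own
# objectives differ; temperature and shapes are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then match the student to the teacher.
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * t * t

# Usage: the teacher is frozen; only the student receives gradients.
student_logits = torch.randn(8, 30522, requires_grad=True)  # e.g. vocab-size logits
with torch.no_grad():
    teacher_logits = torch.randn(8, 30522)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```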
- Multilingual Neural Semantic Parsing for Low-Resourced Languages [1.6244541005112747]
We introduce a new multilingual semantic parsing dataset in English, Italian and Japanese.
We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset.
We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
arXiv Detail & Related papers (2021-06-07T09:53:02Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
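AMBER's sentence-level alignment on parallel data can be illustrated with a contrastive objective that pulls the two encodings of a translation pair together while pushing apart mismatched pairs. The sketch below is a generic InfoNCE formulation under that reading; it is not AMBER's exact loss, and the abstract notes that AMBER also aligns representations at other granularities.

```python
# Sketch of a sentence-level alignment objective on parallel data, in the
# spirit of AMBER's explicit alignment training. This contrastive (InfoNCE)
# formulation is an illustration, not AMBER's exact objectives.
import torch
import torch.nn.functional as F

def sentence_alignment_loss(src_vecs, tgt_vecs, temperature=0.05):
    """src_vecs[i] and tgt_vecs[i] are pooled encodings of a translation pair."""
    src = F.normalize(src_vecs, dim=-1)
    tgt = F.normalize(tgt_vecs, dim=-1)
    sim = src @ tgt.T / temperature        # batch x batch similarity matrix
    labels = torch.arange(sim.size(0))     # the matching pair sits on the diagonal
    # Cross-entropy in both directions: source-to-target and target-to-source.
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2

# Toy usage with random "encoder outputs" for a batch of 4 parallel sentences.
loss = sentence_alignment_loss(torch.randn(4, 768), torch.randn(4, 768))
```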
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.