SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
- URL: http://arxiv.org/abs/2406.14425v3
- Date: Mon, 16 Sep 2024 21:52:55 GMT
- Title: SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
- Authors: Gayane Ghazaryan, Erik Arakelyan, Pasquale Minervini, Isabelle Augenstein
- Abstract summary: Question Answering datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation.
We propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulty of collection and manual annotation. This makes it challenging to produce novel models and to measure the performance of multilingual LLMs in low-resource languages. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs forms the final QA dataset. The method maintains content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. A human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy, with some models performing close to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in a low-resource language.
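The abstract describes a concrete pipeline: mine parallel human-curated paragraphs, generate synthetic MC question-answer pairs from the English side, machine-translate them, filter by translation quality, and pair the surviving QA items with the human-curated target-language paragraphs. Below is a minimal Python sketch of that flow; `generate_mc_qa`, `translate`, and `quality_score` are hypothetical stand-ins (the paper does not publish this interface), and the quality threshold is illustrative only.

```python
# Minimal sketch of the pipeline described above, NOT the authors' code.
# generate_mc_qa, translate, and quality_score are hypothetical stand-ins
# for the paper's LLM generation, machine translation, and validation steps.
from dataclasses import dataclass

@dataclass
class MCQA:
    question: str
    choices: list[str]
    answer_idx: int

def generate_mc_qa(english_paragraph: str) -> list[MCQA]:
    """Hypothetical: prompt an LLM with the English paragraph as context."""
    raise NotImplementedError

def translate(text: str, target_lang: str) -> str:
    """Hypothetical: machine-translate `text` into `target_lang`."""
    raise NotImplementedError

def quality_score(source: str, translation: str) -> float:
    """Hypothetical: reference-free translation quality estimate in [0, 1]."""
    raise NotImplementedError

def build_dataset(parallel_paragraphs, target_lang="hy", min_quality=0.5):
    """parallel_paragraphs: (english_paragraph, human_curated_target_paragraph)
    pairs obtained via parallel content mining. The threshold is illustrative;
    the paper reports its validation filters out ~70% of poor translations."""
    dataset = []
    for en_para, target_para in parallel_paragraphs:
        for qa in generate_mc_qa(en_para):
            source_texts = [qa.question, *qa.choices]
            translated = [translate(t, target_lang) for t in source_texts]
            # Keep a sample only if every translated field passes validation.
            if all(quality_score(s, t) >= min_quality
                   for s, t in zip(source_texts, translated)):
                dataset.append({
                    "context": target_para,  # human-curated, never machine-translated
                    "question": translated[0],
                    "choices": translated[1:],
                    "answer": qa.answer_idx,
                })
    return dataset
```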
Related papers
- Table Question Answering for Low-resourced Indic Languages [71.57359949962678]
TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output.
We introduce a fully automatic, large-scale tableQA data generation process for low-resource languages with a limited budget.
We apply our data generation method to two Indic languages, Bengali and Hindi, which have no tableQA datasets or models.
arXiv Detail & Related papers (2024-10-04T16:26:12Z)
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- LAMPAT: Low-Rank Adaption for Multilingual Paraphrasing Using Adversarial Training [19.173992333194683]
Paraphrases are texts that convey the same meaning while using different words or sentence structures.
Previous studies have leveraged the knowledge from the machine translation field, forming a paraphrase through zero-shot machine translation in the same language.
We propose LAMPAT, the first unsupervised multilingual paraphrasing model, for which a monolingual dataset is sufficient to generate human-like and diverse sentences.
arXiv Detail & Related papers (2024-01-09T04:19:16Z)
- Leveraging Closed-Access Multilingual Embedding for Automatic Sentence Alignment in Low Resource Languages [2.4023321876826462]
We present a simple but effective parallel sentence aligner that leverages the closed-access Cohere multilingual embeddings.
The proposed approach achieved F1 scores of $94.96$ and $54.83$ on FLORES and MAFAND-MT, compared to $3.64$ and $0.64$ for LASER, respectively.
Our method also achieved an improvement of more than 5 BLEU points over LASER when the resulting datasets were used with MAFAND-MT to train translation models; a minimal sketch of the alignment idea follows this entry.
arXiv Detail & Related papers (2023-11-20T20:48:25Z)
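The aligner above is described only at a high level; a common way to implement embedding-based alignment is mutual nearest-neighbour matching by cosine similarity. The sketch below assumes a hypothetical `embed` function standing in for the closed-access Cohere multilingual embeddings; it illustrates the general technique, not the paper's code.

```python
# Illustrative sketch of embedding-based parallel sentence alignment.
# `embed` is a hypothetical stand-in for a multilingual sentence encoder.
import numpy as np

def embed(sentences: list[str]) -> np.ndarray:
    """Hypothetical: return one L2-normalized embedding per sentence."""
    raise NotImplementedError

def align(src_sentences, tgt_sentences, threshold=0.7):
    """Pair sentences that are mutual nearest neighbours in embedding space."""
    src_emb = embed(src_sentences)   # shape: (n_src, dim)
    tgt_emb = embed(tgt_sentences)   # shape: (n_tgt, dim)
    sim = src_emb @ tgt_emb.T        # cosine similarity (unit vectors)
    best_tgt = sim.argmax(axis=1)    # best target for each source sentence
    best_src = sim.argmax(axis=0)    # best source for each target sentence
    pairs = []
    for i, j in enumerate(best_tgt):
        # Keep a pair only if the match is mutual and confident enough.
        if best_src[j] == i and sim[i, j] >= threshold:
            pairs.append((src_sentences[i], tgt_sentences[j], float(sim[i, j])))
    return pairs
```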
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models under a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z)
- MuCoT: Multilingual Contrastive Training for Question-Answering in Low-resource Languages [4.433842217026879]
Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages.
We augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts question-answering performance; a sketch of this augmentation scheme follows this entry.
arXiv Detail & Related papers (2022-04-12T13:52:54Z)
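A rough sketch of the translation/transliteration augmentation that MuCoT describes is below; `translate`, `transliterate`, and the example language list are hypothetical stand-ins, and the paper itself fine-tunes an mBERT QA model on the augmented set.

```python
# Rough sketch of translation/transliteration data augmentation; the
# components and language codes here are hypothetical examples only.
def translate(text: str, lang: str) -> str:
    raise NotImplementedError  # hypothetical MT component

def transliterate(text: str, script: str) -> str:
    raise NotImplementedError  # hypothetical transliteration component

def augment_qa(samples, related_langs=("hi", "bn", "mr")):
    """samples: dicts with 'context', 'question', 'answer' strings in the
    target language. Returns the originals plus translated/transliterated
    copies, which can then be used to fine-tune an mBERT-based QA model."""
    augmented = list(samples)
    for s in samples:
        for lang in related_langs:
            augmented.append({
                "context": translate(s["context"], lang),
                "question": translate(s["question"], lang),
                "answer": translate(s["answer"], lang),
            })
        # Transliteration keeps the language but changes the script.
        augmented.append({k: transliterate(v, "latin") for k, v in s.items()})
    return augmented
```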
- Synthetic Data Augmentation for Zero-Shot Cross-Lingual Question Answering [8.558954185502012]
We propose a method to improve cross-lingual Question Answering performance without requiring additional annotated data.
We report a new state of the art on four multilingual datasets: MLQA, XQuAD, SQuAD-it, and PIAF (fr).
arXiv Detail & Related papers (2020-10-23T20:09:01Z)