Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian
- URL: http://arxiv.org/abs/2404.08617v1
- Date: Fri, 12 Apr 2024 17:27:54 GMT
- Title: Synthetic Dataset Creation and Fine-Tuning of Transformer Models for Question Answering in Serbian
- Authors: Aleksa Cvetanović, Predrag Tadić,
- Abstract summary: We create the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr.
To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset.
Best results were obtained by fine-tuning the BERTi'c model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on generating a synthetic question answering (QA) dataset using an adapted Translate-Align-Retrieve method. Using this method, we created the largest Serbian QA dataset of more than 87K samples, which we name SQuAD-sr. To acknowledge the script duality in Serbian, we generated both Cyrillic and Latin versions of the dataset. We investigate the dataset quality and use it to fine-tune several pre-trained QA models. Best results were obtained by fine-tuning the BERTi\'c model on our Latin SQuAD-sr dataset, achieving 73.91% Exact Match and 82.97% F1 score on the benchmark XQuAD dataset, which we translated into Serbian for the purpose of evaluation. The results show that our model exceeds zero-shot baselines, but fails to go beyond human performance. We note the advantage of using a monolingual pre-trained model over multilingual, as well as the performance increase gained by using Latin over Cyrillic. By performing additional analysis, we show that questions about numeric values or dates are more likely to be answered correctly than other types of questions. Finally, we conclude that SQuAD-sr is of sufficient quality for fine-tuning a Serbian QA model, in the absence of a manually crafted and annotated dataset.
Related papers
- AmaSQuAD: A Benchmark for Amharic Extractive Question Answering [0.0]
This research presents a novel framework for translating extractive question-answering datasets into low-resource languages.
The methodology addresses challenges related to misalignment between translated questions and answers.
We fine-tune the XLM-R model on the AmaSQuAD synthetic dataset for Amharic Question-Answering.
arXiv Detail & Related papers (2025-02-04T06:27:39Z) - Building a Rich Dataset to Empower the Persian Question Answering Systems [0.6138671548064356]
This dataset is called NextQuAD and has 7,515 contexts, including 23,918 questions and answers.
BERT-based question answering model has been applied to this dataset using two pre-trained language models.
Evaluation on the development set shows 0.95 Exact Match (EM) and 0.97 Fl_score.
arXiv Detail & Related papers (2024-12-28T16:53:25Z) - KET-QA: A Dataset for Knowledge Enhanced Table Question Answering [63.56707527868466]
We propose to use a knowledge base (KB) as the external knowledge source for TableQA.
Every question requires the integration of information from both the table and the sub-graph to be answered.
We design a retriever-reasoner structured pipeline model to extract pertinent information from the vast knowledge sub-graph.
arXiv Detail & Related papers (2024-05-13T18:26:32Z) - Strategies for improving low resource speech to text translation relying
on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST)
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Question Answering and Question Generation for Finnish [0.8426855646402236]
We present the first neural QA and QG models that work with Finnish.
To train the models, we automatically translate the SQuAD dataset.
Using the synthetic data, together with the Finnish partition of the TyDi-QA dataset, we fine-tune several transformer-based models to both QA and QG.
arXiv Detail & Related papers (2022-11-24T20:40:00Z) - QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models under a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
arXiv Detail & Related papers (2022-11-15T16:14:39Z) - Czech Dataset for Cross-lingual Subjectivity Classification [13.70633147306388]
We introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions.
Two annotators annotated the dataset reaching 0.83 of the Cohen's kappa inter-annotator agreement.
We fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset and we achieve 93.56% of accuracy.
arXiv Detail & Related papers (2022-04-29T07:31:46Z) - Template-Based Question Generation from Retrieved Sentences for Improved
Unsupervised Question Answering [98.48363619128108]
We propose an unsupervised approach to training QA models with generated pseudo-training data.
We show that generating questions for QA training by applying a simple template on a related, retrieved sentence rather than the original context sentence improves downstream QA performance.
arXiv Detail & Related papers (2020-04-24T17:57:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.