Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer
- URL: http://arxiv.org/abs/2007.01667v1
- Date: Fri, 3 Jul 2020 13:09:37 GMT
- Title: Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer
- Authors: Kateřina Macková, Milan Straka
- Abstract summary: This work focuses on building reading comprehension systems for Czech, without requiring any manually annotated Czech training data.
We automatically translated SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create training and development data.
We then trained and evaluated several BERT and XLM-RoBERTa baseline models.
- Score: 2.8273701718153563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reading comprehension is a well studied task, with huge training datasets in
English. This work focuses on building reading comprehension systems for Czech,
without requiring any manually annotated Czech training data. First of all, we
automatically translated SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create
training and development data, which we release at
http://hdl.handle.net/11234/1-3249. We then trained and evaluated several BERT
and XLM-RoBERTa baseline models. However, our main focus lies in cross-lingual
transfer models. We report that an XLM-RoBERTa model trained on English data and
evaluated on Czech achieves very competitive performance, only approximately 2
percentage points worse than a model trained on the translated Czech data. This
result is extremely good, considering that the model has not seen any Czech data
during training. The cross-lingual transfer approach is very flexible and
provides reading comprehension in any language for which we have enough
monolingual raw text.
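The cross-lingual transfer setup described above can be sketched with off-the-shelf tooling: an XLM-RoBERTa model fine-tuned only on English SQuAD data is applied directly to Czech questions and passages. The snippet below is an illustration that uses a publicly available checkpoint (deepset/xlm-roberta-base-squad2) as a stand-in for the paper's own models and assumes the Hugging Face transformers pipeline API.

```python
# Minimal sketch of zero-shot cross-lingual transfer for reading comprehension:
# the checkpoint below was fine-tuned on English SQuAD 2.0 only, yet it can be
# queried with Czech questions and contexts thanks to multilingual pre-training.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",  # illustrative stand-in, not the authors' model
)

context = (
    "Praha je hlavní město České republiky. "
    "Leží na řece Vltavě a žije v ní přibližně 1,3 milionu obyvatel."
)
question = "Na které řece leží Praha?"

# The model has seen no annotated Czech QA data; the answer span is still found in Czech.
print(qa(question=question, context=context))
```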
Related papers
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages.
We present the first fact-checking framework augmented with crosslingual retrieval.
We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT).
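As a rough illustration of the general Inverse Cloze Task recipe behind XICT (a generic sketch, not the paper's implementation): one sentence of a passage is held out as a pseudo-query and the remaining sentences serve as its pseudo-evidence; in the cross-lingual variant the held-out sentence is replaced by its translation.

```python
# Generic Inverse-Cloze-Task pair construction; `translate` is a placeholder for
# any machine-translation function and is only used in the cross-lingual variant.
import random

def make_ict_pair(passage_sentences, translate=None):
    # Hold out one sentence as the pseudo-query; the rest forms the pseudo-evidence.
    i = random.randrange(len(passage_sentences))
    query = passage_sentences[i]
    evidence = " ".join(passage_sentences[:i] + passage_sentences[i + 1:])
    if translate is not None:
        query = translate(query)  # cross-lingual: the query ends up in another language
    return query, evidence
```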
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
- Czech Dataset for Cross-lingual Subjectivity Classification [13.70633147306388]
We introduce a new Czech subjectivity dataset of 10k manually annotated subjective and objective sentences from movie reviews and descriptions.
Two annotators labeled the dataset, reaching a Cohen's kappa inter-annotator agreement of 0.83.
We fine-tune five pre-trained BERT-like models to set a monolingual baseline for the new dataset, achieving 93.56% accuracy.
arXiv Detail & Related papers (2022-04-29T07:31:46Z)
- CsFEVER and CTKFacts: Czech Datasets for Fact Verification [0.0]
We present two Czech datasets intended for training automated fact-checking machine learning models.
The first dataset, CsFEVER, contains approximately 112k claims and is an automatically generated Czech version of the well-known Wikipedia-based FEVER dataset.
The second dataset, CTKFacts, contains 3,097 claims and is built on a corpus of approximately two million Czech News Agency news reports.
arXiv Detail & Related papers (2022-01-26T18:48:42Z)
- From Good to Best: Two-Stage Training for Cross-lingual Machine Reading Comprehension [51.953428342923885]
We develop a two-stage approach to enhance model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
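A generic InfoNCE-style sketch of such an answer-aware contrastive objective (not the paper's exact loss): the gold answer representation is treated as the positive and the other top-k candidate spans as negatives.

```python
# Schematic answer-aware contrastive loss over candidate answer representations.
import torch
import torch.nn.functional as F

def answer_contrastive_loss(query_vec, gold_vec, negative_vecs, temperature=0.1):
    # query_vec: (d,) question/context representation
    # gold_vec: (d,) representation of the accurate answer span
    # negative_vecs: (k, d) representations of the other top-k candidates
    candidates = torch.cat([gold_vec.unsqueeze(0), negative_vecs], dim=0)   # (k+1, d)
    sims = F.cosine_similarity(query_vec.unsqueeze(0), candidates) / temperature
    target = torch.zeros(1, dtype=torch.long)                               # gold sits at index 0
    return F.cross_entropy(sims.unsqueeze(0), target)
```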
arXiv Detail & Related papers (2021-12-09T07:31:15Z)
- Bilingual Alignment Pre-training for Zero-shot Cross-lingual Transfer [33.680292990007366]
In this paper, we aim to improve zero-shot cross-lingual transfer performance by better aligning the embeddings.
We propose a pre-training task named Alignment Language Model (AlignLM), which uses statistical alignment information as prior knowledge to guide bilingual word prediction.
The results show that AlignLM significantly improves zero-shot performance on the MLQA and XNLI datasets.
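A schematic of the underlying idea (a hedged sketch, not the paper's implementation): a statistical word alignment, e.g. produced by a tool such as fast_align, gives each masked source token an aligned target-language token, which is used as an additional prediction target.

```python
# Schematic alignment-guided bilingual word prediction loss: the masked source
# token is predicted as usual, and its statistically aligned target-language
# token is predicted from the same position as an extra supervision signal.
import torch
import torch.nn.functional as F

def align_lm_loss(logits, masked_positions, src_gold_ids, aligned_tgt_ids):
    # logits: (seq_len, vocab) token predictions; masked_positions: (m,) indices
    # src_gold_ids / aligned_tgt_ids: (m,) vocabulary ids of the masked source
    # tokens and of their aligned target-language counterparts
    masked_logits = logits[masked_positions]
    mlm_term = F.cross_entropy(masked_logits, src_gold_ids)
    align_term = F.cross_entropy(masked_logits, aligned_tgt_ids)
    return mlm_term + align_term
```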
arXiv Detail & Related papers (2021-06-03T10:18:43Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance through self-training.
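A generic self-training loop of this kind can be sketched as follows; finetune and model.predict are placeholders, not the paper's API: confident predictions on unlabeled target-language data become pseudo-labels for another round of fine-tuning.

```python
# Generic pseudo-labelling / self-training sketch for cross-lingual reading comprehension.
def self_train(model, labeled_en, unlabeled_cs, rounds=2, threshold=0.8):
    model = finetune(model, labeled_en)                  # placeholder supervised step
    for _ in range(rounds):
        pseudo = []
        for question, context in unlabeled_cs:
            answer, confidence = model.predict(question, context)  # placeholder inference
            if confidence >= threshold:                  # keep only confident answer spans
                pseudo.append((question, context, answer))
        model = finetune(model, labeled_en + pseudo)     # retrain on gold + pseudo-labels
    return model
```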
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
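A generic multi-teacher distillation loss in this spirit (a sketch, not the paper's exact recipe) pushes the student's answer-span distribution towards the averaged soft predictions of the per-branch teachers:

```python
# Schematic multi-teacher knowledge distillation for span prediction.
import torch
import torch.nn.functional as F

def branch_distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    # student_logits: (batch, seq_len) span-start (or span-end) scores
    # teacher_logits_list: one (batch, seq_len) tensor per language-branch teacher
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)                                                    # averaged soft targets
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```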
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
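The released model can be used for direct non-English translation through its public Hugging Face integration; the snippet below (Czech to French, without pivoting through English) follows the documented usage of the facebook/m2m100_418M checkpoint.

```python
# Direct Czech -> French translation with M2M-100, no English pivot.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "cs"
inputs = tokenizer("Porozumění čtenému textu je dobře prostudovaná úloha.", return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```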
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
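A standard cross-lingual contrastive objective of this flavour (a generic InfoNCE sketch, not the paper's exact task) treats translation pairs as positives and other in-batch sentences as negatives:

```python
# In-batch InfoNCE loss over sentence embeddings of parallel (translation) pairs.
import torch
import torch.nn.functional as F

def cross_lingual_info_nce(src_emb, tgt_emb, temperature=0.05):
    # src_emb, tgt_emb: (batch, d) embeddings; row i of src is parallel to row i of tgt
    src = F.normalize(src_emb, dim=-1)
    tgt = F.normalize(tgt_emb, dim=-1)
    logits = src @ tgt.t() / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(src.size(0))          # true pairs lie on the diagonal
    return F.cross_entropy(logits, labels)
```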
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.