As good as new. How to successfully recycle English GPT-2 to make models
for other languages
- URL: http://arxiv.org/abs/2012.05628v1
- Date: Thu, 10 Dec 2020 12:27:16 GMT
- Title: As good as new. How to successfully recycle English GPT-2 to make models
for other languages
- Authors: Wietse de Vries, Malvina Nissim
- Abstract summary: We describe the adaptation of English GPT-2 to Italian and Dutch by retraining lexical embeddings without tuning the Transformer layers.
We show how to scale up complexity by transforming relearned lexical embeddings of GPT-2 small to the GPT-2 medium embedding space.
English GPT-2 models with relearned lexical embeddings can generate realistic sentences in Italian and Dutch, but on average these sentences are still identifiable as artificial by humans.
- Score: 3.6042575355093907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large generative language models have been very successful for English, but
other languages lag behind due to data and computational limitations. We
propose a method that may overcome these problems by adapting existing
pre-trained language models to new languages. Specifically, we describe the
adaptation of English GPT-2 to Italian and Dutch by retraining lexical
embeddings without tuning the Transformer layers. As a result, we obtain
lexical embeddings for Italian and Dutch that are aligned with the original
English lexical embeddings and induce a bilingual lexicon from this alignment.
Additionally, we show how to scale up complexity by transforming relearned
lexical embeddings of GPT-2 small to the GPT-2 medium embedding space. This
method minimises the amount of training and prevents losing information during
adaptation that was learned by GPT-2. English GPT-2 models with relearned
lexical embeddings can generate realistic sentences in Italian and Dutch, but
on average these sentences are still identifiable as artificial by humans.
Based on perplexity scores and human judgements, we find that generated
sentences become more realistic with some additional full model finetuning,
especially for Dutch. For Italian, we see that they are evaluated on par with
sentences generated by a GPT-2 model fully trained from scratch. Our work can
be conceived as a blueprint for training GPT-2s for other languages, and we
provide a 'recipe' to do so.
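To make the recipe above concrete, the sketch below (Python, HuggingFace `transformers`) illustrates the three steps the abstract describes: relearning only the lexical embeddings of English GPT-2 small with the Transformer layers frozen, reading off a bilingual lexicon from the aligned embedding spaces, and lifting the relearned embeddings into the GPT-2 medium space with a linear map fitted on the shared English vocabulary. The tokenizer path, re-initialisation, hyperparameters and the least-squares fit are illustrative assumptions, not the authors' released code or exact procedure.
```python
# Hedged sketch of the adaptation recipe described in the abstract, using the
# HuggingFace `transformers` library. The Dutch tokenizer path, hyperparameters,
# re-initialisation and the least-squares fit are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# --- Step 1: relearn lexical embeddings with the Transformer layers frozen ---
model = GPT2LMHeadModel.from_pretrained("gpt2")                    # English GPT-2 small
new_tok = GPT2TokenizerFast.from_pretrained("path/to/dutch-bpe")   # assumed: a BPE tokenizer trained on Dutch
model.resize_token_embeddings(len(new_tok))
torch.nn.init.normal_(model.transformer.wte.weight, std=0.02)      # restart lexical embeddings from scratch

for p in model.parameters():                         # freeze everything ...
    p.requires_grad = False
model.transformer.wte.weight.requires_grad = True    # ... except token embeddings (tied with lm_head)
model.transformer.wpe.weight.requires_grad = True    # ... and, optionally, positional embeddings

optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
batch = new_tok("Dit is een voorbeeldzin.", return_tensors="pt")   # real training iterates over a Dutch corpus
loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# --- Step 2: bilingual lexicon from the aligned embedding spaces -------------
# With the Transformer frozen, relearned Dutch embeddings stay aligned with the
# original English ones; nearest neighbours give a crude English-Dutch lexicon.
en_emb = GPT2LMHeadModel.from_pretrained("gpt2").transformer.wte.weight.data       # (50257, 768)
nl_emb = model.transformer.wte.weight.data                                          # (|V_nl|, 768)
sample = torch.arange(1000, 1010)                                                   # a few Dutch token ids
sims = F.normalize(nl_emb[sample], dim=-1) @ F.normalize(en_emb, dim=-1).T
nearest_en_ids = sims.argmax(dim=-1)                                                # closest English token per Dutch token

# --- Step 3: lift relearned small embeddings into the GPT-2 medium space -----
# Fit a linear map W on the shared English vocabulary (small -> medium), then
# apply it to the relearned Dutch embeddings to initialise GPT-2 medium.
medium_en = GPT2LMHeadModel.from_pretrained("gpt2-medium").transformer.wte.weight.data  # (50257, 1024)
W = torch.linalg.lstsq(en_emb, medium_en).solution                                      # (768, 1024)

medium = GPT2LMHeadModel.from_pretrained("gpt2-medium")
medium.resize_token_embeddings(nl_emb.shape[0])
medium.transformer.wte.weight.data.copy_(nl_emb @ W)
```
After this embedding-only stage, the additional full-model finetuning mentioned in the abstract (which helped especially for Dutch) would simply unfreeze all parameters and continue training on the new-language corpus.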
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Data-to-text Generation for Severely Under-Resourced Languages with GPT-3.5: A Bit of Help Needed from Google Translate [5.632410663467911]
We look at how large language models cope with tasks involving languages that are severely under-represented in their training data.
This includes data-to-text generation for Irish, Maltese, Welsh and Breton.
We find that few-shot prompting works better for direct generation into under-resourced languages, but that the difference disappears when pivoting via English.
We conclude that good performance on under-resourced languages can be achieved out of the box with state-of-the-art LLMs.
arXiv Detail & Related papers (2023-08-19T09:19:34Z)
- mGPT: Few-Shot Learners Go Multilingual [1.4354798873010843]
This paper introduces two autoregressive GPT-like models with 1.3 billion and 13 billion parameters trained on 60 languages.
We reproduce the GPT-3 architecture using GPT-2 sources and the sparse attention mechanism.
The resulting models show performance on par with the recently released XGLM models by Facebook.
arXiv Detail & Related papers (2022-04-15T13:02:33Z)
- A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
- Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The approach replaces the shared vocabulary with a small language-specific vocabulary and fine-tunes the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
arXiv Detail & Related papers (2021-10-20T10:38:57Z)
- Multilingual Translation via Grafting Pre-trained Language Models [12.787188625198459]
We propose Graformer to graft separately pre-trained (masked) language models for machine translation.
With monolingual data for pre-training and parallel data for grafting training, we make maximal use of both types of data.
arXiv Detail & Related papers (2021-09-11T10:57:45Z)
- Methods for Detoxification of Texts for the Russian Language [55.337471467610094]
We introduce the first study of automatic detoxification of Russian texts to combat offensive language.
We test two types of models: an unsupervised approach that performs local corrections and a supervised approach based on a pretrained GPT-2 language model.
The results show that the tested approaches can be successfully used for detoxification, although there is room for improvement.
arXiv Detail & Related papers (2021-05-19T10:37:44Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Such pretraining often performs poorly for low-resource and distant language pairs, and previous research has shown that this is because the representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- Improving Language Generation with Sentence Coherence Objective [4.997730662279843]
Existing models are often prone to output paragraphs of texts that gradually diverge from the given prompt.
The goal of our project is to improve the coherence and consistency across sentences in a language-generation model.
arXiv Detail & Related papers (2020-09-07T06:10:03Z)
- Assessing Discourse Relations in Language Generation from GPT-2 [37.30382375828105]
GPT-2 is suited for generation tasks given its left-to-right language modeling objective.
We study the validity of explicit discourse relations in GPT-2's outputs under both organic generation and fine-tuned scenarios.
arXiv Detail & Related papers (2020-04-26T23:29:27Z)