Multilingual Sequence-to-Sequence Models for Hebrew NLP
- URL: http://arxiv.org/abs/2212.09682v1
- Date: Mon, 19 Dec 2022 18:10:23 GMT
- Title: Multilingual Sequence-to-Sequence Models for Hebrew NLP
- Authors: Matan Eyal, Hila Noga, Roee Aharoni, Idan Szpektor, Reut Tsarfaty
- Abstract summary: We show that sequence-to-sequence generative architectures are more suitable for morphologically rich languages (MRLs) such as Hebrew.
We demonstrate that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can leverage powerful multilingual, pretrained sequence-to-sequence models as mT5.
- Score: 16.010560946005473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work attributes progress in NLP to large language models (LMs) with
increased model size and large quantities of pretraining data. Despite this,
current state-of-the-art LMs for Hebrew are both under-parameterized and
under-trained compared to LMs in other languages. Additionally, previous work
on pretrained Hebrew LMs focused on encoder-only models. While the encoder-only
architecture is beneficial for classification tasks, it does not cater well for
sub-word prediction tasks, such as Named Entity Recognition, when considering
the morphologically rich nature of Hebrew. In this paper we argue that
sequence-to-sequence generative architectures are more suitable for LLMs in the
case of morphologically rich languages (MRLs) such as Hebrew. We demonstrate
that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can
leverage powerful multilingual, pretrained sequence-to-sequence models as mT5,
eliminating the need for a specialized, morpheme-based, separately fine-tuned
decoder. Using this approach, our experiments show substantial improvements
over previously published results on existing Hebrew NLP benchmarks. These
results suggest that multilingual sequence-to-sequence models present a
promising building block for NLP for MRLs.
Related papers
- Investigating the translation capabilities of Large Language Models trained on parallel data only [1.5974665548135587]
Large Language Models (LLMs) have demonstrated exceptional proficiency across a broad spectrum of Natural Language Processing (NLP) tasks.
We introduce PLUME, a collection of three 2B LLMs featuring varying vocabulary sizes (32k, 128k, and 256k) trained exclusively on Catalan-centric parallel examples.
These models perform comparably to previous encoder-decoder architectures on 16 supervised translation directions and 56 zero-shot ones.
arXiv Detail & Related papers (2024-06-13T14:08:56Z) - Benchmarking Pre-trained Large Language Models' Potential Across Urdu NLP tasks [0.9786690381850356]
Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research.
This study presents an in-depth examination of prominent LLMs, across 14 tasks using 15 Urdu datasets.
Experiments show that SOTA models surpass all the encoder-decoder pre-trained language models in all Urdu NLP tasks with zero-shot learning.
arXiv Detail & Related papers (2024-05-24T11:30:37Z) - TAMS: Translation-Assisted Morphological Segmentation [3.666125285899499]
We present a sequence-to-sequence model for canonical morpheme segmentation.
Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data.
While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
arXiv Detail & Related papers (2024-03-21T21:23:35Z) - PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for
Translation with Semi-Supervised Pseudo-Parallel Document Generation [5.004814662623874]
This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training.
Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks.
arXiv Detail & Related papers (2023-04-03T18:19:26Z) - Modeling Sequential Sentence Relation to Improve Cross-lingual Dense
Retrieval [87.11836738011007]
We propose a multilingual multilingual language model called masked sentence model (MSM)
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT)
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE)
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.