Explicit Morphological Knowledge Improves Pre-training of Language
Models for Hebrew
- URL: http://arxiv.org/abs/2311.00658v1
- Date: Wed, 1 Nov 2023 17:02:49 GMT
- Title: Explicit Morphological Knowledge Improves Pre-training of Language
Models for Hebrew
- Authors: Eylon Gueta, Omer Goldman, Reut Tsarfaty
- Abstract summary: We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for morphologically rich languages.
We propose various morphologically driven tokenization methods enabling the model to leverage morphological cues beyond raw text.
Our experiments show that morphologically driven tokenization yields improved results compared to standard language-agnostic tokenization.
- Score: 19.4968960182412
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Pre-trained language models (PLMs) have shown remarkable successes in
acquiring a wide range of linguistic knowledge, relying solely on
self-supervised training on text streams. Nevertheless, the effectiveness of
this language-agnostic approach has been frequently questioned for its
sub-optimal performance when applied to morphologically-rich languages (MRLs).
We investigate the hypothesis that incorporating explicit morphological
knowledge in the pre-training phase can improve the performance of PLMs for
MRLs. We propose various morphologically driven tokenization methods enabling
the model to leverage morphological cues beyond raw text. We pre-train multiple
language models utilizing the different methods and evaluate them on Hebrew, a
language with complex and highly ambiguous morphology. Our experiments, on a
benchmark of both semantic and morphological tasks, show that morphologically
driven tokenization yields improved results compared to standard
language-agnostic tokenization. These findings suggest that incorporating morphological
knowledge holds the potential for further improving PLMs for morphologically
rich languages.
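To make the contrast concrete, here is a minimal, hypothetical sketch of one way morphological cues could be injected at tokenization time: each word is first split into morphemes (mocked here by a toy lexicon standing in for a morphological analyzer) and only then segmented into subwords, so that subword boundaries never cross morpheme boundaries. The lexicon, subword vocabulary, and transliterated Hebrew example are illustrative assumptions, not the paper's actual method.

```python
# Toy illustration of morphologically driven vs. language-agnostic tokenization.
# The morpheme lexicon and subword vocabulary are invented for demonstration only.

MORPHEME_LEXICON = {
    # Hypothetical analysis of the Hebrew word "vehabayit" ("and the house"),
    # transliterated for readability: conjunction + definite article + noun.
    "vehabayit": ["ve", "ha", "bayit"],
}

SUBWORD_VOCAB = {"ve", "ha", "bay", "it", "bayit", "veha", "vehab"}


def greedy_subwords(piece: str) -> list[str]:
    """Greedy longest-match segmentation against the toy subword vocabulary."""
    out, i = [], 0
    while i < len(piece):
        for j in range(len(piece), i, -1):
            if piece[i:j] in SUBWORD_VOCAB:
                out.append(piece[i:j])
                i = j
                break
        else:  # no vocabulary match: fall back to a single character
            out.append(piece[i])
            i += 1
    return out


def language_agnostic_tokenize(word: str) -> list[str]:
    # Baseline: subwords are induced directly from the raw surface form.
    return greedy_subwords(word)


def morphologically_driven_tokenize(word: str) -> list[str]:
    # Morphemes are recovered first, so subword boundaries respect them.
    return [sw for m in MORPHEME_LEXICON.get(word, [word]) for sw in greedy_subwords(m)]


if __name__ == "__main__":
    print(language_agnostic_tokenize("vehabayit"))       # ['vehab', 'a', 'y', 'it']
    print(morphologically_driven_tokenize("vehabayit"))  # ['ve', 'ha', 'bayit']
```

In the baseline output the subword "vehab" crosses the boundary between the prefixes and the noun; the morphologically driven variant keeps the conjunction, article, and stem intact, which is the kind of cue the pre-trained model is hypothesized to benefit from.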
Related papers
- Morphological Typology in BPE Subword Productivity and Language Modeling [0.0]
We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized.
Experiments reveal that languages with synthetic features exhibit greater subword regularity and productivity with BPE tokenization.
arXiv Detail & Related papers (2024-10-31T06:13:29Z)
- Linguistically Grounded Analysis of Language Models using Shapley Head Values [2.914115079173979]
We investigate the processing of morphosyntactic phenomena by leveraging a recently proposed method for probing language models via Shapley Head Values (SHVs).
Using the English language BLiMP dataset, we test our approach on two widely used models, BERT and RoBERTa, and compare how linguistic constructions are handled.
Our results show that SHV-based attributions reveal distinct patterns across both models, providing insights into how language models organize and process linguistic information.
arXiv Detail & Related papers (2024-10-17T09:48:08Z)
- Evaluating Morphological Compositional Generalization in Large Language Models [17.507983593566223]
We investigate the morphological generalization abilities of large language models (LLMs) through the lens of compositionality.
We focus on agglutinative languages such as Turkish and Finnish.
Our analysis shows that LLMs struggle with morphological compositional generalization particularly when applied to novel word roots.
While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps compared to humans.
arXiv Detail & Related papers (2024-10-16T15:17:20Z)
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
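A rough sketch of this freeze-and-extend idea, not the paper's actual MoE-CT implementation: a stand-in base encoder is frozen, and its hidden states are routed through a small trainable mixture-of-experts adapter. The encoder, expert sizes, router, and residual wiring below are all assumptions made for illustration.

```python
# Minimal sketch of "freeze the base model, train an appended MoE module".
# The base encoder is a stand-in; all sizes and wiring are illustrative only.
import torch
import torch.nn as nn


class MoEAdapter(nn.Module):
    """Tiny mixture-of-experts block applied on top of frozen hidden states."""

    def __init__(self, hidden: int, num_experts: int = 4, ff: int = 256):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ff), nn.GELU(), nn.Linear(ff, hidden))
            for _ in range(num_experts)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        gates = torch.softmax(self.router(h), dim=-1)                   # (B, T, E)
        expert_out = torch.stack([e(h) for e in self.experts], dim=-2)  # (B, T, E, H)
        mixed = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)          # (B, T, H)
        return h + mixed                                                # residual connection


base = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2
)
for p in base.parameters():       # safeguard high-resource performance:
    p.requires_grad = False       # the original weights are never updated

adapter = MoEAdapter(hidden=128)  # only the appended module is trained
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

x = torch.randn(2, 16, 128)       # dummy batch of token embeddings
out = adapter(base(x))            # frozen forward pass, trainable expansion path
```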
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
- Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition [6.767341847275751]
We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean.
Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs).
Our evaluations show that this technique achieves good performance overall, notably improving results on the syntactic task NIKL-CoLA.
arXiv Detail & Related papers (2023-11-07T12:08:21Z)
- LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
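As a generic illustration of pre-training on linguistic features alongside the original objective, the sketch below shares one encoder between a masked-language-modeling head and auxiliary token-level linguistic-feature heads and sums their losses; the particular feature inventories and the loss weight are assumptions, not LERT's published recipe.

```python
# Sketch of multi-task pre-training: MLM plus auxiliary linguistic-feature heads.
# Feature label inventories and the 0.5 auxiliary weight are assumed for illustration.
import torch
import torch.nn as nn

VOCAB, HIDDEN = 1000, 128
FEATURE_LABELS = {"pos": 17, "ner": 9, "dep": 40}   # assumed label set sizes


class LinguisticallyMotivatedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mlm_head = nn.Linear(HIDDEN, VOCAB)
        self.feature_heads = nn.ModuleDict(
            {name: nn.Linear(HIDDEN, n) for name, n in FEATURE_LABELS.items()}
        )

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.mlm_head(h), {k: head(h) for k, head in self.feature_heads.items()}


model = LinguisticallyMotivatedLM()
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, VOCAB, (2, 16))        # dummy masked inputs
mlm_targets = torch.randint(0, VOCAB, (2, 16))   # dummy original tokens
feat_targets = {k: torch.randint(0, n, (2, 16)) for k, n in FEATURE_LABELS.items()}

mlm_logits, feat_logits = model(tokens)
loss = loss_fn(mlm_logits.reshape(-1, VOCAB), mlm_targets.reshape(-1))
for name, logits in feat_logits.items():         # weighted auxiliary losses
    loss = loss + 0.5 * loss_fn(logits.reshape(-1, logits.size(-1)),
                                feat_targets[name].reshape(-1))
loss.backward()
```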
arXiv Detail & Related papers (2022-11-10T05:09:16Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Morphology Matters: A Multilingual Language Modeling Analysis [8.791030561752384]
Prior studies disagree on whether inflectional morphology makes languages harder to model.
We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.
Several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data.
arXiv Detail & Related papers (2020-12-11T11:55:55Z)
- Morphologically Aware Word-Level Translation [82.59379608647147]
We propose a novel morphologically aware probability model for bilingual lexicon induction.
Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning.
arXiv Detail & Related papers (2020-11-15T17:54:49Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.