Related papers: Morphological Typology in BPE Subword Productivity and Language Modeling

URL: http://arxiv.org/abs/2410.23656v1
Date: Thu, 31 Oct 2024 06:13:29 GMT
Title: Morphological Typology in BPE Subword Productivity and Language Modeling
Authors: Iñigo Parra
Abstract summary: We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized. Experiments reveal that languages with synthetic features exhibit greater subword regularity and productivity with BPE tokenization.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study investigates the impact of morphological typology on tokenization and language modeling performance. We focus on languages with synthetic and analytical morphological structures and examine their productivity when tokenized using the byte-pair encoding (BPE) algorithm. We compare the performance of models trained with similar amounts of data in different languages. Our experiments reveal that languages with synthetic features exhibit greater subword regularity and productivity with BPE tokenization and achieve better results in language modeling tasks. We also observe that the typological continuum from linguistic theory is reflected in several experiments. These findings suggest a correlation between morphological typology and BPE tokenization efficiency.

Related papers

Can Language Models Learn Typologically Implausible Languages? [62.823015163987996]
Grammatical features across human languages show intriguing correlations often attributed to learning biases in humans. We discuss how language models (LMs) allow us to better determine the role of domain-general learning biases in language universals. We test LMs on an array of highly naturalistic but counterfactual versions of the English (head-initial) and Japanese (head-final) languages.
arXiv Detail & Related papers (2025-02-17T20:40:01Z)
Analysis and Visualization of Linguistic Structures in Large Language Models: Neural Representations of Verb-Particle Constructions in BERT [0.0]
This study investigates the internal representations of verb-particle combinations within large language models (LLMs). We analyse the representational efficacy of BERT's layers for various verb-particle constructions such as 'agree on', 'come back', and 'give up'. Results show that BERT's middle layers most effectively capture syntactic structures, with significant variability in representational accuracy across different verb categories.
arXiv Detail & Related papers (2024-12-19T09:21:39Z)
Analyzing The Language of Visual Tokens [48.62180485759458]
We take a natural-language-centric approach to analyzing discrete visual languages. We show that higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts. We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages.
arXiv Detail & Related papers (2024-11-07T18:59:28Z)
Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew [19.4968960182412]
We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for morphologically rich languages. We propose various morphologically driven tokenization methods enabling the model to leverage morphological cues beyond raw text. Our experiments show that morphologically driven tokenization demonstrates improved results compared to a standard language-agnostic tokenization.
arXiv Detail & Related papers (2023-11-01T17:02:49Z)
Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
Work in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative. In this work, we propose to reduce the rigidity of such claims by quantifying morphological typology at the word and segment level. For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study. Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at word (nouns and verbs for English-Turkish,
arXiv Detail & Related papers (2022-05-06T17:04:58Z)
A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space. We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance. We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
On the Transferability of Neural Models of Morphological Analogies [7.89271130004391]
In this paper, we focus on morphological tasks and we propose a deep learning approach to detect morphological analogies. We present an empirical study to see how our framework transfers across languages, and that highlights interesting similarities and differences between these languages. In view of these results, we also discuss the possibility of building a multilingual morphological model.
arXiv Detail & Related papers (2021-08-09T11:08:33Z)
Morphology Matters: A Multilingual Language Modeling Analysis [8.791030561752384]
Prior studies disagree on whether inflectional morphology makes languages harder to model. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. Several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data.
arXiv Detail & Related papers (2020-12-11T11:55:55Z)
Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach [78.77265671634454]
We make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances". Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.
arXiv Detail & Related papers (2020-05-12T15:35:00Z)
Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers. We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models [27.91397366776451]
Training LSTMs on latent structure (MIDI music or Java code) improves test performance on natural language. Experiments on transfer between natural languages controlling for vocabulary overlap show that zero-shot performance on a test language is highly correlated with typological similarity to the training language.
arXiv Detail & Related papers (2020-04-30T06:24:03Z)
Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures. We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
