Effects of sub-word segmentation on performance of transformer language
models
- URL: http://arxiv.org/abs/2305.05480v3
- Date: Thu, 26 Oct 2023 20:36:36 GMT
- Title: Effects of sub-word segmentation on performance of transformer language
models
- Authors: Jue Hou, Anisia Katinskaia, Anh-Duc Vu and Roman Yangarber
- Abstract summary: We compare GPT and BERT models trained with the statistical segmentation algorithm BPE vs. two unsupervised algorithms for morphological segmentation.
We show that training with morphological segmentation allows the LMs to: 1. achieve lower perplexity, 2. converge more efficiently in terms of training time, and 3. achieve equivalent or better evaluation scores on downstream tasks.
- Score: 0.628122931748758
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language modeling is a fundamental task in natural language processing, which
has been thoroughly explored with various architectures and hyperparameters.
However, few studies focus on the effect of sub-word segmentation on the
performance of language models (LMs). In this paper, we compare GPT and BERT
models trained with the statistical segmentation algorithm BPE vs. two
unsupervised algorithms for morphological segmentation -- Morfessor and
StateMorph. We train the models for several languages -- including ones with
very rich morphology -- and compare their performance with different
segmentation algorithms, vocabulary sizes, and model sizes. The results show
that training with morphological segmentation allows the LMs to: 1. achieve
lower perplexity, 2. converge more efficiently in terms of training time, and
3. achieve equivalent or better evaluation scores on downstream tasks. Lastly,
we show 4. that LMs of smaller size using morphological segmentation can
perform comparably to models of larger size trained with BPE -- both in terms
of (1) perplexity and (3) scores on downstream tasks. Points (2) and (4) impact
the sustainability of LMs, since they reduce the model cost: size and
computation time. While (2) reduces cost only in the training phase, (4) does
so also in the inference phase.
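To make the comparison concrete, below is a minimal sketch that trains both a BPE tokenizer and a Morfessor model on the same plain-text corpus and segments a sample word with each. This is an illustration under stated assumptions, not the paper's exact pipeline: the file name corpus.txt and the sample word are placeholders, the open-source Morfessor 2.0 package stands in for the paper's unsupervised morphological segmenters (StateMorph has no widely distributed package), and BPE is trained with the Hugging Face tokenizers library.

    # pip install tokenizers morfessor
    # Sketch: statistical (BPE) vs. morphological (Morfessor) segmentation.
    # "corpus.txt" is a placeholder for any plain-text training corpus.
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    from tokenizers.pre_tokenizers import Whitespace
    import morfessor

    # BPE: greedily merge the most frequent symbol pairs up to a vocabulary budget.
    bpe = Tokenizer(BPE(unk_token="[UNK]"))
    bpe.pre_tokenizer = Whitespace()
    bpe.train(["corpus.txt"], BpeTrainer(vocab_size=8000, special_tokens=["[UNK]"]))
    print("BPE:      ", bpe.encode("unbelievably").tokens)

    # Morfessor: search for morph-like units that minimise a two-part coding cost
    # (lexicon cost + corpus-encoding cost), so segments tend to track morphemes.
    io = morfessor.MorfessorIO()
    model = morfessor.BaselineModel()
    model.load_data(list(io.read_corpus_file("corpus.txt")))
    model.train_batch()
    segments, _cost = model.viterbi_segment("unbelievably")
    print("Morfessor:", segments)

The resulting segments feed the LM exactly as BPE tokens would; the paper's points (1)-(4) concern how this choice of units changes perplexity, convergence, and downstream scores, not the model architecture itself.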
Related papers
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained self-supervised (SSL) and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- Split and Rephrase with Large Language Models [2.499907423888049]
The Split and Rephrase (SPRP) task consists of splitting complex sentences into a sequence of shorter grammatical sentences.
We evaluate large language models on the task, showing that they can provide large improvements over the state of the art on the main metrics.
arXiv Detail & Related papers (2023-12-18T10:16:37Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- MorphPiece: A Linguistic Tokenizer for Large Language Models [3.8073142980733]
I propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text.
A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks.
arXiv Detail & Related papers (2023-07-14T10:35:04Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, we explore how a mixture of programming and natural languages, and multi-epoch training, affect model performance.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
The subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation [0.6091702876917281]
This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting.
We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations from LMVR and MORSEL.
arXiv Detail & Related papers (2021-03-20T14:39:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.