Morfessor EM+Prune: Improved Subword Segmentation with Expectation
Maximization and Pruning
- URL: http://arxiv.org/abs/2003.03131v1
- Date: Fri, 6 Mar 2020 10:58:59 GMT
- Title: Morfessor EM+Prune: Improved Subword Segmentation with Expectation
Maximization and Pruning
- Authors: Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo
- Abstract summary: We discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning.
We show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model.
The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard.
- Score: 14.116412358534442
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data-driven segmentation of words into subword units has been used in various
natural language processing applications such as automatic speech recognition
and statistical machine translation for almost 20 years. Recently it has become
more widely adopted, as models based on deep neural networks often benefit from
subword units even for morphologically simpler languages. In this paper, we
discuss and compare training algorithms for a unigram subword model, based on
the Expectation Maximization algorithm and lexicon pruning. Using English,
Finnish, North Sami, and Turkish data sets, we show that this approach is able
to find better solutions to the optimization problem defined by the Morfessor
Baseline model than its original recursive training algorithm. The improved
optimization also leads to higher morphological segmentation accuracy when
compared to a linguistic gold standard. We publish implementations of the new
algorithms in the widely used Morfessor software package.
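As a rough illustration of the training recipe the paper builds on, the sketch below runs EM over all segmentations of each word under a unigram subword model and then prunes the lexicon. It is not the Morfessor EM+Prune implementation: the real algorithm optimizes the Morfessor Baseline cost, which includes a lexicon prior, and prunes based on that cost, whereas this toy version simply drops the lowest-probability subwords. All function names are illustrative.

```python
from collections import Counter

def init_lexicon(words, max_len=6):
    """Candidate lexicon: every substring up to max_len, frequency-initialized.

    words: dict mapping word -> corpus count.
    """
    counts = Counter()
    for w, c in words.items():
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                counts[w[i:j]] += c
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def expected_counts(word, probs):
    """E-step for one word: forward-backward over all segmentations."""
    n = len(word)
    alpha = [0.0] * (n + 1); alpha[0] = 1.0
    for i in range(1, n + 1):
        alpha[i] = sum(alpha[j] * probs.get(word[j:i], 0.0) for j in range(i))
    beta = [0.0] * (n + 1); beta[n] = 1.0
    for j in range(n - 1, -1, -1):
        beta[j] = sum(probs.get(word[j:i], 0.0) * beta[i] for i in range(j + 1, n + 1))
    out, Z = Counter(), alpha[n]
    if Z == 0.0:
        return out  # word not coverable by the current lexicon
    for j in range(n):
        for i in range(j + 1, n + 1):
            p = probs.get(word[j:i], 0.0)
            if p > 0.0:
                out[word[j:i]] += alpha[j] * p * beta[i] / Z
    return out

def em_prune(words, rounds=5, em_iters=3, prune_frac=0.2):
    """Alternate EM re-estimation with pruning of the weakest subwords."""
    probs = init_lexicon(words)
    for _ in range(rounds):
        for _ in range(em_iters):
            counts = Counter()
            for w, c in words.items():
                for s, e in expected_counts(w, probs).items():
                    counts[s] += c * e
            total = sum(counts.values())
            probs = {s: v / total for s, v in counts.items()}
        # prune multi-character subwords only, so every word stays segmentable
        multi = sorted((s for s in probs if len(s) > 1), key=probs.get)
        for s in multi[: int(len(multi) * prune_frac)]:
            del probs[s]
        total = sum(probs.values())
        probs = {s: p / total for s, p in probs.items()}
    return probs

# toy usage: em_prune({"walking": 4, "walked": 3, "talked": 2})
```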
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose multilingual adaptive gradient-based subword tokenization to reduce over-segmentation.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- Lexically Grounded Subword Segmentation [0.0]
We present three innovations in tokenization and subword segmentation.
First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization.
Second, we present a method for obtaining subword embeddings grounded in a word embedding space.
Third, we introduce an efficient segmentation algorithm based on a subword bigram model.
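The third innovation, segmentation with a subword bigram model, can be pictured as a Viterbi search over (position, previous subword) states. The sketch below is a plausible reconstruction, not the authors' algorithm; the lexicon and the bigram probability table are assumed to be given.

```python
import math

def viterbi_bigram(word, lexicon, bigram_p, bos="<s>", eos="</s>"):
    """Most probable segmentation of `word` under a subword bigram model.

    lexicon: set of known subwords; bigram_p: dict mapping
    (previous subword, subword) -> probability.
    """
    n = len(word)
    best = {(0, bos): (0.0, None)}  # (position, prev subword) -> (logp, backpointer)
    for i in range(n):
        for (pos, prev), (lp, _) in list(best.items()):
            if pos != i:
                continue
            for j in range(i + 1, n + 1):
                seg = word[i:j]
                p = bigram_p.get((prev, seg), 0.0)
                if seg not in lexicon or p <= 0.0:
                    continue
                cand = lp + math.log(p)
                if (j, seg) not in best or cand > best[(j, seg)][0]:
                    best[(j, seg)] = (cand, (i, prev))
    # pick the best complete path, scoring the final transition to eos
    finals = [(lp + math.log(bigram_p[(prev, eos)]), (pos, prev))
              for (pos, prev), (lp, _) in best.items()
              if pos == n and bigram_p.get((prev, eos), 0.0) > 0.0]
    if not finals:
        return None  # word not coverable by the lexicon
    _, state = max(finals)
    segs = []
    while state[1] != bos:  # walk backpointers to the start state
        segs.append(state[1])
        state = best[state][1]
    return list(reversed(segs))

lex = {"un", "do", "ing", "undo"}
probs = {("<s>", "un"): 0.5, ("un", "do"): 0.6, ("do", "ing"): 0.4,
         ("ing", "</s>"): 0.9, ("<s>", "undo"): 0.2, ("undo", "ing"): 0.5}
print(viterbi_bigram("undoing", lex, probs))  # -> ['un', 'do', 'ing']
```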
arXiv Detail & Related papers (2024-06-19T13:48:19Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Tokenization with Factorized Subword Encoding [2.538209532048867]
We propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model.
Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
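To make "factorizing subwords onto discrete triplets" concrete, the toy sketch below maps a subword embedding to a triplet of codebook indices by per-slice nearest-neighbor lookup. The real method trains the encoder and codebooks jointly as a VQ-VAE; the dimensions and codebook sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_triplet(vec, codebooks):
    """Map one embedding to a triplet of discrete codes.

    Each codebook quantizes one third of the vector; the triplet of
    nearest-centroid indices is the factorized code for the subword.
    """
    parts = np.split(vec, 3)
    return tuple(
        int(np.argmin(np.linalg.norm(cb - part, axis=1)))
        for cb, part in zip(codebooks, parts)
    )

# made-up sizes: 48-dim subword embeddings, three codebooks of 256 codes
dim, n_codes = 48, 256
codebooks = [rng.normal(size=(n_codes, dim // 3)) for _ in range(3)]
subword_vec = rng.normal(size=dim)  # stand-in for a learned embedding
print(quantize_triplet(subword_vec, codebooks))  # e.g. (17, 203, 88)
```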
arXiv Detail & Related papers (2023-06-13T13:27:34Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
A subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
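The usual way a model learns to segment while being trained for language modelling is to marginalize over all segmentations of each word with a dynamic program, so the segmentation itself needs no supervision. Below is a minimal sketch of that forward pass, with a stand-in scoring callable where a real SSLM would use its autoregressive network state.

```python
import numpy as np

def marginal_logprob(word, logp_next, max_len=6):
    """log p(word) marginalized over all segmentations (forward pass).

    logp_next(prefix, seg) -> log-probability of `seg` being the next
    subword given the characters already generated; any callable works
    here. Training maximizes this marginal, so segmentation is learned
    implicitly.
    """
    n = len(word)
    alpha = np.full(n + 1, -np.inf)
    alpha[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if np.isfinite(alpha[j]):
                alpha[i] = np.logaddexp(alpha[i], alpha[j] + logp_next(word[:j], word[j:i]))
    return float(alpha[n])

# toy usage, with a constant stand-in score for every subword:
# marginal_logprob("cats", lambda prefix, seg: -np.log(30.0))
```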
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been used to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
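As a reminder of the mechanism being revisited, the classic class-based factorization (shown here in its bigram $n$-gram form, which is only loosely related to the paper's neural parameterization with hypernym classes) predicts a word through its class: $p(w_i \mid w_{i-1}) \approx p(c(w_i) \mid c(w_{i-1})) \, p(w_i \mid c(w_i))$. Because all words of a class share the class transition statistics, sparse word-level contexts are smoothed.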
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made from inputs tokenized by the standard segmentation and by probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
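The shape of such a consistency objective is easy to state: penalize disagreement between the model's predictions on the two views of the same input. The sketch below uses a symmetric KL term; the paper's exact divergence and weighting may differ, so treat this as a schematic rather than a reference implementation.

```python
import numpy as np

def mvr_consistency(p_standard, p_sampled, eps=1e-9):
    """Symmetric KL between predictions on two views of one input.

    p_standard: output probabilities for the deterministically
    tokenized input; p_sampled: for a sampled segmentation. MVR adds a
    term of this shape to the task loss so the model agrees with
    itself across segmentations.
    """
    p = np.asarray(p_standard, dtype=float) + eps
    q = np.asarray(p_sampled, dtype=float) + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

print(mvr_consistency([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # small but nonzero
```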
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
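A compositional output layer of this kind computes each word's output vector from its spelling, so the output layer carries no per-word parameters. The sketch below composes character embeddings with an arbitrary stand-in (a position-weighted sum) where the paper uses a learned composition function; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM = 32
char_emb = {c: rng.normal(size=CHAR_DIM) for c in "abcdefghijklmnopqrstuvwxyz"}

def output_embedding(word):
    """Compose a word's output vector from its character embeddings.

    Because the vector is a pure function of the spelling, model size
    is independent of the vocabulary. A position-weighted sum stands
    in for the learned composition the paper actually uses.
    """
    return np.sum([char_emb[c] / (i + 1) for i, c in enumerate(word)], axis=0)

def word_logits(hidden, candidates):
    """Score candidate words by dot product with their composed vectors."""
    E = np.stack([output_embedding(w) for w in candidates])
    return E @ hidden

print(word_logits(rng.normal(size=CHAR_DIM), ["cat", "cats", "dog"]))
```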
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.