Morfessor EM+Prune: Improved Subword Segmentation with Expectation
Maximization and Pruning
- URL: http://arxiv.org/abs/2003.03131v1
- Date: Fri, 6 Mar 2020 10:58:59 GMT
- Title: Morfessor EM+Prune: Improved Subword Segmentation with Expectation
Maximization and Pruning
- Authors: Stig-Arne Grönroos, Sami Virpioja, Mikko Kurimo
- Abstract summary: We discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning.
We show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model.
The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard.
- Score: 14.116412358534442
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data-driven segmentation of words into subword units has been used in various
natural language processing applications such as automatic speech recognition
and statistical machine translation for almost 20 years. Recently it has become
more widely adopted, as models based on deep neural networks often benefit from
subword units even for morphologically simpler languages. In this paper, we
discuss and compare training algorithms for a unigram subword model, based on
the Expectation Maximization algorithm and lexicon pruning. Using English,
Finnish, North Sami, and Turkish data sets, we show that this approach is able
to find better solutions to the optimization problem defined by the Morfessor
Baseline model than its original recursive training algorithm. The improved
optimization also leads to higher morphological segmentation accuracy when
compared to a linguistic gold standard. We publish implementations of the new
algorithms in the widely used Morfessor software package.
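As a rough illustration of the training recipe the paper builds on, the sketch below runs EM over all segmentations of each word under a unigram subword model and then prunes the lexicon. It is not the Morfessor EM+Prune implementation: the real algorithm optimizes the Morfessor Baseline cost, which includes a lexicon prior, and prunes based on that cost, whereas this toy version simply drops the lowest-probability subwords. All function names are illustrative.

```python
from collections import Counter

def init_lexicon(words, max_len=6):
    """Candidate lexicon: every substring up to max_len, frequency-initialized.

    words: dict mapping word -> corpus count.
    """
    counts = Counter()
    for w, c in words.items():
        for i in range(len(w)):
            for j in range(i + 1, min(i + max_len, len(w)) + 1):
                counts[w[i:j]] += c
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

def expected_counts(word, probs):
    """E-step for one word: forward-backward over all segmentations."""
    n = len(word)
    alpha = [0.0] * (n + 1); alpha[0] = 1.0
    for i in range(1, n + 1):
        alpha[i] = sum(alpha[j] * probs.get(word[j:i], 0.0) for j in range(i))
    beta = [0.0] * (n + 1); beta[n] = 1.0
    for j in range(n - 1, -1, -1):
        beta[j] = sum(probs.get(word[j:i], 0.0) * beta[i] for i in range(j + 1, n + 1))
    out, Z = Counter(), alpha[n]
    if Z == 0.0:
        return out  # word not coverable by the current lexicon
    for j in range(n):
        for i in range(j + 1, n + 1):
            p = probs.get(word[j:i], 0.0)
            if p > 0.0:
                out[word[j:i]] += alpha[j] * p * beta[i] / Z
    return out

def em_prune(words, rounds=5, em_iters=3, prune_frac=0.2):
    """Alternate EM re-estimation with pruning of the weakest subwords."""
    probs = init_lexicon(words)
    for _ in range(rounds):
        for _ in range(em_iters):
            counts = Counter()
            for w, c in words.items():
                for s, e in expected_counts(w, probs).items():
                    counts[s] += c * e
            total = sum(counts.values())
            probs = {s: v / total for s, v in counts.items()}
        # prune multi-character subwords only, so every word stays segmentable
        multi = sorted((s for s in probs if len(s) > 1), key=probs.get)
        for s in multi[: int(len(multi) * prune_frac)]:
            del probs[s]
        total = sum(probs.values())
        probs = {s: p / total for s, p in probs.items()}
    return probs

# toy usage: em_prune({"walking": 4, "walked": 3, "talked": 2})
```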
Related papers
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose multilingual adaptive gradient-based subword tokenization to reduce over-segmentation.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- Lexically Grounded Subword Segmentation [0.0]
We present three innovations in tokenization and subword segmentation.
First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization.
Second, we present a method for obtaining subword embeddings grounded in a word embedding space.
Third, we introduce an efficient segmentation algorithm based on a subword bigram model.
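The third innovation, segmentation with a subword bigram model, can be pictured as a Viterbi search over (position, previous subword) states. The sketch below is a plausible reconstruction, not the authors' algorithm; the lexicon and the bigram probability table are assumed to be given.

```python
import math

def viterbi_bigram(word, lexicon, bigram_p, bos="<s>", eos="</s>"):
    """Most probable segmentation of `word` under a subword bigram model.

    lexicon: set of known subwords; bigram_p: dict mapping
    (previous subword, subword) -> probability.
    """
    n = len(word)
    best = {(0, bos): (0.0, None)}  # (position, prev subword) -> (logp, backpointer)
    for i in range(n):
        for (pos, prev), (lp, _) in list(best.items()):
            if pos != i:
                continue
            for j in range(i + 1, n + 1):
                seg = word[i:j]
                p = bigram_p.get((prev, seg), 0.0)
                if seg not in lexicon or p <= 0.0:
                    continue
                cand = lp + math.log(p)
                if (j, seg) not in best or cand > best[(j, seg)][0]:
                    best[(j, seg)] = (cand, (i, prev))
    # pick the best complete path, scoring the final transition to eos
    finals = [(lp + math.log(bigram_p[(prev, eos)]), (pos, prev))
              for (pos, prev), (lp, _) in best.items()
              if pos == n and bigram_p.get((prev, eos), 0.0) > 0.0]
    if not finals:
        return None  # word not coverable by the lexicon
    _, state = max(finals)
    segs = []
    while state[1] != bos:  # walk backpointers to the start state
        segs.append(state[1])
        state = best[state][1]
    return list(reversed(segs))

lex = {"un", "do", "ing", "undo"}
probs = {("<s>", "un"): 0.5, ("un", "do"): 0.6, ("do", "ing"): 0.4,
         ("ing", "</s>"): 0.9, ("<s>", "undo"): 0.2, ("undo", "ing"): 0.5}
print(viterbi_bigram("undoing", lex, probs))  # -> ['un', 'do', 'ing']
```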
arXiv Detail & Related papers (2024-06-19T13:48:19Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Tokenization with Factorized Subword Encoding [2.538209532048867]
We propose a novel tokenization method that factorizes subwords onto discrete triplets using a VQ-VAE model.
Results indicate that this method is more appropriate and robust for morphological tasks than the commonly used byte-pair encoding (BPE) tokenization algorithm.
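To make "factorizing subwords onto discrete triplets" concrete, the toy sketch below maps a subword embedding to a triplet of codebook indices by per-slice nearest-neighbor lookup. The real method trains the encoder and codebooks jointly as a VQ-VAE; the dimensions and codebook sizes here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_triplet(vec, codebooks):
    """Map one embedding to a triplet of discrete codes.

    Each codebook quantizes one third of the vector; the triplet of
    nearest-centroid indices is the factorized code for the subword.
    """
    parts = np.split(vec, 3)
    return tuple(
        int(np.argmin(np.linalg.norm(cb - part, axis=1)))
        for cb, part in zip(codebooks, parts)
    )

# made-up sizes: 48-dim subword embeddings, three codebooks of 256 codes
dim, n_codes = 48, 256
codebooks = [rng.normal(size=(n_codes, dim // 3)) for _ in range(3)]
subword_vec = rng.normal(size=dim)  # stand-in for a learned embedding
print(quantize_triplet(subword_vec, codebooks))  # e.g. (17, 203, 88)
```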
arXiv Detail & Related papers (2023-06-13T13:27:34Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
A subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
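The usual way a model learns to segment while being trained for language modelling is to marginalize over all segmentations of each word with a dynamic program, so the segmentation itself needs no supervision. Below is a minimal sketch of that forward pass, with a stand-in scoring callable where a real SSLM would use its autoregressive network state.

```python
import numpy as np

def marginal_logprob(word, logp_next, max_len=6):
    """log p(word) marginalized over all segmentations (forward pass).

    logp_next(prefix, seg) -> log-probability of `seg` being the next
    subword given the characters already generated; any callable works
    here. Training maximizes this marginal, so segmentation is learned
    implicitly.
    """
    n = len(word)
    alpha = np.full(n + 1, -np.inf)
    alpha[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if np.isfinite(alpha[j]):
                alpha[i] = np.logaddexp(alpha[i], alpha[j] + logp_next(word[:j], word[j:i]))
    return float(alpha[n])

# toy usage, with a constant stand-in score for every subword:
# marginal_logprob("cats", lambda prefix, seg: -np.log(30.0))
```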
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have long been used to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
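As a reminder of the mechanism being revisited, the classic class-based factorization (shown here in its bigram $n$-gram form, which is only loosely related to the paper's neural parameterization with hypernym classes) predicts a word through its class: $p(w_i \mid w_{i-1}) \approx p(c(w_i) \mid c(w_{i-1})) \, p(w_i \mid c(w_i))$. Because all words of a class share the class transition statistics, sparse word-level contexts are smoothed.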
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made from inputs tokenized by the standard segmentation and by probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
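The shape of such a consistency objective is easy to state: penalize disagreement between the model's predictions on the two views of the same input. The sketch below uses a symmetric KL term; the paper's exact divergence and weighting may differ, so treat this as a schematic rather than a reference implementation.

```python
import numpy as np

def mvr_consistency(p_standard, p_sampled, eps=1e-9):
    """Symmetric KL between predictions on two views of one input.

    p_standard: output probabilities for the deterministically
    tokenized input; p_sampled: for a sampled segmentation. MVR adds a
    term of this shape to the task loss so the model agrees with
    itself across segmentations.
    """
    p = np.asarray(p_standard, dtype=float) + eps
    q = np.asarray(p_sampled, dtype=float) + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

print(mvr_consistency([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # small but nonzero
```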
arXiv Detail & Related papers (2021-03-15T16:07:42Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
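A compositional output layer of this kind computes each word's output vector from its spelling, so the output layer carries no per-word parameters. The sketch below composes character embeddings with an arbitrary stand-in (a position-weighted sum) where the paper uses a learned composition function; all names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_DIM = 32
char_emb = {c: rng.normal(size=CHAR_DIM) for c in "abcdefghijklmnopqrstuvwxyz"}

def output_embedding(word):
    """Compose a word's output vector from its character embeddings.

    Because the vector is a pure function of the spelling, model size
    is independent of the vocabulary. A position-weighted sum stands
    in for the learned composition the paper actually uses.
    """
    return np.sum([char_emb[c] / (i + 1) for i, c in enumerate(word)], axis=0)

def word_logits(hidden, candidates):
    """Score candidate words by dot product with their composed vectors."""
    E = np.stack([output_embedding(w) for w in candidates])
    return E @ hidden

print(word_logits(rng.normal(size=CHAR_DIM), ["cat", "cats", "dog"]))
```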
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.