Lexically Grounded Subword Segmentation
- URL: http://arxiv.org/abs/2406.13560v2
- Date: Thu, 03 Oct 2024 11:17:43 GMT
- Title: Lexically Grounded Subword Segmentation
- Authors: Jindřich Libovický, Jindřich Helcl
- Abstract summary: We present three innovations in tokenization and subword segmentation.
First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization.
Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space.
Third, we introduce an efficient segmentation algorithm based on a subword bigram model.
- Abstract: We present three innovations in tokenization and subword segmentation. First, we propose to use unsupervised morphological analysis with Morfessor as pre-tokenization. Second, we present an algebraic method for obtaining subword embeddings grounded in a word embedding space. Based on that, we design a novel subword segmentation algorithm that uses the embeddings, ensuring that the procedure considers lexical meaning. Third, we introduce an efficient segmentation algorithm based on a subword bigram model that can be initialized with the lexically aware segmentation method to avoid using Morfessor and large embedding tables at inference time. We evaluate the proposed approaches using two intrinsic metrics and measure their performance on two downstream tasks: part-of-speech tagging and machine translation. Our experiments show significant improvements in the morphological plausibility of the segmentation when evaluated using segmentation precision on morpheme boundaries and improved Rényi efficiency in 8 languages. Although the proposed tokenization methods do not have a large impact on automatic translation quality, we observe consistent performance gains in the arguably more morphological task of part-of-speech tagging.
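The third innovation, segmentation with a subword bigram model, can be illustrated with a generic Viterbi search: among all ways of splitting a word into in-vocabulary subwords, pick the one maximizing the bigram log-probability of the subword sequence. This is a minimal sketch, not the paper's implementation; the `bigram_logprob` interface and the `<s>` boundary marker are assumptions for illustration.

```python
def viterbi_segment(word, bigram_logprob, vocab, bos="<s>"):
    """Segment `word` into subwords maximizing the bigram score.

    bigram_logprob(prev, cur) -> log P(cur | prev); `vocab` is the set
    of allowed subwords. Returns None if no segmentation exists.
    """
    n = len(word)
    # best[i] = (score, subword sequence covering word[:i], incl. BOS)
    best = {0: (0.0, [bos])}
    for end in range(1, n + 1):
        for start in range(end):
            if start not in best:
                continue  # prefix word[:start] is not segmentable
            piece = word[start:end]
            if piece not in vocab:
                continue
            score, segs = best[start]
            cand = score + bigram_logprob(segs[-1], piece)
            if end not in best or cand > best[end][0]:
                best[end] = (cand, segs + [piece])
    return best[n][1][1:] if n in best else None  # drop the BOS marker
```

With a toy vocabulary `{"un", "lock", "able"}`, `viterbi_segment("unlockable", ...)` recovers the morpheme-aligned split `["un", "lock", "able"]`. The paper's efficiency claim comes from the fact that this search needs only a bigram table, not Morfessor or an embedding matrix, at inference time.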
Related papers
- Evaluating Subword Tokenization: Alien Subword Composition and OOV Generalization Challenge [10.721272718226848]
We propose a combined intrinsic-extrinsic evaluation framework for subword tokenization.
Intrinsic evaluation is based on our new UniMorph Labeller tool that classifies subword tokenization as either morphological or alien.
Our empirical findings show that the accuracy of UniMorph Labeller is 98%, and that alien tokenization leads to poorer generalizations.
arXiv Detail & Related papers (2024-04-20T06:49:15Z) - An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z) - SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT)
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z) - Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation [65.6736056006381]
We present a multilingual punctuation-agnostic sentence segmentation method covering 85 languages.
Our method outperforms all the prior best sentence-segmentation tools by an average of 6.1% F1 points.
By using our method to match sentence segmentation to the segmentation used during training of MT models, we achieve an average improvement of 2.3 BLEU points.
arXiv Detail & Related papers (2023-05-30T09:49:42Z) - Subword Segmental Machine Translation: Unifying Segmentation and Target Sentence Generation [7.252933737829635]
Subword segmental machine translation (SSMT) learns to segment target sentence words while jointly learning to generate target sentences.
Experiments across 6 translation directions show that SSMT improves chrF scores for morphologically rich agglutinative languages.
arXiv Detail & Related papers (2023-05-11T17:44:29Z) - Learning Context-aware Classifier for Semantic Segmentation [88.88198210948426]
In this paper, contextual hints are exploited via learning a context-aware classifier.
Our method is model-agnostic and can be easily applied to generic segmentation models.
With negligible additional parameters and only +2% inference time, a decent performance gain is achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z) - Neural Token Segmentation for High Token-Internal Complexity [7.569526565230962]
Tokenizing raw texts into word units is an essential pre-processing step for NLP pipelines.
We propose a novel neural segmentation model which combines contextualised token representation and char-level decoding.
Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art.
arXiv Detail & Related papers (2022-03-21T10:07:17Z) - Multi-view Subword Regularization [111.04350390045705]
Multi-view Subword Regularization (MVR) is a method that enforces consistency between predictions made from inputs tokenized with the standard segmentation and with probabilistic segmentations.
Results on the XTREME multilingual benchmark show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
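The consistency objective in MVR-style training can be sketched as a symmetrized KL divergence between the model's output distributions for the two tokenizations of the same input. This is a minimal illustration under that assumption, not MVR's exact loss; the function and argument names are hypothetical.

```python
import math

def consistency_loss(p_standard, p_sampled):
    """Symmetrized KL divergence between two output distributions:
    one from the deterministic tokenization, one from a sampled
    (probabilistic) tokenization of the same input."""
    kl = lambda p, q: sum(pi * math.log(pi / qi)
                          for pi, qi in zip(p, q) if pi > 0)
    return 0.5 * (kl(p_standard, p_sampled) + kl(p_sampled, p_standard))
```

The loss is zero when the two views agree and grows as the predictions diverge, pushing the model toward tokenization-invariant behavior.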
arXiv Detail & Related papers (2021-03-15T16:07:42Z) - Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel hyperplane-based approach for the automatic extraction of domain-specific stop words.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z) - Vietnamese Word Segmentation with SVM: Ambiguity Reduction and Suffix Capture [2.7528170226206443]
We propose two novel approaches to feature extraction: one reduces overlap ambiguity and the other improves prediction of unknown words containing suffixes.
Our proposed method obtains a better F1-score than the prior state-of-the-art methods UETsegmenter and RDRsegmenter.
arXiv Detail & Related papers (2020-06-14T05:19:46Z) - Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning [14.116412358534442]
We discuss and compare training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning.
We show that this approach is able to find better solutions to the optimization problem defined by the Morfessor Baseline model.
The improved optimization also leads to higher morphological segmentation accuracy when compared to a linguistic gold standard.
arXiv Detail & Related papers (2020-03-06T10:58:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.