CompoundPiece: Evaluating and Improving Decompounding Performance of
Language Models
- URL: http://arxiv.org/abs/2305.14214v2
- Date: Mon, 23 Oct 2023 11:17:53 GMT
- Title: CompoundPiece: Evaluating and Improving Decompounding Performance of
Language Models
- Authors: Benjamin Minixhofer, Jonas Pfeiffer, Ivan Vulić
- Abstract summary: We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While many languages have processes for joining two or more words into
compound words, previous studies have typically been limited to languages
with highly productive compound formation (e.g., German, Dutch), and there
is no public dataset containing compound and non-compound words across a
large number of languages. In this work, we systematically study decompounding, the
task of splitting compound words into their constituents, at a wide scale. We
first address the data gap by introducing a dataset of 255k compound and
non-compound words across 56 diverse languages obtained from Wiktionary. We
then use this dataset to evaluate an array of Large Language Models (LLMs) on
the decompounding task. We find that LLMs perform poorly, especially on words
which are tokenized unfavorably by subword tokenization. We thus introduce a
novel methodology to train dedicated models for decompounding. The proposed
two-stage procedure relies on a fully self-supervised objective in the first
stage, while the second, supervised learning stage optionally fine-tunes the
model on the annotated Wiktionary data. Our self-supervised models outperform
the prior best unsupervised decompounding models by 13.9% accuracy on average.
Our fine-tuned models outperform all prior (language-specific) decompounding
tools. Furthermore, we use our models to leverage decompounding during the
creation of a subword tokenizer, which we refer to as CompoundPiece.
CompoundPiece tokenizes compound words more favorably on average, leading to
improved performance on decompounding over an otherwise equivalent model using
SentencePiece tokenization.
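
The abstract does not spell out how decompounding is posed to the models. A common way to frame it, sketched below, is as a text-to-text task whose input is a compound word and whose target is its space-separated constituents; the snippet also shows how an off-the-shelf SentencePiece-based tokenizer segments a German compound, illustrating the kind of unfavorable splits mentioned above. This is a minimal sketch, not the paper's released code: google/byt5-small and xlm-roberta-base are stand-in checkpoints, and without fine-tuning on decompounding data the generation is not meaningful.

```python
# Minimal sketch (not the paper's released code): decompounding framed as a
# text-to-text task, plus an illustration of how an off-the-shelf
# SentencePiece-based tokenizer segments a German compound.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# (1) Subword segmentation of a compound with a standard SentencePiece
# tokenizer; the segments need not align with the constituents Abend + Brot.
sp_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(sp_tok.tokenize("Abendbrot"))

# (2) Text-to-text framing: input is the compound, target is the
# space-separated constituents ("Abendbrot" -> "Abend Brot").
# google/byt5-small is only a stand-in backbone; without fine-tuning on
# decompounding data its generations are not meaningful.
tok = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

inputs = tok("Abendbrot", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tok.decode(outputs[0], skip_special_tokens=True))
```

A byte- or character-level backbone is a natural fit for this framing, since the target constituents need not correspond to in-vocabulary subwords of a fixed tokenizer.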
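
Likewise, the abstract only outlines the idea of leveraging decompounding during tokenizer creation. A heavily simplified sketch of that idea is given below, under the assumption that compounds in the tokenizer's training corpus are pre-split by some decompounding function before a SentencePiece vocabulary is learned; the `decompound` lookup, file names, and hyperparameters are placeholders, and this is not the CompoundPiece procedure itself.

```python
# Simplified sketch of leveraging decompounding when building a subword
# tokenizer: compounds in the training corpus are pre-split before a
# SentencePiece vocabulary is learned. This illustrates the idea only; it is
# not the CompoundPiece procedure, and `decompound`, the file names, and the
# hyperparameters below are placeholders.
import sentencepiece as spm


def decompound(word: str) -> list[str]:
    # Stand-in for a real decompounding model; a tiny lookup for illustration.
    toy = {"Abendbrot": ["Abend", "brot"], "Baumhaus": ["Baum", "haus"]}
    return toy.get(word, [word])


# Write a version of the corpus in which compounds are split into constituents.
with open("corpus.txt", encoding="utf-8") as src, \
        open("corpus.presplit.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(" ".join(p for w in line.split() for p in decompound(w)) + "\n")

# Learn the subword vocabulary from the pre-split text.
spm.SentencePieceTrainer.train(
    input="corpus.presplit.txt",
    model_prefix="compound_aware",
    vocab_size=8000,
    model_type="unigram",
)
```

A vocabulary learned from pre-split text is less likely to contain subwords that straddle constituent boundaries, which is the property the abstract attributes to CompoundPiece.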
Related papers
- Ensemble Transfer Learning for Multilingual Coreference Resolution
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models
We propose a new metric for measuring clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Grounded Compositional Outputs for Adaptive Language Modeling
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model whose size does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic
We present a character-based BiLSTM model for splitting Icelandic compound words.
We show how varying amounts of training data affect the performance of the model.
arXiv Detail & Related papers (2020-04-16T17:11:02Z)