Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic
- URL: http://arxiv.org/abs/2004.07776v1
- Date: Thu, 16 Apr 2020 17:11:02 GMT
- Title: Kvistur 2.0: a BiLSTM Compound Splitter for Icelandic
- Authors: Jón Friðrik Daðason, David Erik Mollberg, Hrafn Loftsson, Kristín Bjarnadóttir
- Abstract summary: We present a character-based BiLSTM model for splitting Icelandic compound words.
We show how varying amounts of training data affect the performance of the model.
- Score: 2.6763498831034043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a character-based BiLSTM model for splitting
Icelandic compound words, and show how varying amounts of training data affect
the performance of the model. Compounding is highly productive in Icelandic,
and new compounds are constantly being created. This results in a large number
of out-of-vocabulary (OOV) words, negatively impacting the performance of many
NLP tools. Our model is trained on a dataset of 2.9 million unique word forms
and their constituent structures from the Database of Icelandic Morphology. The
model learns how to split compound words into two parts and can be used to
derive the constituent structure of any word form. Knowing the constituent
structure of a word form makes it possible to generate the optimal split for a
given task, e.g., a full split for subword tokenization, or, in the case of
part-of-speech tagging, splitting an OOV word until the largest known
morphological head is found. The model outperforms other previously published
methods when evaluated on a corpus of manually split word forms. This method
has been integrated into Kvistur, an Icelandic compound word analyzer.
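To make the two uses of the binary splits concrete, here is a minimal sketch, not the published Kvistur code: `split_once` is a hypothetical stand-in for the BiLSTM model, returning the (modifier, head) pair of a compound, or None for a simplex word form.

```python
# A minimal sketch, not the published Kvistur implementation.
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical model interface: (modifier, head) for a compound, None otherwise.
SplitFn = Callable[[str], Optional[Tuple[str, str]]]

def full_split(word: str, split_once: SplitFn) -> List[str]:
    """Split recursively until no constituent splits further
    (e.g. for subword tokenization)."""
    parts = split_once(word)
    if parts is None:
        return [word]
    modifier, head = parts
    return full_split(modifier, split_once) + full_split(head, split_once)

def split_to_known_head(word: str, split_once: SplitFn,
                        known: Set[str]) -> str:
    """Split an OOV word until the largest known morphological head
    (the rightmost constituent) is found (e.g. for PoS tagging)."""
    if word in known:
        return word
    parts = split_once(word)
    if parts is None:
        return word  # simplex but unknown; nothing left to split
    _, head = parts
    return split_to_known_head(head, split_once, known)
```

A full split recurses on both constituents, while the PoS-tagging variant only follows the head, stopping at the first (largest) constituent found in the known vocabulary.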
Related papers
- Unsupervised Morphological Tree Tokenizer [36.584680344291556]
We introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words.
Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named "Overriding" to ensure the indecomposability of morphemes.
Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner.
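A hedged sketch of what top-down tokenization by vocabulary matching can look like, assuming a word's induced binary structure is available as nested tuples; the tree, vocabulary, and function names below are illustrative, not the paper's implementation.

```python
# Each node is (span, left_child, right_child); leaves have None children.
def tokenize_top_down(node, vocab, out):
    span, left, right = node
    if span in vocab or left is None:
        out.append(span)  # vocabulary match, or an unsplittable leaf
        return
    # No match at this level: descend into the induced constituents.
    tokenize_top_down(left, vocab, out)
    tokenize_top_down(right, vocab, out)

def tokenize(word_tree, vocab):
    out = []
    tokenize_top_down(word_tree, vocab, out)
    return out

# A toy induced structure for "unlockable":
tree = ("unlockable",
        ("un", None, None),
        ("lockable", ("lock", None, None), ("able", None, None)))
print(tokenize(tree, {"un", "lock", "able"}))  # ['un', 'lock', 'able']
```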
arXiv Detail & Related papers (2024-06-21T15:35:49Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
A subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
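As an illustration of the bitext-retrieval measure, a generic sketch (the setup is assumed, not taken from the paper): alignment is scored as the fraction of source sentences whose nearest target embedding, by cosine similarity, is their actual translation.

```python
import numpy as np

def retrieval_accuracy(src: np.ndarray, tgt: np.ndarray) -> float:
    """src[i] and tgt[i] embed the two sides of sentence pair i; returns
    the fraction of sources whose nearest target is their translation."""
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt_n = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    nearest = (src_n @ tgt_n.T).argmax(axis=1)  # cosine nearest neighbour
    return float((nearest == np.arange(len(src))).mean())
```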
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topic models trained with merged tokens produce topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Robust and Consistent Estimation of Word Embedding for Bangla Language by fine-tuning Word2Vec Model [1.2691047660244335]
We analyze the word2vec model for learning word vectors and present the most effective word embedding for the Bangla language.
We cluster the word vectors to examine the relational similarity of words for intrinsic evaluation, and use different word embeddings as features of news articles for extrinsic evaluation.
arXiv Detail & Related papers (2020-10-26T08:00:48Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
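One way to picture an output layer whose parameter count is independent of the vocabulary is the sketch below, assuming PyTorch; it illustrates the general idea of compositional output embeddings, not the paper's architecture: each candidate word's output embedding is composed from its spelling by a small character encoder, so there are no per-word output parameters.

```python
import torch
import torch.nn as nn

class CompositionalOutput(nn.Module):
    """Scores candidate words from embeddings composed out of characters."""
    def __init__(self, n_chars: int, char_dim: int, hidden_dim: int):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.encoder = nn.GRU(char_dim, hidden_dim, batch_first=True)

    def word_embeddings(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (n_words, max_len); the final GRU state embeds each word.
        # (Padding is ignored here for brevity.)
        _, h = self.encoder(self.char_emb(char_ids))
        return h[-1]  # (n_words, hidden_dim)

    def forward(self, state: torch.Tensor, char_ids: torch.Tensor):
        # Logits = dot product of the LM state with every composed embedding.
        return state @ self.word_embeddings(char_ids).t()  # (batch, n_words)
```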
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Morphological Skip-Gram: Using morphological knowledge to improve word representation [2.0129974477913457]
We propose a new method for training word embeddings by replacing the FastText bag of character n-grams with a bag of word morphemes.
The results show a competitive performance compared to FastText.
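The change relative to FastText can be sketched in a few lines, assuming pre-trained vectors and an external morphological segmenter (all names here are illustrative, not the paper's code): the word representation becomes the word vector plus the sum of its morpheme vectors, rather than its character n-gram vectors.

```python
import numpy as np

def word_representation(word, morphemes, word_vecs, morph_vecs, dim=100):
    """FastText-style composition with a bag of morphemes in place of the
    bag of character n-grams: word vector + sum of morpheme vectors."""
    vec = np.array(word_vecs.get(word, np.zeros(dim)), copy=True)
    # e.g. a hypothetical segmentation: "unhappiness" -> ["un", "happy", "ness"]
    for m in morphemes:
        vec = vec + morph_vecs.get(m, np.zeros(dim))
    return vec
```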
arXiv Detail & Related papers (2020-07-20T12:47:36Z)
- Multidirectional Associative Optimization of Function-Specific Word Representations [86.87082468226387]
We present a neural framework for learning associations between interrelated groups of words.
Our model induces a joint function-specific word vector space, where vectors of, e.g., plausible SVO compositions lie close together.
The model retains information about word group membership even in the joint space, and can thereby be applied effectively to a number of tasks that reason over the SVO structure.
arXiv Detail & Related papers (2020-05-11T17:07:20Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.