Context based lemmatizer for Polish language
- URL: http://arxiv.org/abs/2207.11565v1
- Date: Sat, 23 Jul 2022 18:02:16 GMT
- Title: Context based lemmatizer for Polish language
- Authors: Michal Karwatowski and Marcin Pietron
- Abstract summary: Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item.
In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning.
The model achieves the best results for the Polish language lemmatisation process.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Lemmatization is the process of grouping together the inflected forms of a
word so they can be analysed as a single item, identified by the word's lemma,
or dictionary form. In computational linguistics, lemmatisation is the
algorithmic process of determining the lemma of a word based on its intended
meaning. Unlike stemming, lemmatisation depends on correctly identifying the
intended part of speech and meaning of a word in a sentence, as well as within
the larger context surrounding that sentence. As a result, developing an efficient
lemmatisation algorithm is a complex task. In recent years, deep learning models
used for this task have been observed to outperform other methods, including
machine learning algorithms. In this paper, a Polish lemmatizer based on the
Google T5 model is presented. The training was run with different context
lengths. The model achieves the best results for the Polish language
lemmatisation process.
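As a rough illustration of what training "with different context lengths" could look like for a text-to-text model such as T5, the sketch below builds (source, target) string pairs in which the word to lemmatize is accompanied by a configurable number of neighbouring tokens. The function name `build_example`, the `lemmatize:` / `context:` prompt layout, and the sample sentence are assumptions made for illustration only; the abstract does not specify the input format the authors actually used.

```python
from typing import List, Tuple


def build_example(
    tokens: List[str], index: int, lemma: str, context_len: int
) -> Tuple[str, str]:
    """Build one hypothetical text-to-text pair for the token at `index`.

    `context_len` is the number of neighbouring tokens kept on each side.
    The prompt layout is an assumption, not the format used in the paper.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("index is outside the sentence")
    if context_len < 0:
        raise ValueError("context_len must be non-negative")

    # Slice the window; max() keeps the left bound inside the sentence.
    left = tokens[max(0, index - context_len):index]
    right = tokens[index + 1:index + 1 + context_len]
    context = " ".join(left + [tokens[index]] + right)

    source = f"lemmatize: {tokens[index]} context: {context}"
    return source, lemma


# Illustrative Polish sentence ("Ala has two cats") with its lemmas.
sentence = ["Ala", "ma", "dwa", "koty"]
lemmas = ["Ala", "mieć", "dwa", "kot"]

# The same target word with three different context lengths.
for n in (0, 1, 2):
    print(build_example(sentence, 3, lemmas[3], n))
```

With a context length of 0 the model would see only the inflected form, while larger values expose more of the sentence, which is the kind of trade-off a comparison across context lengths would measure.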
Related papers
- Pixel Sentence Representation Learning [67.4775296225521]
In this work, we conceptualize the learning of sentence-level textual semantics as a visual representation learning process.
We employ visually-grounded text perturbation methods like typos and word order shuffling, resonating with human cognitive patterns, and enabling perturbation to be perceived as continuous.
Our approach is further bolstered by large-scale unsupervised topical alignment training and natural language inference supervision.
arXiv Detail & Related papers (2024-02-13T02:46:45Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- The boundaries of meaning: a case study in neural machine translation [0.0]
Subword segmentation algorithms have been widely employed in language modeling, machine translation, and other tasks since 2016.
These algorithms often cut words into semantically opaque pieces, such as 'period', 'on', 't', and 'ist'.
arXiv Detail & Related papers (2022-10-02T20:26:20Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- DEIM: An effective deep encoding and interaction model for sentence matching [0.0]
We propose a sentence matching method based on deep encoding and interaction to extract deep semantic information.
In the encoder layer, we refer to the information of another sentence in the process of encoding a single sentence, and later use an algorithm to fuse the information.
In the interaction layer, we use a bidirectional attention mechanism and a self-attention mechanism to obtain deep semantic information.
arXiv Detail & Related papers (2022-03-20T07:59:42Z)
- Studying word order through iterative shuffling [14.530986799844873]
We show that word order encodes meaning essential to performing NLP benchmark tasks.
We use IBIS, a novel, efficient procedure that finds the ordering of a bag of words having the highest likelihood under a fixed language model.
We discuss how shuffling inference procedures such as IBIS can benefit language modeling and constrained generation.
arXiv Detail & Related papers (2021-09-10T13:27:06Z)
- Generalized Optimal Linear Orders [9.010643838773477]
The sequential structure of language, and the order of words in a sentence specifically, plays a central role in human language processing.
In designing computational models of language, the de facto approach is to present sentences to machines with the words ordered in the same order as in the original human-authored sentence.
The very essence of this work is to question the implicit assumption that this is desirable and inject theoretical soundness into the consideration of word order in natural language processing.
arXiv Detail & Related papers (2021-08-13T13:10:15Z)
- SLAM-Inspired Simultaneous Contextualization and Interpreting for Incremental Conversation Sentences [0.0]
We propose a method to dynamically estimate the context and interpretations of polysemous words in sequential sentences.
By using the SCAIN algorithm, we can sequentially optimize the interdependence between context and word interpretation while obtaining new interpretations online.
arXiv Detail & Related papers (2020-05-29T16:40:27Z)
- Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition.
We show that 83.7% of test instances do not require reasoning on linguistic structure.
We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
- Consistency of a Recurrent Language Model With Respect to Incomplete Decoding [67.54760086239514]
We study the issue of receiving infinite-length sequences from a recurrent language model.
We propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model.
arXiv Detail & Related papers (2020-02-06T19:56:15Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)