Enhancing Sequence-to-Sequence Neural Lemmatization with External
Resources
- URL: http://arxiv.org/abs/2101.12056v1
- Date: Thu, 28 Jan 2021 15:14:20 GMT
- Title: Enhancing Sequence-to-Sequence Neural Lemmatization with External
Resources
- Authors: Kirill Milintsevich and Kairit Sirts
- Abstract summary: We propose a novel hybrid approach to lemmatization that enhances the seq2seq neural model with additional lemmas extracted from an external lexicon or a rule-based system.
During training, the enhanced lemmatizer learns both to generate lemmas via a sequential decoder and to copy lemma characters from the external candidates supplied at run time.
Our lemmatizer enhanced with candidates extracted from the Apertium morphological analyzer achieves statistically significant improvements compared to baseline models not utilizing additional lemma information.
- Score: 0.6726255259929496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel hybrid approach to lemmatization that enhances the seq2seq
neural model with additional lemmas extracted from an external lexicon or a
rule-based system. During training, the enhanced lemmatizer learns both to
generate lemmas via a sequential decoder and to copy lemma characters from the
external candidates supplied at run time. Our lemmatizer enhanced with
candidates extracted from the Apertium morphological analyzer achieves
statistically significant improvements compared to baseline models not
utilizing additional lemma information, and achieves an average accuracy of 97.25%
on a set of 23 UD languages, which is 0.55% higher than that obtained with the
Stanford Stanza model on the same set of languages. We also compare with other
methods of integrating external data into lemmatization and show that our
enhanced system performs considerably better than a simple lexicon extension
method based on the Stanza system, and it achieves complementary improvements
w.r.t. the data augmentation method.
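
The approach described above combines character-level sequence generation with copying from externally supplied lemma candidates. Below is a minimal, hypothetical PyTorch sketch of one such decoder step, assuming a pointer-generator-style copy mechanism; the class, parameter names, and architectural details are illustrative assumptions and do not reproduce the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyEnhancedDecoderStep(nn.Module):
    """One decoder step that mixes a generation distribution over the character
    vocabulary with a copy distribution over the characters of external lemma
    candidates (pointer-generator style). Illustrative sketch only."""

    def __init__(self, vocab_size, hidden_dim, emb_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.GRUCell(emb_dim, hidden_dim)
        self.gen_proj = nn.Linear(hidden_dim, vocab_size)   # generation scores
        self.copy_attn = nn.Linear(hidden_dim, hidden_dim)   # attention over candidate chars
        self.gate = nn.Linear(hidden_dim, 1)                  # copy-vs-generate gate

    def forward(self, prev_char, hidden, cand_states, cand_ids):
        # prev_char:   (batch,)                  previous output character ids
        # hidden:      (batch, hidden_dim)       decoder state
        # cand_states: (batch, cand_len, hidden_dim) encoded candidate characters
        # cand_ids:    (batch, cand_len), int64  vocabulary ids of candidate characters
        hidden = self.cell(self.embed(prev_char), hidden)

        # Generation distribution over the full character vocabulary.
        p_gen = F.softmax(self.gen_proj(hidden), dim=-1)

        # Copy distribution: attention over candidate characters, scattered
        # back into vocabulary space.
        attn_scores = torch.bmm(cand_states, self.copy_attn(hidden).unsqueeze(-1)).squeeze(-1)
        attn = F.softmax(attn_scores, dim=-1)                       # (batch, cand_len)
        p_copy = torch.zeros_like(p_gen).scatter_add(1, cand_ids, attn)

        # Learned gate interpolates between generating and copying.
        g = torch.sigmoid(self.gate(hidden))                        # (batch, 1)
        p_out = g * p_gen + (1.0 - g) * p_copy
        return p_out, hidden

# Example with hypothetical shapes: batch of 2, 6 candidate characters, 40-character vocab.
step = CopyEnhancedDecoderStep(vocab_size=40, hidden_dim=64, emb_dim=32)
p_out, h = step(
    prev_char=torch.zeros(2, dtype=torch.long),
    hidden=torch.zeros(2, 64),
    cand_states=torch.randn(2, 6, 64),
    cand_ids=torch.randint(0, 40, (2, 6)),
)
```

In this sketch, the candidate characters would come from an external lexicon or a rule-based analyzer such as Apertium, encoded at the character level and supplied alongside the input word form.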
Related papers
- GliLem: Leveraging GliNER for Contextualized Lemmatization in Estonian [0.21485350418225246]
We present GliLem, a novel hybrid lemmatization system for Estonian.
We leverage the flexibility of a pre-trained GliNER model to improve the lemmatization accuracy of Vabamorf.
arXiv Detail & Related papers (2024-12-29T22:02:00Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Entropy-Based Decoding for Retrieval-Augmented Large Language Models [43.93281157539377]
Augmenting Large Language Models with retrieved external knowledge has proven effective for improving the factual accuracy of generated responses.
We introduce a novel, training-free decoding method guided by entropy considerations to mitigate this issue.
arXiv Detail & Related papers (2024-06-25T12:59:38Z)
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605]
We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
We evaluate a diverse set of neural sequence models on regular ICLL tasks.
arXiv Detail & Related papers (2024-01-23T18:59:21Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- Data Augmentation for Neural Machine Translation using Generative Language Model [1.5500145658862499]
The scarcity of large parallel corpora remains the main bottleneck in Neural Machine Translation.
Data augmentation is a technique that enhances the performance of data-hungry models by generating synthetic data instead of collecting new data.
We explore prompt-based data augmentation approaches that leverage large-scale language models such as ChatGPT.
arXiv Detail & Related papers (2023-07-26T02:12:58Z)
- External Language Model Integration for Factorized Neural Transducers [7.5969913968845155]
We propose an adaptation method for factorized neural transducers (FNT) with external language models.
We show average gains of 18% WERR with lexical adaptation across various scenarios and additive gains of up to 60% WERR in one entity-rich scenario.
arXiv Detail & Related papers (2023-05-26T23:30:21Z)
- Nearest Neighbor Zero-Shot Inference [68.56747574377215]
kNN-Prompt is a technique that uses k-nearest neighbor (kNN) retrieval augmentation for zero-shot inference with language models (LMs).
Fuzzy verbalizers leverage the sparse kNN distribution for downstream tasks by automatically associating each classification label with a set of natural language tokens.
Experiments show that kNN-Prompt is effective for domain adaptation with no further training, and that the benefits of retrieval increase with the size of the model used for kNN retrieval.
arXiv Detail & Related papers (2022-05-27T07:00:59Z)
- Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation [32.885722714728765]
We investigate data augmentation techniques for code-switching (CS) NLP systems.
We perform lexical replacements using word-aligned parallel corpora.
We compare these approaches against dictionary-based replacements.
arXiv Detail & Related papers (2022-05-25T10:44:36Z)
- Exploiting Language Model for Efficient Linguistic Steganalysis: An Empirical Study [23.311007481830647]
We present two methods for efficient linguistic steganalysis.
One is to pre-train an RNN-based language model, and the other is to pre-train a sequence autoencoder.
arXiv Detail & Related papers (2021-07-26T12:37:18Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Discrete Variational Attention Models for Language Generation [51.88612022940496]
We propose a discrete variational attention model with a categorical distribution over the attention mechanism, owing to the discrete nature of language.
Thanks to the property of discreteness, the training of our proposed approach does not suffer from posterior collapse.
arXiv Detail & Related papers (2020-04-21T05:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.