Lexical Normalization for Code-switched Data and its Effect on
POS-tagging
- URL: http://arxiv.org/abs/2006.01175v2
- Date: Sun, 31 Jan 2021 20:57:52 GMT
- Title: Lexical Normalization for Code-switched Data and its Effect on
POS-tagging
- Authors: Rob van der Goot, Özlem Çetinoğlu
- Abstract summary: We propose three normalization models specifically designed to handle code-switched data.
For Tr-De, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset.
Results show that our CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models.
- Score: 8.875272663730868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lexical normalization, the translation of non-canonical data to standard
language, has been shown to improve the performance of many natural language
processing tasks on social media. Yet, using multiple languages in one
utterance, also called code-switching (CS), is frequently overlooked by these
normalization systems, despite its common use in social media. In this paper,
we propose three normalization models specifically designed to handle
code-switched data which we evaluate for two language pairs: Indonesian-English
(Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel
normalization layers and their corresponding language ID and POS tags for the
dataset, and evaluate the downstream effect of normalization on POS tagging.
Results show that our CS-tailored normalization models outperform the Id-En state
of the art and Tr-De monolingual models, and lead to a 5.4% relative performance
increase for POS tagging compared to unnormalized input.
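A back-of-the-envelope sketch of the setup described above: normalization runs before POS tagging, and the gain is reported relative to tagging the raw input. The lexicon in `normalize` below is an invented toy example, not one of the paper's three models.

```python
# Hedged sketch of the normalize-then-tag pipeline; `normalize` is a
# hypothetical placeholder standing in for the paper's actual models.
def normalize(tokens: list[str]) -> list[str]:
    # e.g. map "tmrw" -> "tomorrow"; real models are lexicon- or seq2seq-based
    toy_lexicon = {"tmrw": "tomorrow", "gonna": "going to"}
    return [toy_lexicon.get(t, t) for t in tokens]

def accuracy(pred: list[str], gold: list[str]) -> float:
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def relative_gain(acc_normalized: float, acc_raw: float) -> float:
    # The paper reports a 5.4% relative increase for Tr-De POS tagging
    return 100.0 * (acc_normalized - acc_raw) / acc_raw
```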
Related papers
- Neural Text Normalization for Luxembourgish using Real-Life Variation Data [21.370964546752294]
We propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures.
We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.
arXiv Detail & Related papers (2024-12-12T15:50:55Z)
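For the Luxembourgish paper above, a minimal sketch of what byte-level sequence-to-sequence normalization with ByT5 looks like; `google/byt5-small` is the public base checkpoint, not the paper's fine-tuned model.

```python
# Minimal ByT5 inference sketch; the base checkpoint is untrained for
# normalization, so a real system would first fine-tune it on
# (non-standard, standard) sentence pairs.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/byt5-small")

def normalize(sentence: str) -> str:
    # ByT5 works directly on UTF-8 bytes, which suits spelling variation well
    inputs = tokenizer(sentence, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```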
- Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders, fine-tuned with a supervised contrastive loss on NLI data.
Our results show that NLI fine-tuning improves performance on both tasks and in both languages, suggesting it can benefit both mono- and multilingual models.
arXiv Detail & Related papers (2023-08-06T12:40:58Z)
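One common way to realize the supervised contrastive fine-tuning on NLI data described above is a multiple-negatives ranking loss over entailment pairs; the sketch below is that generic recipe, not necessarily the paper's exact loss.

```python
# Contrastive loss over NLI entailment pairs: the i-th premise should be
# closest to its own hypothesis, with other in-batch hypotheses as negatives.
import torch
import torch.nn.functional as F

def nli_contrastive_loss(premise_emb: torch.Tensor,
                         hypothesis_emb: torch.Tensor,
                         temperature: float = 0.05) -> torch.Tensor:
    p = F.normalize(premise_emb, dim=-1)
    h = F.normalize(hypothesis_emb, dim=-1)
    logits = p @ h.t() / temperature                  # (batch, batch) similarities
    targets = torch.arange(p.size(0), device=p.device)
    return F.cross_entropy(logits, targets)
```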
- Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities [36.578851892373365]
Social media has provided linguistically under-represented communities with an extraordinary opportunity to create content in their native languages.
This paper addresses the problem of script normalization for several such languages that are mainly written in a Perso-Arabic script.
Using synthetic data with various levels of noise and a transformer-based model, we demonstrate that the problem can be effectively remediated.
arXiv Detail & Related papers (2023-05-25T18:18:42Z)
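The "synthetic data with various levels of noise" idea from the script-normalization paper above can be sketched as character-level perturbation; the confusion map below is an invented toy example of Perso-Arabic letter variants, not the paper's actual noise model.

```python
# Toy noise injector: randomly swap characters for confusable variants to
# create synthetic unconventional spellings. CONFUSIONS is illustrative only.
import random

CONFUSIONS = {"ی": ["ي", "ى"], "ک": ["ك"], "ه": ["ة"]}

def add_script_noise(text: str, noise_level: float = 0.3) -> str:
    chars = []
    for ch in text:
        if ch in CONFUSIONS and random.random() < noise_level:
            chars.append(random.choice(CONFUSIONS[ch]))
        else:
            chars.append(ch)
    return "".join(chars)
```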
- Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages [18.210880703295253]
We fine-tune pretrained language models (PLMs) on seven languages from three different families.
We analyze their zero-shot performance on closely related, non-standardized varieties.
Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data is the strongest predictor for model performance on target data.
arXiv Detail & Related papers (2023-04-20T08:32:34Z)
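The predictor the tokenization study highlights is easy to measure; the sketch below computes the percentage of words a PLM tokenizer splits into multiple subwords, using mBERT as an assumed example model.

```python
# Compute the share of words split into >1 subword by a PLM tokenizer;
# per the paper, matching this rate between source and target data is the
# strongest predictor of zero-shot transfer quality.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def split_word_rate(words: list[str]) -> float:
    split = sum(1 for w in words if len(tokenizer.tokenize(w)) > 1)
    return 100.0 * split / len(words)

# e.g. compare split_word_rate(source_words) with split_word_rate(target_words)
```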
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
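CROP's labeled-sequence-translation step can be pictured as wrapping tagged spans in markers before translation and reading them back off afterwards; `translate` would be a multilingual MT model, and the marker scheme below is an illustration rather than CROP's exact format.

```python
# Sketch of label projection via marked translation; the marker-preserving
# translation model itself is omitted and assumed to exist.
import re

def mark_entities(tokens: list[str], tags: list[str]) -> str:
    parts = []
    for tok, tag in zip(tokens, tags):
        parts.append(f"__{tag}__ {tok} __{tag}__" if tag != "O" else tok)
    return " ".join(parts)

def extract_entities(translated: str) -> list[tuple[str, str]]:
    # Returns (tag, surface form) pairs recovered from the target sentence
    return [(m.group(1), m.group(2))
            for m in re.finditer(r"__(\w+)__ (.+?) __\1__", translated)]
```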
- Sequence-to-Sequence Lexical Normalization with Multilingual Transformers [3.3302293148249125]
Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day-to-day digital communication.
This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data.
We propose a sentence-level sequence-to-sequence model based on mBART, which frames normalization as a machine translation problem.
arXiv Detail & Related papers (2021-10-06T15:53:20Z)
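Framing normalization as translation, as the mBART paper above does, means feeding noisy sentences as the source and canonical sentences as the target of a pretrained MT model. The sketch below shows the inference side, with the public mBART-50 checkpoint as an assumed stand-in for the paper's fine-tuned model.

```python
# Inference sketch: treat normalization as "translation" within one language.
# A real system would first fine-tune on (noisy, canonical) sentence pairs.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

name = "facebook/mbart-large-50"
tokenizer = MBart50TokenizerFast.from_pretrained(name, src_lang="en_XX", tgt_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(name)

def normalize(sentence: str) -> str:
    inputs = tokenizer(sentence, return_tensors="pt")
    output = model.generate(**inputs,
                            forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"])
    return tokenizer.decode(output[0], skip_special_tokens=True)
```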
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both the representation and gradient levels.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
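Of the two levels the regularization paper above names, the representation level is the easier to sketch: add a term pulling encoder states of parallel sentences together. The gradient-level counterpart is omitted here, and the weighting is an assumption.

```python
# Representation-level regularizer sketch: encourage similar pooled encoder
# states for translation-equivalent inputs. lam is an assumed hyperparameter.
import torch
import torch.nn.functional as F

def regularized_loss(nmt_loss: torch.Tensor,
                     src_enc: torch.Tensor,   # (batch, dim) pooled source states
                     tgt_enc: torch.Tensor,   # (batch, dim) pooled target states
                     lam: float = 0.1) -> torch.Tensor:
    return nmt_loss + lam * F.mse_loss(src_enc, tgt_enc)
```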
- Consistency Regularization for Cross-Lingual Fine-Tuning [61.08704789561351]
We propose to improve cross-lingual fine-tuning with consistency regularization.
Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations.
Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks.
arXiv Detail & Related papers (2021-06-15T15:35:44Z)
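Penalizing prediction sensitivity to augmentations, as the consistency-regularization paper above does, is commonly implemented as a symmetric KL term between the model's outputs on an example and on its augmented version; the sketch below shows that generic form, which may differ in detail from the paper's exact objective.

```python
# Symmetric KL consistency term between predictions on an original example
# and on one of its augmentations (e.g. code-switch substitution, paraphrase).
import torch.nn.functional as F

def consistency_loss(logits_orig, logits_aug):
    p = F.log_softmax(logits_orig, dim=-1)
    q = F.log_softmax(logits_aug, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean") +
                  F.kl_div(q, p, log_target=True, reduction="batchmean"))
```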
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction starting from nearly zero training examples, with models improving as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
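A character language model for interactive correction, as in the paper above, can be as small as a smoothed trigram model rebuilt whenever new target-language text arrives; the sketch below is that toy version, with add-one smoothing as an assumed design choice.

```python
# Toy character trigram LM: score correction candidates and prefer the one
# with the highest probability under whatever data has been collected so far.
import math
from collections import Counter

class CharTrigramLM:
    def __init__(self, corpus: str, alphabet_size: int = 30):
        padded = "##" + corpus
        self.tri = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
        self.bi = Counter(padded[i:i + 2] for i in range(len(padded) - 1))
        self.v = alphabet_size

    def score(self, word: str) -> float:
        padded = "##" + word
        return sum(math.log((self.tri[padded[i:i + 3]] + 1) /
                            (self.bi[padded[i:i + 2]] + self.v))
                   for i in range(len(padded) - 2))

# Usage: best = max(candidates, key=lm.score)
```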
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
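Without external resources, alignment as in the SLU paper above can be encouraged with a regularizer computed from the model's own representations; the sketch below matches batch-level sentence statistics across languages, which is one simple instantiation rather than the paper's exact method.

```python
# Sketch: align sentence-level representations across languages by matching
# batch statistics of mean-pooled encoder states. A word-level term would
# do the same over individual token states.
import torch
import torch.nn.functional as F

def sentence_alignment_loss(src_states: torch.Tensor,   # (batch, seq, dim)
                            tgt_states: torch.Tensor) -> torch.Tensor:
    src_sent = src_states.mean(dim=1)   # mean-pool over tokens
    tgt_sent = tgt_states.mean(dim=1)
    return F.mse_loss(src_sent.mean(dim=0), tgt_sent.mean(dim=0))
```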