Annotating Norwegian Language Varieties on Twitter for Part-of-Speech
- URL: http://arxiv.org/abs/2210.06150v1
- Date: Wed, 12 Oct 2022 12:53:30 GMT
- Title: Annotating Norwegian Language Varieties on Twitter for Part-of-Speech
- Authors: Petter Mæhlum, Andre Kåsen, Samia Touileb, Jeremy Barnes
- Abstract summary: We present a novel Norwegian Twitter dataset annotated with POS-tags.
We show that models trained on Universal Dependency (UD) data perform worse when evaluated against this dataset.
We also see that performance on dialectal tweets is comparable to the written standards for some models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Norwegian Twitter data poses an interesting challenge for Natural
Language Processing (NLP) tasks. These texts are difficult for models trained
on standardized text in one of the two Norwegian written forms (Bokmål and
Nynorsk), as they contain both the typical variation of social media text and
a large amount of dialectal variety. In this paper we present a novel
Norwegian Twitter dataset annotated with POS-tags. We show that models trained
on Universal Dependency (UD) data perform worse when evaluated against this
dataset, and that models trained on Bokmål generally perform better than those
trained on Nynorsk. We also see that performance on dialectal tweets is
comparable to the written standards for some models. Finally, we perform a
detailed analysis of the errors that models commonly make on this data.
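As a concrete illustration of the evaluation setup, the sketch below scores a
POS tagger against gold CoNLL-U annotations, token by token. The file format
follows UD conventions; the `tagger` callable and the file path are
illustrative assumptions, not the authors' actual tooling.

```python
def read_conllu(path):
    """Yield (tokens, gold_upos_tags) pairs from a CoNLL-U file."""
    tokens, tags = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                      # blank line ends a sentence
                if tokens:
                    yield tokens, tags
                    tokens, tags = [], []
            elif not line.startswith("#"):    # skip sentence metadata
                cols = line.split("\t")
                if cols[0].isdigit():         # skip multi-word token ranges
                    tokens.append(cols[1])    # FORM column
                    tags.append(cols[3])      # UPOS column
    if tokens:
        yield tokens, tags

def pos_accuracy(tagger, path):
    """Token-level accuracy of `tagger` (tokens -> UPOS tags) on `path`."""
    correct = total = 0
    for tokens, gold in read_conllu(path):
        pred = tagger(tokens)
        correct += sum(p == g for p, g in zip(pred, gold))
        total += len(gold)
    return correct / total
```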
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough?
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems in domains with limited amounts of annotated instances, as sketched below.
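A minimal sketch of what such a prompt-based classifier looks like, assuming a
generic instruction template (the wording below is an invented placeholder,
not the paper's prompt):

```python
def build_prompt(examples, labels, query):
    """Assemble a few-shot classification prompt from labeled examples."""
    lines = [f"Classify each text as one of: {', '.join(labels)}."]
    for text, label in examples:
        lines.append(f"Text: {text}\nLabel: {label}")
    lines.append(f"Text: {query}\nLabel:")
    return "\n\n".join(lines)

prompt = build_prompt(
    examples=[("The match was thrilling.", "sports"),
              ("Parliament passed the bill.", "politics")],
    labels=["sports", "politics"],
    query="The striker scored twice in the final.",
)
# `prompt` is sent to an instruction-tuned LM; its completion is the label.
```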
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Boosting Norwegian Automatic Speech Recognition
We present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk.
We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets.
We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10% to 7.60%, with models achieving 5.81% for Bokmål and 11.54% for Nynorsk.
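For reference, WER is the word-level Levenshtein distance between hypothesis
and reference, normalized by reference length; a minimal sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("stortinget er samlet", "stortinget er samla"))  # 0.333...
```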
arXiv Detail & Related papers (2023-07-04T12:05:15Z)
- NoCoLA: The Norwegian Corpus of Linguistic Acceptability
We present two new Norwegian datasets for evaluating language models.
NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences.
NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner.
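One common way to realize such a zero-shot judgement (an assumption here, not
necessarily NoCoLA's exact protocol) is to score each sentence with a masked
language model's pseudo-log-likelihood and expect acceptable sentences to
score higher; `bert-base-multilingual-cased` below is a stand-in for a
Norwegian masked LM.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "bert-base-multilingual-cased"         # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when masked in turn."""
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Higher score -> judged more acceptable.
print(pseudo_log_likelihood("Hun leser en bok."))   # acceptable
print(pseudo_log_likelihood("Hun en bok leser."))   # scrambled word order
```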
arXiv Detail & Related papers (2023-06-13T14:11:19Z)
- Thutmose Tagger: Single-pass neural model for Inverse Text Normalization
Inverse text normalization (ITN) is an essential post-processing step in automatic speech recognition.
We present a dataset preparation method based on the granular alignment of ITN examples.
One-to-one correspondence between tags and input words improves the interpretability of the model's predictions.
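A toy illustration of this tag-per-word formulation (the tags and the example
are invented for illustration; the paper learns tags from aligned data): each
spoken-form token gets exactly one tag that keeps it, deletes it, or rewrites
it with the written form.

```python
def apply_tags(tokens, tags):
    """Apply one ITN tag per input token to produce the written form."""
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "<SELF>":
            out.append(token)                 # keep the spoken token
        elif tag != "<DELETE>":
            out.append(tag)                   # replacement written form
    return " ".join(out)

spoken = "the meeting is at three thirty p m".split()
tags   = ["<SELF>", "<SELF>", "<SELF>", "<SELF>",
          "3:30", "<DELETE>", "p.m.", "<DELETE>"]
print(apply_tags(spoken, tags))               # -> the meeting is at 3:30 p.m.
```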
arXiv Detail & Related papers (2022-07-29T20:39:02Z)
- Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios?
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual data.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as their informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models for identifying the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Understanding by Understanding Not: Modeling Negation in Language Models
Negation is a core construction in natural language.
We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences.
We reduce the mean top-1 error rate to 4% on the negated LAMA dataset.
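A hedged sketch of such an unlikelihood term: for a token that should not be
generated (e.g., the continuation of a negated generic sentence), penalize
its probability with -log(1 - p). Shapes and the mixing weight below are
assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, negative_targets):
    """Penalize probability mass on tokens that should NOT be generated.

    logits: (batch, vocab); negative_targets: (batch,) token ids to suppress.
    """
    probs = F.softmax(logits, dim=-1)
    p_neg = probs.gather(1, negative_targets.unsqueeze(1)).squeeze(1)
    return -torch.log1p(-p_neg.clamp(max=1 - 1e-6)).mean()  # -log(1 - p)

logits = torch.randn(4, 32000)                # dummy next-token logits
bad_ids = torch.tensor([5, 17, 9, 2])         # token ids to suppress
loss = 0.5 * unlikelihood_loss(logits, bad_ids)  # added to the usual LM loss
```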
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
- NorDial: A Preliminary Corpus of Written Norwegian Dialect Use
We collect a small corpus of tweets and manually annotate them as Bokmål, Nynorsk, any dialect, or a mix.
We perform preliminary experiments with state-of-the-art models, as well as an analysis of the data to expand this corpus in the future.
arXiv Detail & Related papers (2021-04-11T10:56:53Z)
- From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection
We propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection.
Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state of the art on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
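The core of such an adaptation pipeline, sketched with a generic checkpoint
(`xlm-roberta-base` and the hyperparameters are placeholders, not the paper's
Vietnamese-specific setup):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "xlm-roberta-base"                     # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=2)                       # 0 = clean, 1 = hate speech

batch = tok(["an example comment"], return_tensors="pt",
            truncation=True, max_length=128)
labels = torch.tensor([0])
loss = model(**batch, labels=labels).loss     # cross-entropy over 2 classes
loss.backward()                               # an optimizer step would follow
```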
arXiv Detail & Related papers (2021-02-24T09:30:55Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction starting from nearly zero training examples, with models improving as more data is collected.
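A deliberately minimal sketch of that interactive loop, with a toy knowledge
base and string-similarity scoring standing in for the paper's models:

```python
from difflib import get_close_matches

kb = {"huset", "gata", "skulen"}              # known target-language forms
collected = []                                # grows with each interaction

def suggest(word):
    """Propose the closest known form, or the word itself."""
    match = get_close_matches(word, kb, n=1, cutoff=0.6)
    return match[0] if match else word

def interact(word, user_correction=None):
    """One correction round: suggest, let the user confirm or override."""
    guess = suggest(word)
    final = user_correction or guess
    kb.add(final)
    collected.append((word, final))           # training pair for later models
    return final

print(interact("skuln"))                      # -> skulen
```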
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations
We present a data collection effort to correct the class with the highest error rate in SNLI-VE.
We also introduce e-SNLI-VE, which appends human-written natural language explanations to SNLI-VE.
We train models that learn from these explanations at training time, and output such explanations at testing time.
arXiv Detail & Related papers (2020-04-07T23:12:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.