Understanding the effects of word-level linguistic annotations in
under-resourced neural machine translation
- URL: http://arxiv.org/abs/2401.16078v1
- Date: Mon, 29 Jan 2024 11:39:46 GMT
- Title: Understanding the effects of word-level linguistic annotations in
under-resourced neural machine translation
- Authors: Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez
- Abstract summary: This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation.
When words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the effects of word-level linguistic annotations in
under-resourced neural machine translation, for which there is incomplete
evidence in the literature. The study covers eight language pairs, different
training corpus sizes, two architectures, and three types of annotation: dummy
tags (with no linguistic information at all), part-of-speech tags, and
morpho-syntactic description tags, which consist of part of speech and
morphological features. These linguistic annotations are interleaved in the
input or output streams as a single tag placed before each word. In order to
measure the performance under each scenario, we use automatic evaluation
metrics and perform automatic error classification. Our experiments show that,
in general, source-language annotations are helpful and morpho-syntactic
descriptions outperform part of speech for some language pairs. In contrast,
when words are annotated in the target language, part-of-speech tags
systematically outperform morpho-syntactic description tags in terms of
automatic evaluation metrics, even though the use of morpho-syntactic
description tags improves the grammaticality of the output. We provide a
detailed analysis of the reasons behind this result.
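To make the annotation scheme concrete, the short Python sketch below interleaves one tag before each word of a token stream, as described in the abstract. The example sentence, tag names, and function name are illustrative assumptions, not material from the paper.

```python
# A rough sketch of the interleaving scheme described in the abstract: one
# linguistic tag is placed before each word of the source (or target) stream.
# The sentence, tag set, and function name below are illustrative assumptions.

def interleave_tags(words, tags):
    """Return a token stream with a single tag inserted before each word."""
    if len(words) != len(tags):
        raise ValueError("each word needs exactly one tag")
    stream = []
    for tag, word in zip(tags, words):
        stream.append(tag)    # e.g. a POS tag such as "NOUN", or a dummy tag
        stream.append(word)
    return stream

# Hypothetical English source sentence annotated with part-of-speech tags.
words = ["the", "cat", "sleeps"]
pos_tags = ["DET", "NOUN", "VERB"]
print(" ".join(interleave_tags(words, pos_tags)))
# -> DET the NOUN cat VERB sleeps
```

The same routine would apply to dummy tags (a constant placeholder) or to morpho-syntactic description tags (part of speech plus morphological features), since the only difference is the tag string placed before each word.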
Related papers
- Urdu Dependency Parsing and Treebank Development: A Syntactic and Morphological Perspective [0.0]
We use dependency parsing to analyze news articles in Urdu.
We achieve a best labeled accuracy (LA) of 70% and an unlabeled attachment score (UAS) of 84%.
arXiv Detail & Related papers (2024-06-13T19:30:32Z)
- Sentiment-Aware Word and Sentence Level Pre-training for Sentiment Analysis [64.70116276295609]
SentiWSP is a Sentiment-aware pre-trained language model with combined Word-level and Sentence-level Pre-training tasks.
SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks.
arXiv Detail & Related papers (2022-10-18T12:25:29Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from rich-sourced languages to poorer ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- AUTOLEX: An Automatic Framework for Linguistic Exploration [93.89709486642666]
We propose an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena.
Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order.
We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
arXiv Detail & Related papers (2022-03-25T20:37:30Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- On the Impact of Knowledge-based Linguistic Annotations in the Quality of Scientific Embeddings [0.0]
We conduct a study on the use of explicit linguistic annotations to generate embeddings from a scientific corpus.
Our results show how the effect of such annotations in the embeddings varies depending on the evaluation task.
In general, we observe that learning embeddings using linguistic annotations contributes to achieving better evaluation results.
arXiv Detail & Related papers (2021-04-13T13:51:22Z)
- Sparsely Factored Neural Machine Translation [3.4376560669160394]
A standard approach to incorporating linguistic information into neural machine translation systems consists in maintaining separate vocabularies for each of the annotated features.
We propose a method suited for such a case, showing large improvements on out-of-domain data and comparable quality on in-domain data.
Experiments are performed on morphologically rich languages such as Basque and German in low-resource scenarios; a rough sketch of the factored-vocabulary setup is given after this entry.
arXiv Detail & Related papers (2021-02-17T18:42:00Z)
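As a contrast to the interleaving scheme sketched earlier, the following snippet illustrates the factored setup referred to in the entry above, where each annotated feature keeps its own vocabulary and embedding table. All identifiers, example words, dimensions, and the concatenation choice are assumptions made for illustration, not the paper's implementation.

```python
# A minimal sketch of a factored representation: words and POS tags have
# separate vocabularies and embedding tables, and each token vector combines
# both. Names, sizes, and the use of concatenation are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

word_vocab = {"<unk>": 0, "etxe": 1, "zuria": 2}   # hypothetical Basque words
pos_vocab = {"<unk>": 0, "NOUN": 1, "ADJ": 2}      # separate feature vocabulary

word_emb = rng.normal(size=(len(word_vocab), 8))   # word embedding table
pos_emb = rng.normal(size=(len(pos_vocab), 4))     # feature embedding table

def embed_token(word, pos):
    """Look up the word and feature embeddings and concatenate them."""
    w = word_emb[word_vocab.get(word, word_vocab["<unk>"])]
    p = pos_emb[pos_vocab.get(pos, pos_vocab["<unk>"])]
    return np.concatenate([w, p])                  # 12-dimensional token vector

print(embed_token("etxe", "NOUN").shape)           # (12,)
```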
- Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT [7.057643880514415]
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment is manifested across the embedding spaces of different languages.
arXiv Detail & Related papers (2021-01-26T19:21:59Z)
- Neural disambiguation of lemma and part of speech in morphologically rich languages [0.6346772579930928]
We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages.
We propose a method for disambiguating ambiguous words in context using a large unannotated corpus of text and a morphological analyser.
arXiv Detail & Related papers (2020-07-12T21:48:52Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)