Understanding the effects of word-level linguistic annotations in
under-resourced neural machine translation
- URL: http://arxiv.org/abs/2401.16078v1
- Date: Mon, 29 Jan 2024 11:39:46 GMT
- Title: Understanding the effects of word-level linguistic annotations in
under-resourced neural machine translation
- Authors: Víctor M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez
- Abstract summary: This paper studies the effects of word-level linguistic annotations in under-resourced neural machine translation.
When words are annotated in the target language, part-of-speech tags systematically outperform morpho-syntactic description tags in terms of automatic evaluation metrics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the effects of word-level linguistic annotations in
under-resourced neural machine translation, for which there is incomplete
evidence in the literature. The study covers eight language pairs, different
training corpus sizes, two architectures, and three types of annotation: dummy
tags (with no linguistic information at all), part-of-speech tags, and
morpho-syntactic description tags, which consist of part of speech and
morphological features. These linguistic annotations are interleaved in the
input or output streams as a single tag placed before each word. In order to
measure the performance under each scenario, we use automatic evaluation
metrics and perform automatic error classification. Our experiments show that,
in general, source-language annotations are helpful and morpho-syntactic
descriptions outperform part of speech for some language pairs. In contrast,
when words are annotated in the target language, part-of-speech tags
systematically outperform morpho-syntactic description tags in terms of
automatic evaluation metrics, even though the use of morpho-syntactic
description tags improves the grammaticality of the output. We provide a
detailed analysis of the reasons behind this result.
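To make the annotation scheme concrete, the short Python sketch below interleaves one tag before each word of a token stream, as described in the abstract. The example sentence, tag names, and function name are illustrative assumptions, not material from the paper.

```python
# A rough sketch of the interleaving scheme described in the abstract: one
# linguistic tag is placed before each word of the source (or target) stream.
# The sentence, tag set, and function name below are illustrative assumptions.

def interleave_tags(words, tags):
    """Return a token stream with a single tag inserted before each word."""
    if len(words) != len(tags):
        raise ValueError("each word needs exactly one tag")
    stream = []
    for tag, word in zip(tags, words):
        stream.append(tag)    # e.g. a POS tag such as "NOUN", or a dummy tag
        stream.append(word)
    return stream

# Hypothetical English source sentence annotated with part-of-speech tags.
words = ["the", "cat", "sleeps"]
pos_tags = ["DET", "NOUN", "VERB"]
print(" ".join(interleave_tags(words, pos_tags)))
# -> DET the NOUN cat VERB sleeps
```

The same routine would apply to dummy tags (a constant placeholder) or to morpho-syntactic description tags (part of speech plus morphological features), since the only difference is the tag string placed before each word.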
Related papers
- Urdu Dependency Parsing and Treebank Development: A Syntactic and Morphological Perspective [0.0]
We use dependency parsing to analyze news articles in Urdu.
We achieve a best labeled accuracy (LA) of 70% and an unlabeled attachment score (UAS) of 84%.
arXiv Detail & Related papers (2024-06-13T19:30:32Z)
- Sentiment-Aware Word and Sentence Level Pre-training for Sentiment Analysis [64.70116276295609]
SentiWSP is a Sentiment-aware pre-trained language model with combined Word-level and Sentence-level Pre-training tasks.
SentiWSP achieves new state-of-the-art performance on various sentence-level and aspect-level sentiment classification benchmarks.
arXiv Detail & Related papers (2022-10-18T12:25:29Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from rich-sourced languages to poorer ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- AUTOLEX: An Automatic Framework for Linguistic Exploration [93.89709486642666]
We propose an automatic framework that aims to ease linguists' discovery and extraction of concise descriptions of linguistic phenomena.
Specifically, we apply this framework to extract descriptions for three phenomena: morphological agreement, case marking, and word order.
We evaluate the descriptions with the help of language experts and propose a method for automated evaluation when human evaluation is infeasible.
arXiv Detail & Related papers (2022-03-25T20:37:30Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- On the Impact of Knowledge-based Linguistic Annotations in the Quality of Scientific Embeddings [0.0]
We conduct a study on the use of explicit linguistic annotations to generate embeddings from a scientific corpus.
Our results show how the effect of such annotations in the embeddings varies depending on the evaluation task.
In general, we observe that learning embeddings using linguistic annotations contributes to achieving better evaluation results.
arXiv Detail & Related papers (2021-04-13T13:51:22Z)
- Sparsely Factored Neural Machine Translation [3.4376560669160394]
A standard approach to incorporating linguistic information into neural machine translation systems consists in maintaining separate vocabularies for each of the annotated features.
We propose a method suited for such a case, showing large improvements on out-of-domain data and comparable quality on in-domain data.
Experiments are performed on morphologically rich languages such as Basque and German in low-resource scenarios; a rough sketch of the factored-vocabulary setup is given after this entry.
arXiv Detail & Related papers (2021-02-17T18:42:00Z)
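As a contrast to the interleaving scheme sketched earlier, the following snippet illustrates the factored setup referred to in the entry above, where each annotated feature keeps its own vocabulary and embedding table. All identifiers, example words, dimensions, and the concatenation choice are assumptions made for illustration, not the paper's implementation.

```python
# A minimal sketch of a factored representation: words and POS tags have
# separate vocabularies and embedding tables, and each token vector combines
# both. Names, sizes, and the use of concatenation are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

word_vocab = {"<unk>": 0, "etxe": 1, "zuria": 2}   # hypothetical Basque words
pos_vocab = {"<unk>": 0, "NOUN": 1, "ADJ": 2}      # separate feature vocabulary

word_emb = rng.normal(size=(len(word_vocab), 8))   # word embedding table
pos_emb = rng.normal(size=(len(pos_vocab), 4))     # feature embedding table

def embed_token(word, pos):
    """Look up the word and feature embeddings and concatenate them."""
    w = word_emb[word_vocab.get(word, word_vocab["<unk>"])]
    p = pos_emb[pos_vocab.get(pos, pos_vocab["<unk>"])]
    return np.concatenate([w, p])                  # 12-dimensional token vector

print(embed_token("etxe", "NOUN").shape)           # (12,)
```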
- Deep Subjecthood: Higher-Order Grammatical Features in Multilingual BERT [7.057643880514415]
We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment is manifested across the embedding spaces of different languages.
arXiv Detail & Related papers (2021-01-26T19:21:59Z)
- Neural disambiguation of lemma and part of speech in morphologically rich languages [0.6346772579930928]
We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages.
We propose a method for disambiguating ambiguous words in context using a large unannotated corpus of text and a morphological analyser.
arXiv Detail & Related papers (2020-07-12T21:48:52Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)