Graph-Based Multilingual Label Propagation for Low-Resource
Part-of-Speech Tagging
- URL: http://arxiv.org/abs/2210.09840v1
- Date: Tue, 18 Oct 2022 13:26:09 GMT
- Title: Graph-Based Multilingual Label Propagation for Low-Resource
Part-of-Speech Tagging
- Authors: Ayyoob Imani, Silvia Severini, Masoud Jalili Sabet, François Yvon,
Hinrich Schütze
- Abstract summary: Part-of-Speech (POS) tagging is an important component of the NLP pipeline.
Many low-resource languages lack labeled data for training.
We propose a novel method for transferring labels from multiple high-resource source languages to low-resource target languages.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Part-of-Speech (POS) tagging is an important component of the NLP pipeline,
but many low-resource languages lack labeled data for training. An established
method for training a POS tagger in such a scenario is to create a labeled
training set by transferring from high-resource languages. In this paper, we
propose a novel method for transferring labels from multiple high-resource
source languages to low-resource target languages. We formalize POS tag projection as
graph-based label propagation. Given translations of a sentence in multiple
languages, we create a graph with words as nodes and alignment links as edges
by aligning words for all language pairs. We then propagate node labels from
source to target using a Graph Neural Network augmented with transformer
layers. We show that our propagation creates training sets that allow us to
train POS taggers for a diverse set of languages. When combined with enhanced
contextualized embeddings, our method achieves a new state-of-the-art for
unsupervised POS tagging of low-resource languages.
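To make the projection pipeline concrete, here is a minimal, self-contained sketch of tag projection over a multilingual alignment graph. A hedge up front: the paper propagates labels with a Graph Neural Network augmented with transformer layers, whereas the classic iterative label-propagation update below is a simplified stand-in, and the toy sentence, alignments, and identifiers are all hypothetical.

```python
# Minimal sketch of POS tag projection as graph-based label propagation.
# Classic iterative propagation stands in for the paper's GNN; the toy
# two-word parallel sentence and its alignments are hypothetical.
import numpy as np

POS_TAGS = ["NOUN", "VERB", "DET"]  # toy tag set

# Nodes: (language, token_index) pairs for one multi-parallel sentence.
nodes = [("en", 0), ("en", 1), ("de", 0), ("de", 1), ("yo", 0), ("yo", 1)]
idx = {n: i for i, n in enumerate(nodes)}

# Edges: word-alignment links computed for all language pairs
# (e.g., with an off-the-shelf word aligner).
edges = [(("en", 0), ("de", 0)), (("en", 1), ("de", 1)),
         (("en", 0), ("yo", 0)), (("de", 1), ("yo", 1))]

# Symmetric, row-normalized adjacency matrix.
A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0
A /= np.maximum(A.sum(axis=1, keepdims=True), 1e-9)

# Label matrix: source-language nodes carry one-hot POS tags,
# target-language nodes start uniform.
Y = np.full((len(nodes), len(POS_TAGS)), 1.0 / len(POS_TAGS))
gold = {("en", 0): "DET", ("en", 1): "NOUN",
        ("de", 0): "DET", ("de", 1): "NOUN"}
is_source = np.zeros(len(nodes), dtype=bool)
for node, tag in gold.items():
    Y[idx[node]] = np.eye(len(POS_TAGS))[POS_TAGS.index(tag)]
    is_source[idx[node]] = True

# Propagate: average neighbor label distributions, then clamp source nodes.
for _ in range(10):
    Y = A @ Y
    for node, tag in gold.items():
        Y[idx[node]] = np.eye(len(POS_TAGS))[POS_TAGS.index(tag)]

for node in nodes:
    if not is_source[idx[node]]:
        print(node, POS_TAGS[int(Y[idx[node]].argmax())])
```

In the paper's setting, source-language nodes would carry tags produced by supervised taggers, and the distributions that settle on target-language nodes would serve as silver training data for a target-language tagger.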
Related papers
- Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
The transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method (see the transliteration sketch after this list).
arXiv Detail & Related papers (2024-06-28T08:59:24Z)
- Universal Cross-Lingual Text Classification
This research proposes a novel perspective on Universal Cross-Lingual Text Classification.
Our approach involves blending supervised data from different languages during training to create a universal model.
The primary goal is to enhance label and language coverage, aiming for a label set that represents a union of labels from various languages.
arXiv Detail & Related papers (2024-06-16T17:58:29Z)
- Zero Resource Cross-Lingual Part Of Speech Tagging
In zero-resource settings, cross-lingual transfer can be an effective approach to part-of-speech tagging for low-resource languages with no labeled training data.
We evaluate a transfer learning setup with English as the source language and French, German, and Spanish as the target languages for part-of-speech tagging.
arXiv Detail & Related papers (2024-01-11T08:12:47Z)
- Cross-Register Projection for Headline Part of Speech Tagging
We train a multi-domain POS tagger on both long-form and headline text.
We show that our model yields a 23% relative error reduction per token and 19% per headline.
We make POSH, the POS-tagged Headline corpus, available to encourage research in improved NLP models for news headlines.
arXiv Detail & Related papers (2021-09-15T18:00:02Z)
- Cross-lingual alignments of ELMo contextual embeddings
Cross-lingual embeddings map word embeddings from a low-resource language to a high-resource language.
To produce cross-lingual mappings of recent contextual embeddings, anchor points between the embedding spaces have to be words in the same context.
We propose novel cross-lingual mapping methods for ELMo embeddings (see the Procrustes-style sketch after this list).
arXiv Detail & Related papers (2021-06-30T11:26:43Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition
Cross-lingual NER transfers knowledge from high-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We additionally propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (see the loss sketch after this list).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- Leveraging Adversarial Training in Self-Learning for Cross-Lingual Text Classification
We present a semi-supervised adversarial training process that minimizes the maximal loss for label-preserving input perturbations (see the FGSM-style sketch after this list).
We observe significant gains in effectiveness on document and intent classification for a diverse set of languages.
arXiv Detail & Related papers (2020-07-29T19:38:35Z)
- Cross-lingual, Character-Level Neural Morphological Tagging
We train character-level recurrent neural taggers to predict morphological taggings for high-resource languages and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
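Illustrative code sketches

The transliteration entry above motivates mapping different scripts into a common one so that a multilingual encoder sees shared surface forms. The sketch below uses the unidecode package as a stand-in transliterator; the paper's actual contribution is a post-pretraining alignment objective, which is not reproduced here.

```python
# Romanize text from different scripts into a shared Latin representation.
# unidecode is a stand-in transliterator, not the tool used in the paper.
from unidecode import unidecode

samples = ["привет мир",       # Russian, Cyrillic script
           "γειά σου κόσμε"]   # Greek script
for text in samples:
    print(text, "->", unidecode(text))
```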
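For the ELMo-mapping entry, a standard way to align two embedding spaces given anchor pairs is orthogonal Procrustes: solve for the rotation that best maps anchor vectors in one space onto their counterparts in the other. This is the textbook recipe, not necessarily the exact method of that paper; all data below is synthetic.

```python
# Orthogonal Procrustes alignment between two embedding spaces,
# fit on synthetic anchor pairs (aligned words in translated contexts).
import numpy as np

rng = np.random.default_rng(0)
d, n_anchors = 64, 500
X = rng.normal(size=(n_anchors, d))                      # source-space anchors
true_Q, _ = np.linalg.qr(rng.normal(size=(d, d)))        # hidden true rotation
Y = X @ true_Q + 0.01 * rng.normal(size=(n_anchors, d))  # target-space anchors

# W = U V^T, where U S V^T is the SVD of X^T Y, minimizes ||XW - Y||_F
# over orthogonal W.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print("mean residual per anchor:", np.linalg.norm(X @ W - Y) / n_anchors)
```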
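The FILTER entry mentions a KL-divergence self-teaching loss over auto-generated soft pseudo-labels. A minimal PyTorch rendering of that idea, with toy shapes and random tensors standing in for real model outputs:

```python
# KL self-teaching: match the model's predictions on target-language text
# to soft pseudo-labels generated for its translation. Shapes and tensors
# are toy stand-ins, not FILTER's implementation.
import torch
import torch.nn.functional as F

batch, num_labels = 8, 3
student_logits = torch.randn(batch, num_labels, requires_grad=True)
with torch.no_grad():
    pseudo_logits = torch.randn(batch, num_labels)  # auto-generated soft labels

# KL divergence between the pseudo-label distribution and the model's
# predicted distribution.
loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                F.softmax(pseudo_logits, dim=-1),
                reduction="batchmean")
loss.backward()
print("self-teaching loss:", float(loss))
```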
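The adversarial self-learning entry trains against worst-case, label-preserving input perturbations. The FGSM-style step below, applied to the input embeddings of a toy linear classifier, is a common instantiation of that min-max idea and is offered only as an illustration.

```python
# One FGSM-style adversarial step on input embeddings: perturb in the
# gradient-sign direction that maximizes the loss; the model can then be
# trained on the perturbed inputs. Toy sizes throughout.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
classifier = torch.nn.Linear(32, 4)            # toy classification head
emb = torch.randn(8, 32, requires_grad=True)   # toy input embeddings
labels = torch.randint(0, 4, (8,))

loss = F.cross_entropy(classifier(emb), labels)
grad, = torch.autograd.grad(loss, emb)

eps = 0.1
emb_adv = (emb + eps * grad.sign()).detach()   # label-preserving perturbation
adv_loss = F.cross_entropy(classifier(emb_adv), labels)
print(f"clean {float(loss):.3f} -> adversarial {float(adv_loss):.3f}")
```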