Zero Resource Cross-Lingual Part Of Speech Tagging
- URL: http://arxiv.org/abs/2401.05727v1
- Date: Thu, 11 Jan 2024 08:12:47 GMT
- Title: Zero Resource Cross-Lingual Part Of Speech Tagging
- Authors: Sahil Chopra
- Abstract summary: Part of speech tagging in zero-resource settings can be an effective approach for low-resource languages when no labeled training data is available.
We evaluate a transfer learning setup with English as the source language and French, German, and Spanish as target languages for part-of-speech tagging.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Part-of-speech tagging in zero-resource settings can be an effective approach
for low-resource languages when no labeled training data is available. Existing
systems use one of two main techniques for POS tagging: either pretrained
multilingual large language models (LLMs), or projecting the source-language
labels onto the zero-resource target language and training a sequence labeling
model on the projected data. We explore the latter approach, using an
off-the-shelf alignment module and training a hidden Markov model (HMM) to
predict the POS tags. We evaluate a transfer learning setup with English as the
source language and French, German, and Spanish as target languages for
part-of-speech tagging. Our conclusion is that projected alignment data in a
zero-resource language can be beneficial for predicting POS tags.
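As a rough illustration of this projection-plus-HMM pipeline (a minimal sketch, not the paper's actual code), the Python snippet below copies POS tags across a toy word alignment and trains a supervised HMM tagger with NLTK. The choice of NLTK, the function names, and the hand-written alignment are assumptions for the example; in practice the alignments would come from the off-the-shelf aligner mentioned in the abstract.

    # Minimal sketch: project POS tags across word alignments, then train an HMM.
    # Assumes alignments are already available as (source_index, target_index) pairs.
    from nltk.tag import hmm

    def project_tags(src_tags, tgt_tokens, alignment):
        """Copy POS tags from aligned source tokens onto target tokens."""
        projected = ["X"] * len(tgt_tokens)      # unaligned tokens keep a dummy tag
        for src_i, tgt_i in alignment:
            projected[tgt_i] = src_tags[src_i]
        return list(zip(tgt_tokens, projected))

    # Toy English -> French sentence pair with a hand-written alignment.
    src_tags   = ["DET", "NOUN", "VERB"]         # tags for: "the cat sleeps"
    tgt_tokens = ["le", "chat", "dort"]
    alignment  = [(0, 0), (1, 1), (2, 2)]

    train_data = [project_tags(src_tags, tgt_tokens, alignment)]

    # Train a supervised HMM tagger on the projected (pseudo-labelled) corpus.
    tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train_data)
    print(tagger.tag(["le", "chat", "dort"]))

In the actual setup, train_data would contain the full projected target-language corpus rather than a single toy sentence.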
Related papers
- Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios? [4.959459199361905]
This paper focuses on POS tagging for languages with limited data.
We seek to identify the characteristics of datasets that make them favourable for training POS tagging models without using any labelled training data from the target language.
arXiv Detail & Related papers (2024-10-14T14:51:13Z) - Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment [50.27950279695363]
Transfer performance is often hindered when a low-resource target language is written in a different script from the high-resource source language.
Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method.
arXiv Detail & Related papers (2024-06-28T08:59:24Z) - Visual Speech Recognition for Languages with Limited Labeled Data using
Automatic Labels from Whisper [96.43501666278316]
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages.
We employ a Whisper model which can conduct both language identification and audio-based speech recognition.
By comparing VSR models trained on automatic labels with those trained on human-annotated labels, we show that the automatic labels achieve similar VSR performance.
arXiv Detail & Related papers (2023-09-15T16:53:01Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Cross-Lingual Transfer Learning for Phrase Break Prediction with
Multilingual Language Model [13.730152819942445]
Cross-lingual transfer learning can be particularly effective for improving performance in low-resource languages.
This suggests that cross-lingual transfer can be inexpensive and effective for developing a TTS front-end in resource-poor languages.
arXiv Detail & Related papers (2023-06-05T04:10:04Z) - Graph-Based Multilingual Label Propagation for Low-Resource
Part-of-Speech Tagging [0.44798341036073835]
Part-of-Speech (POS) tagging is an important component of the NLP pipeline.
Many low-resource languages lack labeled data for training.
We propose a novel method for transferring labels from multiple high-resource source languages to low-resource target languages.
arXiv Detail & Related papers (2022-10-18T13:26:09Z) - Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good quality monolingual word embeddings (MWEs) can be built for languages which have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word translation pairs.
This paper proposes a new approach for building bilingual word embeddings (BWEs) in which the vector space of the high-resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We further propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend this idea to multiple low-resource languages.
We jointly train an AWE model and an AGWE model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z) - Improving Cross-Lingual Transfer Learning for End-to-End Speech
Recognition with Speech Translation [63.16500026845157]
We introduce speech-to-text translation as an auxiliary task to incorporate additional knowledge of the target language.
We show that training ST with human translations is not necessary.
Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to an 8.9% WER reduction over direct transfer.
arXiv Detail & Related papers (2020-06-09T19:34:11Z) - Multilingual acoustic word embedding models for processing zero-resource
languages [37.78342106714364]
We train a single supervised embedding model on labelled data from multiple well-resourced languages.
We then apply it to unseen zero-resource languages.
arXiv Detail & Related papers (2020-02-06T05:53:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.