Exploring transfer learning for Deep NLP systems on rarely annotated languages
- URL: http://arxiv.org/abs/2410.12879v1
- Date: Tue, 15 Oct 2024 13:33:54 GMT
- Title: Exploring transfer learning for Deep NLP systems on rarely annotated languages
- Authors: Dipendra Yadav, Tobias Strauß, Kristina Yordanova
- Abstract summary: This thesis investigates the application of transfer learning for Part-of-Speech (POS) tagging between Hindi and Nepali.
We assess whether multitask learning in Hindi, with auxiliary tasks such as gender and singular/plural tagging, can contribute to improved POS tagging accuracy.
- Abstract: Natural language processing (NLP) has experienced rapid advancements with the rise of deep learning, significantly outperforming traditional rule-based methods. By capturing hidden patterns and underlying structures within data, deep learning has improved performance across various NLP tasks, overcoming the limitations of rule-based systems. However, most research and development in NLP has been concentrated on a select few languages, primarily those with large numbers of speakers or financial significance, leaving many others underexplored. This lack of research is often attributed to the scarcity of adequately annotated datasets essential for training deep learning models. Despite this challenge, there is potential in leveraging the linguistic similarities between unexplored and well-studied languages, particularly those in close geographic and linguistic proximity. This thesis investigates the application of transfer learning for Part-of-Speech (POS) tagging between Hindi and Nepali, two highly similar languages belonging to the Indo-Aryan language family. Specifically, the work explores whether joint training of a POS tagging model for both languages enhances performance. Additionally, we assess whether multitask learning in Hindi, with auxiliary tasks such as gender and singular/plural tagging, can contribute to improved POS tagging accuracy. The deep learning architecture employed is the BLSTM-CNN-CRF model, trained under different conditions: monolingual word embeddings, vector-mapped embeddings, and jointly trained Hindi-Nepali word embeddings. Varying dropout rates (0.25 to 0.5) and optimizers (ADAM and AdaDelta) are also evaluated. Results indicate that jointly trained Hindi-Nepali word embeddings improve performance across all models compared to monolingual and vector-mapped embeddings.
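The thesis's tagger is a BLSTM-CNN-CRF: a character-level CNN feature per word, concatenated with a word embedding, fed through a bidirectional LSTM and a CRF output layer, optionally with auxiliary softmax heads for multitask learning (e.g. gender and number on Hindi). The sketch below is a minimal, illustrative PyTorch version, not the thesis's exact implementation: dimensions, vocabulary handling, and the auxiliary-head design are assumptions, and it relies on the external pytorch-crf package for the CRF layer.

```python
# Minimal BLSTM-CNN-CRF POS tagger sketch (illustrative, not the thesis's exact model).
import torch
import torch.nn as nn
from torchcrf import CRF  # external dependency: pip install pytorch-crf


class BLSTMCNNCRFTagger(nn.Module):
    def __init__(self, word_vocab, char_vocab, n_pos_tags,
                 word_dim=100, char_dim=30, char_filters=30,
                 lstm_hidden=200, dropout=0.5, aux_tasks=None):
        super().__init__()
        # Word embeddings: monolingual, vector-mapped, or jointly trained
        # Hindi-Nepali vectors would be loaded into this table.
        self.word_emb = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        # Character-level CNN producing one feature vector per word.
        self.char_emb = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        self.char_cnn = nn.Conv1d(char_dim, char_filters, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(dropout)
        self.blstm = nn.LSTM(word_dim + char_filters, lstm_hidden,
                             batch_first=True, bidirectional=True)
        # Main task: POS tagging with a CRF output layer.
        self.pos_proj = nn.Linear(2 * lstm_hidden, n_pos_tags)
        self.crf = CRF(n_pos_tags, batch_first=True)
        # Optional auxiliary heads (e.g. gender, number) sharing the BLSTM encoder.
        aux_tasks = aux_tasks or {}
        self.aux_heads = nn.ModuleDict(
            {name: nn.Linear(2 * lstm_hidden, n) for name, n in aux_tasks.items()})

    def encode(self, words, chars):
        # words: (batch, seq); chars: (batch, seq, max_word_len)
        b, s, w = chars.shape
        ce = self.char_emb(chars.view(b * s, w)).transpose(1, 2)   # (b*s, char_dim, w)
        cf = torch.relu(self.char_cnn(ce)).max(dim=2).values       # (b*s, char_filters)
        cf = cf.view(b, s, -1)
        x = torch.cat([self.word_emb(words), cf], dim=-1)
        h, _ = self.blstm(self.dropout(x))
        return self.dropout(h)

    def loss(self, words, chars, pos_tags, mask, aux_targets=None):
        h = self.encode(words, chars)
        emissions = self.pos_proj(h)
        loss = -self.crf(emissions, pos_tags, mask=mask, reduction='mean')
        # Multitask learning: add auxiliary losses when targets are provided.
        for name, target in (aux_targets or {}).items():
            logits = self.aux_heads[name](h)
            loss = loss + nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), target.view(-1), ignore_index=-100)
        return loss

    def predict(self, words, chars, mask):
        emissions = self.pos_proj(self.encode(words, chars))
        return self.crf.decode(emissions, mask=mask)
```

Training would then pair this model with Adam or AdaDelta and a dropout rate in the 0.25 to 0.5 range, as evaluated in the thesis; joint Hindi-Nepali training simply mixes mini-batches from both treebanks through the same model and embedding table.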
Related papers
- Multilingual Evaluation of Semantic Textual Relatedness [0.0]
Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective.
Prior NLP research has predominantly focused on English, limiting its applicability across languages.
We explore STR in Marathi, Hindi, Spanish, and English, unlocking the potential for information retrieval, machine translation, and more.
arXiv Detail & Related papers (2024-04-13T17:16:03Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
- Automatic Readability Assessment for Closely Related Languages [6.233117407988574]
This work focuses on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting.
We collect short stories written in three languages of the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models.
Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models.
arXiv Detail & Related papers (2023-05-22T20:42:53Z)
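The exact CrossNGO formulation is not reproduced in the summary above; the sketch below only illustrates the underlying idea of using character n-gram overlap with a closely related, mutually intelligible language as a readability feature. Function names and the trigram choice are assumptions.

```python
# Hypothetical sketch of a cross-lingual n-gram overlap feature (in the spirit of CrossNGO):
# the share of a document's character trigrams that also occur in a corpus of a related language.
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a whitespace-normalized text."""
    text = " ".join(text.split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_overlap(document: str, related_corpus: str, n: int = 3) -> float:
    """Fraction of the document's n-gram occurrences also attested in the related corpus."""
    doc = char_ngrams(document, n)
    ref = set(char_ngrams(related_corpus, n))
    total = sum(doc.values())
    if total == 0:
        return 0.0
    return sum(count for gram, count in doc.items() if gram in ref) / total


# Usage: one scalar feature per (document, related language) pair, concatenated with
# conventional readability features before training a classifier, e.g.:
# overlap_feature = ngram_overlap(tagalog_story, bikol_corpus)
```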
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- AsPOS: Assamese Part of Speech Tagger using Deep Learning Approach [7.252817150901275]
Part of Speech (POS) tagging is crucial to Natural Language Processing (NLP).
We present a Deep Learning (DL)-based POS tagger for Assamese.
We attain a tagging F1 score of 86.52%.
arXiv Detail & Related papers (2022-12-14T05:36:18Z)
- On the cross-lingual transferability of multilingual prototypical models across NLU tasks [2.44288434255221]
Supervised deep learning-based approaches have been applied to task-oriented dialog and have proven to be effective for limited domain and language applications.
In practice, these approaches suffer from the drawbacks of domain-driven design and under-resourced languages.
This article investigates cross-lingual transferability by synergistically combining few-shot learning with prototypical neural networks and multilingual Transformer-based models.
arXiv Detail & Related papers (2022-07-19T09:55:04Z)
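The entry above names prototypical networks over multilingual Transformer encodings; the sketch below shows only the standard prototypical classification step and assumes sentence embeddings have already been produced by some multilingual encoder (for example, mean-pooled Transformer outputs). The cited article's exact setup may differ.

```python
# Minimal prototypical-network classification step (sketch).
import torch


def prototypes(support_emb: torch.Tensor, support_labels: torch.Tensor,
               n_classes: int) -> torch.Tensor:
    """Class prototype = mean embedding of the support examples of that class."""
    dim = support_emb.size(-1)
    protos = torch.zeros(n_classes, dim, device=support_emb.device)
    for c in range(n_classes):
        protos[c] = support_emb[support_labels == c].mean(dim=0)
    return protos


def classify(query_emb: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    """Assign each query to the nearest prototype (squared Euclidean distance)."""
    dists = torch.cdist(query_emb, protos) ** 2   # (n_query, n_classes)
    return dists.argmin(dim=1)


# Few-shot cross-lingual transfer then amounts to building prototypes from a handful of
# labelled examples and classifying target-language queries against those prototypes.
```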
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the challenges of low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological taggings for high-resource languages and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
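The tagger architecture of that last paper is not reproduced here; the small sketch below only illustrates the joint-training setup it describes, namely a single shared character vocabulary and mixed training batches over a high-resource and a low-resource related language, feeding one character-level tagger (such as the BLSTM sketched earlier). Data structures and ratios are illustrative assumptions.

```python
# Sketch of joint character-level training data preparation for related languages.
import random


def build_char_vocab(corpora):
    """One shared character inventory across all languages (index 0 reserved for padding)."""
    chars = sorted({ch for corpus in corpora for word, _tags in corpus for ch in word})
    return {ch: i + 1 for i, ch in enumerate(chars)}


def mixed_batches(high_resource, low_resource, batch_size=32, low_ratio=0.5):
    """Yield batches mixing (word, tags) examples from both languages in every update."""
    while True:
        batch = []
        for _ in range(batch_size):
            corpus = low_resource if random.random() < low_ratio else high_resource
            batch.append(random.choice(corpus))
        yield batch
```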
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.