A Little Pretraining Goes a Long Way: A Case Study on Dependency Parsing
Task for Low-resource Morphologically Rich Languages
- URL: http://arxiv.org/abs/2102.06551v1
- Date: Fri, 12 Feb 2021 14:26:58 GMT
- Title: A Little Pretraining Goes a Long Way: A Case Study on Dependency Parsing
Task for Low-resource Morphologically Rich Languages
- Authors: Jivnesh Sandhan, Amrith Krishna, Ashim Gupta, Laxmidhar Behera and
Pawan Goyal
- Abstract summary: We focus on dependency parsing for morphologically rich languages (MRLs) in a low-resource setting.
To address these challenges, we propose simple auxiliary tasks for pretraining.
We perform experiments on 10 MRLs in low-resource settings to measure the efficacy of our proposed pretraining method.
- Score: 14.694800341598368
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural dependency parsing has achieved remarkable performance for many
domains and languages. However, the need for massive labeled data limits the
effectiveness of these approaches for low-resource languages. In this work, we
focus on dependency parsing for morphologically rich languages (MRLs) in a
low-resource setting. Although morphological information is essential for the
dependency parsing task, morphological disambiguation and the lack of powerful
analyzers make this information hard to obtain for MRLs. To address these
challenges, we propose simple auxiliary tasks for pretraining. We perform
experiments on 10 MRLs in low-resource settings to measure the efficacy of our
proposed pretraining method and observe an average absolute gain of 2 points
(UAS) and 3.6 points (LAS). Code and data available at:
https://github.com/jivnesh/LCM
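For a concrete picture of the setup, the following is a minimal sketch, not the authors' implementation (that lives in the linked repository): a shared encoder is pretrained on a cheap word-level auxiliary tagging task and then reused to initialise the dependency parser. The choice of morphological-tag prediction as the auxiliary task, the BiLSTM encoder, and all sizes are illustrative assumptions.
```python
# Minimal sketch, NOT the authors' code (see the linked LCM repo):
# pretrain a shared BiLSTM encoder on a word-level auxiliary tagging task,
# then reuse it to initialise a dependency parser. The auxiliary label set
# (morphological tags), the architecture, and all sizes are assumptions.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """BiLSTM encoder shared between auxiliary pretraining and parsing."""
    def __init__(self, vocab_size, emb_dim=100, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)

    def forward(self, token_ids):
        return self.lstm(self.emb(token_ids))[0]      # (B, T, 2 * hidden)

class AuxTagger(nn.Module):
    """Auxiliary pretraining head: predict one tag per token."""
    def __init__(self, encoder, num_tags, hidden=200):
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):
        return self.proj(self.encoder(token_ids))     # (B, T, num_tags)

encoder = SharedEncoder(vocab_size=10_000)
tagger = AuxTagger(encoder, num_tags=50)
optimizer = torch.optim.Adam(tagger.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

# One pretraining step on a toy batch of token ids and auxiliary tags.
tokens = torch.randint(1, 10_000, (8, 20))
tags = torch.randint(0, 50, (8, 20))
optimizer.zero_grad()
loss = loss_fn(tagger(tokens).reshape(-1, 50), tags.reshape(-1))
loss.backward()
optimizer.step()

# After pretraining, `encoder` initialises the parser's encoder and is
# fine-tuned jointly with a biaffine arc/label scorer on the small treebank.
```
The point of the sketch is only the two-stage structure: a cheap auxiliary objective warms up the shared encoder before the parser is trained on the limited treebank data.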
Related papers
- Small Models, Big Impact: Efficient Corpus and Graph-Based Adaptation of Small Multilingual Language Models for Low-Resource Languages [10.418542753869433]
Low-resource languages (LRLs) face significant challenges in natural language processing (NLP) due to limited data.
Current state-of-the-art large language models (LLMs) still struggle with LRLs.
Small multilingual models (mLMs) such as mBERT and XLM-R offer greater promise because their capacity is better matched to small training data sizes.
arXiv Detail & Related papers (2025-02-14T13:10:39Z) - Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - From Priest to Doctor: Domain Adaptation for Low-Resource Neural Machine Translation [3.666125285899499]
Many of the world's languages have insufficient data to train high-performing general neural machine translation (NMT) models.
arXiv Detail & Related papers (2024-12-01T21:06:08Z) - Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning [62.984693936073974]
Value-based reinforcement learning can learn effective policies for a wide range of multi-turn problems.
Current value-based RL methods have proven particularly challenging to scale to the setting of large language models.
We propose a novel offline RL algorithm that addresses these drawbacks, casting Q-learning as a modified supervised fine-tuning problem.
arXiv Detail & Related papers (2024-11-07T21:36:52Z) - GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x inference speed-up while retaining comparable performance (a sketch of the underlying kNN-LM interpolation appears after this list).
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - Adversarial Meta Sampling for Multilingual Low-Resource Speech
Recognition [159.9312272042253]
We develop a novel adversarial meta sampling (AMS) approach to improve multilingual meta-learning ASR (MML-ASR).
AMS adaptively determines the task sampling probability for each source language.
Experiment results on two multilingual datasets show significant performance improvement when applying our AMS on MML-ASR.
arXiv Detail & Related papers (2020-12-22T09:33:14Z) - Low-Resource Adaptation of Neural NLP Models [0.30458514384586405]
This thesis investigates methods for dealing with low-resource scenarios in information extraction and natural language understanding.
We develop and adapt neural NLP models to explore a number of research questions concerning NLP tasks with minimal or no training data.
arXiv Detail & Related papers (2020-11-09T12:13:55Z) - A Survey on Recent Approaches for Natural Language Processing in
Low-Resource Scenarios [30.391291221959545]
Deep neural networks and huge language models are becoming omnipresent in natural language applications.
As they are known for requiring large amounts of training data, there is a growing body of work to improve the performance in low-resource settings.
Motivated by the recent fundamental changes towards neural models and the popular pre-train and fine-tune paradigm, we survey promising approaches for low-resource natural language processing.
arXiv Detail & Related papers (2020-10-23T11:22:01Z) - Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic
Parsing [85.35582118010608]
Task-oriented semantic parsing is a critical component of virtual assistants.
Recent advances in deep learning have enabled several approaches to successfully parse more complex queries.
We propose a novel method that outperforms a supervised neural model at a 10-fold data reduction.
arXiv Detail & Related papers (2020-10-07T17:47:53Z)
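As promised above for the "Efficient Nearest Neighbor Language Models" entry: the non-parametric NLMs it studies follow the kNN-LM recipe of interpolating a parametric language model with a distribution retrieved from a datastore of (context vector, next token) pairs. The sketch below shows only that base interpolation on toy tensors; the efficiency techniques behind the reported 6x speed-up are not reproduced, and the function name and sizes are illustrative assumptions.
```python
# Minimal sketch of the base kNN-LM interpolation (not the paper's
# efficiency techniques). All tensors and sizes here are toy stand-ins.
import torch
import torch.nn.functional as F

def knn_lm_probs(query, keys, values, lm_logits, vocab_size, k=8, lam=0.25):
    """Interpolate a parametric LM with a nearest-neighbour distribution.

    query:     (d,)   hidden state of the current context
    keys:      (N, d) datastore keys (context vectors from training data)
    values:    (N,)   datastore values (the token that followed each context)
    lm_logits: (V,)   parametric LM logits for the next token
    """
    # Retrieve the k closest datastore entries by L2 distance.
    dists = torch.cdist(query[None], keys).squeeze(0)      # (N,)
    knn_d, knn_i = dists.topk(k, largest=False)

    # Softmax over negative distances, aggregated per vocabulary item.
    weights = F.softmax(-knn_d, dim=0)
    p_knn = torch.zeros(vocab_size)
    p_knn.index_add_(0, values[knn_i], weights)

    # Interpolate with the parametric LM distribution.
    p_lm = F.softmax(lm_logits, dim=0)
    return lam * p_knn + (1.0 - lam) * p_lm

# Toy usage with random tensors standing in for real model states.
d, N, V = 16, 1000, 50
p = knn_lm_probs(torch.randn(d), torch.randn(N, d),
                 torch.randint(0, V, (N,)), torch.randn(V), vocab_size=V)
assert torch.isclose(p.sum(), torch.tensor(1.0), atol=1e-5)
```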
This list is automatically generated from the titles and abstracts of the papers on this site.