Systematic Investigation of Strategies Tailored for Low-Resource
Settings for Sanskrit Dependency Parsing
- URL: http://arxiv.org/abs/2201.11374v1
- Date: Thu, 27 Jan 2022 08:24:53 GMT
- Title: Systematic Investigation of Strategies Tailored for Low-Resource
Settings for Sanskrit Dependency Parsing
- Authors: Jivnesh Sandhan, Laxmidhar Behera and Pawan Goyal
- Abstract summary: Existing state-of-the-art approaches for Sanskrit Dependency Parsing (SDP) are hybrid in nature.
However, purely data-driven approaches do not match the performance of hybrid approaches due to labelled data sparsity.
We experiment with five strategies, namely, data augmentation, sequential transfer learning, cross-lingual/mono-lingual pretraining, multi-task learning and self-training.
Our proposed ensembled system outperforms the purely data-driven state-of-the-art system by an absolute gain of 2.8/3.9 points (Unlabelled Attachment Score (UAS) / Labelled Attachment Score (LAS)).
- Score: 14.416855042499945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing state-of-the-art approaches for Sanskrit Dependency Parsing (SDP)
are hybrid in nature and rely on a lexicon-driven shallow parser for
linguistically motivated feature engineering. However, these methods fail to
handle out-of-vocabulary (OOV) words, which limits their applicability in
realistic scenarios. On the other hand, purely data-driven approaches do not
match the performance of hybrid approaches due to labelled data sparsity.
Thus, in this work, we investigate the following question: How far can we push
a purely data-driven approach using recently proposed strategies for
low-resource settings? We experiment with five strategies, namely, data
augmentation, sequential transfer learning, cross-lingual/mono-lingual
pretraining, multi-task learning and self-training. Our proposed ensembled
system outperforms the purely data-driven state-of-the-art system by an
absolute gain of 2.8/3.9 points (Unlabelled Attachment Score (UAS) / Labelled
Attachment Score (LAS)). Interestingly, it also surpasses the state-of-the-art
hybrid system by 1.2 points (UAS) absolute and shows comparable
performance in terms of LAS. Code and data will be publicly available at:
\url{https://github.com/Jivnesh/SanDP}.
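As a quick reference for the reported metrics, here is a minimal, hypothetical sketch of how UAS and LAS are typically computed for a single sentence (standard attachment-score definitions, not the authors' evaluation code):

```python
def attachment_scores(gold, pred):
    """Unlabelled/Labelled Attachment Scores for dependency parsing.
    Each tree is a list of (head_index, relation_label) pairs, one per token.
    UAS: fraction of tokens whose predicted head is correct.
    LAS: fraction of tokens whose head AND relation label are both correct."""
    assert len(gold) == len(pred)
    n = len(gold)
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    correct_labelled = sum(g == p for g, p in zip(gold, pred))
    return correct_heads / n, correct_labelled / n

# Example: 4-token sentence with one wrong relation label and one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "nmod")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)  # UAS = 0.75, LAS = 0.50
```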
Related papers
- Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models [60.00178316095646]
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using datasets like NLI.
Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency.
We propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence.
Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
arXiv Detail & Related papers (2025-02-19T12:07:53Z)
- Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu [53.437954702561065]
In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT.
This study systematically investigates how each resource and its quality affects the translation performance, with the Manchu language.
Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help.
arXiv Detail & Related papers (2025-02-17T14:53:49Z)
- Token-Level Graphs for Short Text Classification [1.6819960041696331]
We propose an approach which constructs text graphs entirely based on tokens obtained through pre-trained language models (PLMs).
Our method captures contextual and semantic information, overcomes vocabulary constraints, and allows for context-dependent word meanings.
Experimental results demonstrate that our method consistently achieves higher scores than, or on-par performance with, existing methods.
arXiv Detail & Related papers (2024-12-17T10:19:44Z)
- Uniform Discretized Integrated Gradients: An effective attribution based method for explaining large language models [0.0]
Integrated Gradients is a well-known technique for explaining deep learning models.
In this paper, we propose a method called Uniform Discretized Integrated Gradients (UDIG).
We evaluate our method on two types of NLP tasks, Sentiment Classification and Question Answering, against three metrics: Log-odds, Comprehensiveness and Sufficiency. (A sketch of standard Integrated Gradients, the base technique, follows this entry.)
arXiv Detail & Related papers (2024-12-05T05:39:03Z)
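As background for the UDIG entry above, a generic PyTorch sketch of standard Integrated Gradients (the base technique, not UDIG itself) is shown below; the `model` interface and 1-D input shape are illustrative assumptions:

```python
import torch

def integrated_gradients(model, x, baseline, target_idx, steps=50):
    """Riemann-sum approximation of
    IG_i(x) = (x_i - x'_i) * integral_0^1 dF(x' + a * (x - x')) / dx_i da,
    where x' is the baseline and F is the model score for `target_idx`."""
    total_grads = torch.zeros_like(x)
    for a in torch.linspace(0.0, 1.0, steps):
        point = (baseline + a * (x - baseline)).detach().requires_grad_(True)
        score = model(point)[target_idx]   # scalar score for the target class
        score.backward()                   # gradient w.r.t. the interpolated point
        total_grads += point.grad
    return (x - baseline) * (total_grads / steps)
```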
- Cross-lingual Back-Parsing: Utterance Synthesis from Meaning Representation for Zero-Resource Semantic Parsing [6.074150063191985]
Cross-Lingual Back-Parsing is a novel data augmentation methodology designed to enhance cross-lingual transfer for semantic parsing.
Our methodology effectively performs cross-lingual data augmentation in challenging zero-resource settings.
arXiv Detail & Related papers (2024-10-01T08:53:38Z)
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing [68.47787275021567]
Cross-lingual semantic parsing transfers parsing capability from a high-resource language (e.g., English) to low-resource languages with scarce training data.
We propose a new approach to cross-lingual semantic parsing by explicitly minimizing cross-lingual divergence between latent variables using Optimal Transport.
arXiv Detail & Related papers (2023-07-09T04:52:31Z)
- Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model.
Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data. (A sketch of the PVI formula follows this entry.)
arXiv Detail & Related papers (2023-02-10T07:37:49Z)
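For context on the PVI entry above, a minimal sketch of the pointwise V-information formula; the two probabilities are assumed to have already been extracted from a model fine-tuned with real inputs (g) and one fine-tuned with null inputs (g'):

```python
import math

def pvi(p_with_input: float, p_null_input: float) -> float:
    """PVI(x -> y) = -log2 g'(y | null) + log2 g(y | x).
    Higher PVI means the input x carries more usable information about
    the label y, i.e. the datapoint is more useful for training."""
    return -math.log2(p_null_input) + math.log2(p_with_input)

# A datapoint whose label the input-conditioned model predicts easily (0.9)
# but the null-input model does not (0.2) is informative: ~2.17 bits.
print(pvi(0.9, 0.2))
```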
- SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations [2.535399238341164]
End-to-end Speech Translation is hindered by a lack of available data resources.
We propose a new data augmentation strategy, SegAugment, to address this issue.
We show that the proposed method can also successfully augment sentence-level datasets.
arXiv Detail & Related papers (2022-12-19T18:29:31Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM. (A generic self-training loop is sketched after this entry.)
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
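Self-training, the strategy named both in the main paper above and in the SFLM entry, can be summarised with a generic loop. The sketch below assumes a hypothetical `parser` object exposing fit() and parse() returning (tree, confidence); it illustrates the strategy and is not either paper's exact recipe:

```python
def self_train(parser, labelled, unlabelled, rounds=3, threshold=0.9):
    """Generic self-training: train on gold data, pseudo-label unlabelled
    sentences, keep only high-confidence parses as silver data, retrain."""
    train_set = list(labelled)
    for _ in range(rounds):
        parser.fit(train_set)
        silver = []
        for sentence in unlabelled:
            tree, confidence = parser.parse(sentence)
            if confidence >= threshold:      # keep only confident pseudo-labels
                silver.append((sentence, tree))
        train_set = list(labelled) + silver  # gold + silver for the next round
    return parser
```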
This list is automatically generated from the titles and abstracts of the papers in this site.