Automatic Language Identification for Celtic Texts
- URL: http://arxiv.org/abs/2203.04831v1
- Date: Wed, 9 Mar 2022 16:04:13 GMT
- Title: Automatic Language Identification for Celtic Texts
- Authors: Olha Dovbnia, Anna Wróblewska
- Abstract summary: This work addresses the identification of related low-resource languages, using the Celtic language family as an example.
We collected a new dataset including Irish, Scottish, Welsh and English records.
We tested supervised models such as SVM and neural networks with traditional statistical features alongside the output of clustering, autoencoder, and topic modelling methods.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language identification is an important Natural Language Processing task. It
has been thoroughly researched in the literature. However, some issues are
still open. This work addresses the identification of related low-resource
languages, using the Celtic language family as an example.
This work's main goals were: (1) to collect a dataset of three Celtic
languages; (2) to prepare a method to identify the languages of the Celtic
family, i.e. to train a successful classification model; (3) to evaluate the
influence of different feature extraction methods, and to explore the
applicability of unsupervised models as a feature extraction technique; (4)
to experiment with unsupervised feature extraction on a reduced annotated
set.
We collected a new dataset including Irish, Scottish, Welsh and English
records. We tested supervised models such as SVM and neural networks with
traditional statistical features alongside the output of clustering,
autoencoder, and topic modelling methods. The analysis showed that
unsupervised features could serve as a valuable extension to the n-gram feature
vectors, improving performance for the more entangled classes.
The best model achieved a 98% F1 score and 97% MCC. The dense neural network
consistently outperformed the SVM model.
Low-resource languages are also challenging because annotated training data
is scarce. To address this, the work evaluated classifiers that use
unsupervised feature extraction on a reduced labelled dataset. The results
showed that the unsupervised feature vectors are more robust to reduction of
the labelled set, and so help achieve comparable classification performance
with much less labelled data.
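To make the abstract's pipeline more concrete, the following is a minimal sketch of character n-gram feature extraction and of the two reported metrics, macro F1 and multi-class MCC. It is not the authors' implementation: the nearest-profile classifier stands in for the SVM and dense neural network, the four toy phrases are illustrative and not taken from the collected dataset, and all function names and the language codes are assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in lowercased text."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train_profiles(samples):
    """Sum n-gram counts per label; samples is a list of (text, label) pairs."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the label whose n-gram profile is most similar to the text."""
    feats = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(profiles[label], feats))


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for lab in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient from label counts."""
    s = len(y_true)                                 # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified
    t_k, p_k = Counter(y_true), Counter(y_pred)      # per-class true / predicted
    num = c * s - sum(t_k[lab] * p_k[lab] for lab in t_k)
    den = sqrt((s * s - sum(v * v for v in p_k.values()))
               * (s * s - sum(v * v for v in t_k.values())))
    return num / den if den else 0.0


# Toy phrases for illustration only (hypothetical, not the paper's data).
SAMPLES = [
    ("Dia duit, conas atá tú?", "ga"),      # Irish
    ("Ciamar a tha thu?", "gd"),            # Scottish Gaelic
    ("Sut wyt ti? Bore da.", "cy"),         # Welsh
    ("How are you? Good morning.", "en"),   # English
]
profiles = train_profiles(SAMPLES)
print(predict(profiles, "Good morning to you"))
```

In a fuller setup, the n-gram counts would be fed to a stronger classifier and optionally concatenated with unsupervised features (cluster assignments, autoencoder codes, or topic distributions), which is the extension the abstract evaluates.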
Related papers
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
(2024-06-20T18:35:47Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
(2023-10-09T08:07:04Z)
- A deep Natural Language Inference predictor without language-specific training data [44.26507854087991]
We present an NLP technique to tackle the problem of natural language inference (NLI) between pairs of sentences in a target language of choice without a language-specific training dataset.
We exploit a generic translation dataset, manually translated, along with two instances of the same pre-trained model.
The model has been evaluated on the machine-translated Stanford NLI test dataset, the machine-translated Multi-Genre NLI test dataset, and the manually translated RTE3-ITA test dataset.
(2023-09-06T10:20:59Z)
- Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis [0.0]
This study aims to establish an automated classification system for Pashto text.
The study achieved an average testing accuracy rate of 94%.
The use of pre-trained language representation models, such as DistilBERT, showed promising results.
(2023-05-04T23:11:31Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
(2023-01-22T18:22:55Z)
- Finding Dataset Shortcuts with Grammar Induction [85.47127659108637]
We propose to use probabilistic grammars to characterize and discover shortcuts in NLP datasets.
Specifically, we use a context-free grammar to model patterns in sentence classification datasets and use a synchronous context-free grammar to model datasets involving sentence pairs.
The resulting grammars reveal interesting shortcut features in a number of datasets, including both simple and high-level features.
(2022-10-20T19:54:11Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
(2021-10-04T08:51:36Z)
- Recognition and Processing of NATOM [0.0]
This paper shows how to process NOTAM (Notice to Airmen) data in the field of civil aviation.
The original NOTAM data mix Chinese and English and are poorly structured.
GloVe word vectors are used to represent the data, together with a custom mapping vocabulary.
(2021-04-29T10:12:00Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
(2020-01-30T16:58:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.