Context-Gloss Augmentation for Improving Arabic Target Sense Verification
- URL: http://arxiv.org/abs/2302.03126v1
- Date: Mon, 6 Feb 2023 21:24:02 GMT
- Title: Context-Gloss Augmentation for Improving Arabic Target Sense Verification
- Authors: Sanad Malaysha, Mustafa Jarrar, Mohammed Khalilia
- Abstract summary: The most common semantically-labeled dataset for Arabic is ArabGlossBERT.
This paper presents an enrichment of the ArabGlossBERT dataset by augmenting it using machine back-translation.
We measure the impact of augmentation using different data configurations to fine-tune BERT on the target sense verification (TSV) task.
- Score: 1.2891210250935146
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The Arabic language lacks semantic datasets and sense inventories. The most
common semantically-labeled dataset for Arabic is ArabGlossBERT, a
relatively small dataset consisting of 167K context-gloss pairs (about 60K
positive and 107K negative), collected from Arabic dictionaries. This
paper presents an enrichment of the ArabGlossBERT dataset by augmenting it
using (Arabic-English-Arabic) machine back-translation. Augmentation increased
the dataset size to 352K pairs (149K positive and 203K negative). We
measure the impact of augmentation using different data configurations to
fine-tune BERT on the target sense verification (TSV) task. Overall, accuracy
ranges from 78% to 84% across the data configurations. Although our
approach performed on par with the baseline, we did observe improvements
for some POS tags in some experiments. Furthermore, our fine-tuned models are
trained on a larger dataset covering a larger vocabulary and more contexts. We provide
an in-depth analysis of the accuracy for each part-of-speech (POS) tag.
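The augmentation step described above is round-trip (Arabic-English-Arabic) machine translation of the context sentences. Below is a minimal sketch of that idea using the Hugging Face transformers library; the OPUS-MT checkpoints are assumptions, since the abstract does not say which MT system the authors used.

```python
# Minimal back-translation sketch (Arabic -> English -> Arabic).
# Assumption: Helsinki-NLP OPUS-MT checkpoints stand in for whatever
# MT system the paper actually used.
from transformers import pipeline

ar_en = pipeline("translation", model="Helsinki-NLP/opus-mt-ar-en")
en_ar = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")

def back_translate(arabic_text: str) -> str:
    """Return an Arabic paraphrase produced by round-trip translation."""
    english = ar_en(arabic_text, max_length=512)[0]["translation_text"]
    return en_ar(english, max_length=512)[0]["translation_text"]

# Each augmented context would keep the label of its original
# context-gloss pair; contexts that lose the target word in the round
# trip would presumably need to be filtered out (assumption).
```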
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A, a large drop of more than 25% in performance is observed on redacted queries compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z)
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" of speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on a multilingually diverse dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Characterizing and Measuring Linguistic Dataset Drift [65.28821163863665]
We propose three dimensions of linguistic dataset drift: vocabulary, structural, and semantic drift.
These dimensions correspond to content word frequency divergences, syntactic divergences, and meaning changes not captured by word frequencies.
We find that our drift metrics are more effective than previous metrics at predicting out-of-domain model accuracies (a frequency-divergence sketch follows this list).
arXiv Detail & Related papers (2023-05-26T17:50:51Z)
- How to Solve Few-Shot Abusive Content Detection Using the Data We Actually Have [58.23138483086277]
In this work we leverage datasets we already have, covering a wide range of tasks related to abusive language detection.
Our goal is to build models cheaply for a new target label set and/or language, using only a few training examples of the target domain.
Our experiments show that, using existing datasets and only a few shots of the target task, model performance improves both monolingually and across languages.
arXiv Detail & Related papers (2023-05-23T14:04:12Z)
- Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages [18.210880703295253]
We finetune pretrained language models (PLMs) on seven languages from three different families.
We analyze their zero-shot performance on closely related, non-standardized varieties.
Overall, we find that the similarity between the percentage of words split into subwords in the source and target data is the strongest predictor of model performance on target data (see the split-rate sketch after this list).
arXiv Detail & Related papers (2023-04-20T08:32:34Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification [0.0]
We propose a new Arabic data augmentation (DA) method that employs AraGPT-2, a recent powerful generative model.
The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using the Euclidean, cosine, Jaccard, and BLEU distances (two of these measures are sketched after this list).
The experiments were conducted on four sentiment Arabic datasets: AraSarcasm, ASTD, ATT, and MOVIE.
arXiv Detail & Related papers (2022-12-28T16:38:43Z)
- ArabGlossBERT: Fine-Tuning BERT on Context-Gloss Pairs for WSD [0.0]
This paper presents our work to fine-tune BERT models for Arabic Word Sense Disambiguation (WSD).
We constructed a dataset of labeled Arabic context-gloss pairs.
Each pair was labeled as True or False, and target words in each context were identified and annotated (a sentence-pair fine-tuning sketch follows this list).
arXiv Detail & Related papers (2022-05-19T16:47:18Z)
- Arabic Offensive Language on Twitter: Analysis and Experiments [9.879488163141813]
We introduce a method for building a dataset that is not biased by topic, dialect, or target.
We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech.
arXiv Detail & Related papers (2020-04-05T13:05:11Z)
- Parameter Space Factorization for Zero-Shot Learning across Tasks and Languages [112.65994041398481]
We propose a Bayesian generative model for the space of neural parameters.
We infer the posteriors over such latent variables based on data from seen task-language combinations.
Our model yields comparable or better results than state-of-the-art, zero-shot cross-lingual transfer methods.
arXiv Detail & Related papers (2020-01-30T16:58:56Z)
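The frequency-divergence sketch referenced in the dataset-drift entry: vocabulary drift is described there as content-word frequency divergence, which could be measured, for example, as a Jensen-Shannon divergence between unigram distributions. This is a hypothetical illustration; the paper's exact metric may differ.

```python
# Hypothetical vocabulary-drift score: Jensen-Shannon divergence between
# the content-word frequency distributions of two corpora.
from collections import Counter
import math

def js_divergence(corpus_a: list[str], corpus_b: list[str]) -> float:
    """JSD between unigram distributions; 0 = identical, 1 = disjoint."""
    freq_a, freq_b = Counter(corpus_a), Counter(corpus_b)
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    jsd = 0.0
    for word in set(freq_a) | set(freq_b):
        p, q = freq_a[word] / total_a, freq_b[word] / total_b
        m = (p + q) / 2
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd

# Inputs are lists of content words (already tokenized and
# stopword-filtered); higher scores indicate stronger vocabulary drift.
```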
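The split-rate sketch referenced in the POS-tagging entry: the quantity the study correlates with transfer performance is the percentage of words a tokenizer splits into multiple subwords. A minimal sketch, assuming a Hugging Face tokenizer (the checkpoint name is only an example):

```python
# Fraction of words that a subword tokenizer splits into 2+ pieces.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def split_rate(words: list[str]) -> float:
    """Percentage of words split into more than one subword token."""
    split = sum(1 for word in words if len(tokenizer.tokenize(word)) > 1)
    return 100.0 * split / len(words)

# Per the paper's finding, the closer the source and target split rates,
# the better the expected zero-shot performance on the target variety.
print(split_rate("dies ist ein Beispielsatz".split()))
```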
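Two of the similarity measures named in the Arabic data-augmentation entry, sketched hypothetically (the summary does not specify the exact computation, e.g. which embeddings feed the cosine distance):

```python
# Hypothetical sketches of two similarity measures used to score
# generated sentences against their seed sentences.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence-embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two sentences."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)

# A filtering policy might keep a generated sentence only if it is
# similar enough to the seed without duplicating it (assumption).
```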
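The sentence-pair fine-tuning sketch referenced in the ArabGlossBERT entry, which is also the setup the main paper fine-tunes on: TSV treats each (context, gloss) pair as a binary True/False classification. The checkpoint and single-step training below are illustrative, not the papers' exact configuration.

```python
# TSV as binary sentence-pair classification over (context, gloss) pairs.
# Checkpoint and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "aubmindlab/bert-base-arabertv02"  # an Arabic BERT; assumption
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

context = "..."  # an Arabic sentence containing the target word
gloss = "..."    # a candidate dictionary gloss for that word
inputs = tokenizer(context, gloss, truncation=True, return_tensors="pt")
labels = torch.tensor([1])  # 1 = gloss matches the target sense (True)

loss = model(**inputs, labels=labels).loss
loss.backward()  # one step; a real run would use an optimizer/Trainer
```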