Beyond English: Evaluating LLMs for Arabic Grammatical Error Correction
- URL: http://arxiv.org/abs/2312.08400v1
- Date: Wed, 13 Dec 2023 05:33:25 GMT
- Title: Beyond English: Evaluating LLMs for Arabic Grammatical Error Correction
- Authors: Sang Yun Kwon, Gagan Bhatia, El Moatez Billah Nagoudi, Muhammad
Abdul-Mageed
- Abstract summary: Large language models (LLMs) finetuned to follow human instruction have exhibited significant capabilities in English NLP tasks.
We evaluate the abilities of instruction finetuned LLMs in Arabic grammatical error correction (GEC)
We find that instruction finetuned models, regardless of their size, are still outperformed by fully finetuned ones, even if they are significantly smaller in size.
- Score: 19.970419667319046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) finetuned to follow human instruction have
recently exhibited significant capabilities in various English NLP tasks.
However, their performance in grammatical error correction (GEC), especially on
languages other than English, remains significantly unexplored. In this work,
we evaluate the abilities of instruction finetuned LLMs in Arabic GEC, a
complex task due to Arabic's rich morphology. Our findings suggest that various
prompting methods, coupled with (in-context) few-shot learning, demonstrate
considerable effectiveness, with GPT-4 achieving up to $65.49$ F$_{1}$ score
under expert prompting (approximately $5$ points higher than our established
baseline). Despite these positive results, we find that instruction finetuned
models, regardless of their size, are still outperformed by fully finetuned
ones, even if they are significantly smaller in size. This disparity highlights
substantial room for improvements for LLMs. Inspired by methods used in
low-resource machine translation, we also develop a method exploiting synthetic
data that significantly outperforms previous models on two standard Arabic
benchmarks. Our best model achieves a new SOTA on Arabic GEC, with $73.29$ and
$73.26$ F$_{1}$ on the 2014 and 2015 QALB datasets, respectively, compared to
peer-reviewed published baselines.
Related papers
- Danoliteracy of Generative, Large Language Models [1.3873323883842132]
We present a GLLM benchmark to evaluate Danoliteracy, a measure of Danish language and cultural competency.
We find one strong underlying factor explaining $95%$ of scenario performance variance for GLLMs in Danish.
arXiv Detail & Related papers (2024-10-30T09:18:31Z) - One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [55.35278531907263]
We present the first study on Large Language Models' fairness and robustness to a dialect in canonical reasoning tasks.
We hire AAVE speakers to rewrite seven popular benchmarks, such as HumanEval and GSM8K.
We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE.
arXiv Detail & Related papers (2024-10-14T18:44:23Z) - To Distill or Not to Distill? On the Robustness of Robust Knowledge Distillation [16.655022975392992]
Current multilingual ASR models are compute-intensive and lack proper comprehensive evaluations.
We distill knowledge from large teacher models into smaller student variants that are more efficient.
Our best-distilled model's overall performance ($45.0$% WER) surpasses that of a SoTA model twice its size.
arXiv Detail & Related papers (2024-06-06T21:11:53Z) - Zero-Shot Cross-Lingual Reranking with Large Language Models for
Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba)
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z) - How good are Large Language Models on African Languages? [18.660783984850845]
We present an analysis of four popular large language models (mT0, Aya, LLaMa 2, and GPT-4) on six tasks across 60 African languages.
Our results suggest that all LLMs produce lower performance for African languages, and there is a large gap in performance compared to high-resource languages.
arXiv Detail & Related papers (2023-11-14T08:10:14Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large
Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means to allocate work from results on different datasets, with up to 21% performance improvement over random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z) - ChatGPT for Arabic Grammatical Error Correction [5.945320097465418]
Large language models (LLMs) fine-tuned to follow human instruction have exhibited significant capabilities in English NLP tasks.
In this paper, we delve into abilities of instruction fine-tuned LLMs in Arabic GEC, a task made complex due to Arabic's rich morphology.
We find that instruction fine-tuned models, regardless of their size, significantly underperform compared to fully fine-tuned models of significantly smaller sizes.
arXiv Detail & Related papers (2023-08-08T18:00:39Z) - TIM: Teaching Large Language Models to Translate with Comparison [78.66926087162672]
We propose a novel framework using examples in comparison to teach LLMs to learn translation.
Our approach involves presenting the model with examples of correct and incorrect translations and using a preference loss to guide the model's learning.
Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations.
arXiv Detail & Related papers (2023-07-10T08:15:40Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - LAraBench: Benchmarking Arabic AI with Large Language Models [26.249084464525044]
LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks.
We utilize models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM to tackle 33 distinct tasks across 61 publicly available datasets.
This involved 98 experimental setups, encompassing 296K data points, 46 hours of speech, and 30 sentences for Text-to-Speech (TTS)
arXiv Detail & Related papers (2023-05-24T10:16:16Z) - Transcending Scaling Laws with 0.1% Extra Compute [128.13903265447675]
Scaling language models improves performance but comes with significant computational costs.
This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute.
We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics.
arXiv Detail & Related papers (2022-10-20T16:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.