Related papers: Generalists vs. Specialists: Evaluating Large Language Models for Urdu

Generalists vs. Specialists: Evaluating Large Language Models for Urdu

URL: http://arxiv.org/abs/2407.04459v3
Date: Thu, 3 Oct 2024 09:32:31 GMT
Title: Generalists vs. Specialists: Evaluating Large Language Models for Urdu
Authors: Samee Arif, Abdul Hameed Azeemi, Agha Ali Raza, Awais Athar,
Abstract summary: We compare general-purpose models, GPT-4-Turbo and Llama-3-8b, with special-purpose models--XLM-Roberta-large, mT5-large, and Llama-3-8b--that have been fine-tuned on specific tasks. We focus on seven classification and seven generation tasks to evaluate the performance of these models on Urdu language.
Score: 4.8539869147159616
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we compare general-purpose models, GPT-4-Turbo and Llama-3-8b, with special-purpose models--XLM-Roberta-large, mT5-large, and Llama-3-8b--that have been fine-tuned on specific tasks. We focus on seven classification and seven generation tasks to evaluate the performance of these models on Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural Language Processing (NLP). Despite the frequent advancements in Large Language Models (LLMs), their performance in low-resource languages, including Urdu, still needs to be explored. We also conduct a human evaluation for the generation tasks and compare the results with the evaluations performed by GPT-4-Turbo, Llama-3-8b and Claude 3.5 Sonnet. We find that special-purpose models consistently outperform general-purpose models across various tasks. We also find that the evaluation done by GPT-4-Turbo for generation tasks aligns more closely with human evaluation compared to the evaluation the evaluation done by Llama-3-8b. This paper contributes to the NLP community by providing insights into the effectiveness of general and specific-purpose LLMs for low-resource languages.

Related papers

Evaluating Modern Large Language Models on Low-Resource and Morphologically Rich Languages:A Cross-Lingual Benchmark Across Cantonese, Japanese, and Turkish [12.286855282078305]
GPT-4o, GPT-4, Claude3.5Sonnet, LLaMA3.1, MistralLarge2, LLaMA-2Chat13B, and Mistral7BInstruct are evaluated.<n>Our benchmark spans four diverse tasks: open-domain question answering, document summarization, English-to-X translation, and culturally grounded dialogue.
arXiv Detail & Related papers (2025-11-05T22:09:53Z)
Beyond Specialization: Benchmarking LLMs for Transliteration of Indian Languages [19.988253285050256]
Transliteration plays a crucial role in multilingual natural language processing.<n>Recent developments in large language models suggest a potential for general-purpose models to excel at this task without explicit task-specific training.<n>This work systematically evaluates the performance of prominent LLMs, including GPT-4o, GPT-4.5, GPT-4.1, Gemma-3-27B-it, and Mistral-Large against IndicXlit, a state-of-the-art transliteration model, across ten major Indian languages.
arXiv Detail & Related papers (2025-05-26T11:35:51Z)
Unraveling the Capabilities of Language Models in News Summarization [0.0]
This work provides a comprehensive benchmarking of 20 recent language models, focusing on smaller ones for the news summarization task. We focus in this study on zero-shot and few-shot learning settings and we apply a robust evaluation methodology. We highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities.
arXiv Detail & Related papers (2025-01-30T04:20:16Z)
Benchmarking Pre-trained Large Language Models' Potential Across Urdu NLP tasks [0.9786690381850356]
Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research. This study presents an in-depth examination of prominent LLMs, across 14 tasks using 15 Urdu datasets. Experiments show that SOTA models surpass all the encoder-decoder pre-trained language models in all Urdu NLP tasks with zero-shot learning.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks [30.54635848057259]
This paper conducts a comprehensive evaluation of well-known and high-performing large language models (LLMs) We select English and Chinese datasets encompassing Dialogue Generation and Text Summarization. Our study reports both automatic results, accompanied by a detailed analysis.
arXiv Detail & Related papers (2024-05-16T16:56:54Z)
SeaLLMs -- Large Language Models for Southeast Asia [76.50157503379086]
We introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities.
arXiv Detail & Related papers (2023-12-01T17:17:56Z)
Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? [20.476500441734427]
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks. Their evaluation, particularly in languages beyond the top $20$, remains inadequate due to existing benchmarks and metrics limitations.
arXiv Detail & Related papers (2023-09-14T06:41:58Z)
Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models [6.145834902689888]
Large language models (LLMs) have demonstrated impressive performance on various downstream tasks without requiring fine-tuning. Despite having a lower training proportion compared to English, these models also exhibit remarkable capabilities in other languages. In this study, we assess the performance of GPT-3.5 and GPT-4 models on seven distinct Arabic NLP tasks.
arXiv Detail & Related papers (2023-06-28T15:54:29Z)
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples. With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot. We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)
Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445]
In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge. We finetune smaller language models to generate useful intermediate context, referred to here as elaborations. Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
arXiv Detail & Related papers (2022-09-02T18:32:09Z)
Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages. Our largest model sets new state of the art in few-shot learning in more than 20 representative languages. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model) GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the flops for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.