Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios
- URL: http://arxiv.org/abs/2406.00343v2
- Date: Thu, 13 Jun 2024 17:53:45 GMT
- Title: Beyond Metrics: Evaluating LLMs' Effectiveness in Culturally Nuanced, Low-Resource Real-World Scenarios
- Authors: Millicent Ochieng, Varun Gumma, Sunayana Sitaram, Jindong Wang, Vishrav Chaudhary, Keshet Ronen, Kalika Bali, Jacki O'Neill,
- Abstract summary: This research evaluates the performance of seven leading Large Language Models (LLMs) in sentiment analysis on a dataset derived from WhatsApp chats.
We find that while Mistral-7b and Mixtral-8x7b achieved high F1 scores, they and other LLMs such as GPT-3.5-Turbo, Llama-2-70b, and Gemma-7b struggled with understanding linguistic and contextual nuances.
GPT-4 and GPT-4-Turbo excelled in grasping diverse linguistic inputs and managing various contextual information.
- Score: 29.56889133557681
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The deployment of Large Language Models (LLMs) in real-world applications presents both opportunities and challenges, particularly in multilingual and code-mixed communication settings. This research evaluates the performance of seven leading LLMs in sentiment analysis on a dataset derived from multilingual and code-mixed WhatsApp chats, including Swahili, English and Sheng. Our evaluation includes both quantitative analysis using metrics like F1 score and qualitative assessment of LLMs' explanations for their predictions. We find that, while Mistral-7b and Mixtral-8x7b achieved high F1 scores, they and other LLMs such as GPT-3.5-Turbo, Llama-2-70b, and Gemma-7b struggled with understanding linguistic and contextual nuances, as well as lack of transparency in their decision-making process as observed from their explanations. In contrast, GPT-4 and GPT-4-Turbo excelled in grasping diverse linguistic inputs and managing various contextual information, demonstrating high consistency with human alignment and transparency in their decision-making process. The LLMs however, encountered difficulties in incorporating cultural nuance especially in non-English settings with GPT-4s doing so inconsistently. The findings emphasize the necessity of continuous improvement of LLMs to effectively tackle the challenges of culturally nuanced, low-resource real-world settings and the need for developing evaluation benchmarks for capturing these issues.
Related papers
- Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions.
Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks.
languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z) - Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation.
Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs.
Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z) - Hate Personified: Investigating the role of LLMs in content moderation [64.26243779985393]
For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model's (LLM) ability to represent diverse groups is unclear.
By including additional context in prompts, we analyze LLM's sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected.
arXiv Detail & Related papers (2024-10-03T16:43:17Z) - MindMerger: Efficient Boosting LLM Reasoning in non-English Languages [26.334092384176518]
Reasoning capabilities are crucial for Large Language Models (LLMs)
We propose MindMerger, which merges LLMs with the external language understanding capabilities from multilingual models.
MindMerger consistently outperforms all baselines, especially in low-resource languages.
arXiv Detail & Related papers (2024-05-27T17:41:54Z) - Benchmarking Large Language Models on CFLUE -- A Chinese Financial Language Understanding Evaluation Dataset [7.954348293179786]
We propose CFLUE, a benchmark to assess the capability of large language models (LLMs) across various dimensions.
In knowledge assessment, it consists of 38K+ multiple-choice questions with associated solution explanations.
In application assessment, it features 16K+ test instances across distinct groups of NLP tasks such as text classification, machine translation, relation extraction, reading comprehension, and text generation.
arXiv Detail & Related papers (2024-05-17T05:03:40Z) - METAL: Towards Multilingual Meta-Evaluation [12.852595634767901]
This study proposes a framework for an end-to-end assessment of Large Language Models (LLMs) as evaluators in multilingual scenarios.
We create a dataset covering 10 languages containing native speaker judgments for the task of summarization.
We compare the performance of LLM-based evaluators created using GPT-3.5-Turbo, GPT-4, and PaLM2.
arXiv Detail & Related papers (2024-04-02T06:14:54Z) - Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models [11.563813473794013]
Open-sourced large language models (LLMs) exhibit limited effectiveness in processing Vietnamese.
To mitigate these issues, we have finetuned LLMs specifically for Vietnamese.
Our evaluation results reveal that the fine-tuned LLMs exhibit enhanced comprehension and generative capabilities in Vietnamese.
arXiv Detail & Related papers (2024-03-05T07:13:28Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.