The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages
- URL: http://arxiv.org/abs/2310.14557v1
- Date: Mon, 23 Oct 2023 04:22:44 GMT
- Title: The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages
- Authors: Chiyu Zhang, Khai Duy Doan, Qisheng Liao, Muhammad Abdul-Mageed
- Abstract summary: We present SPARROW, an extensive benchmark specifically designed for cross-lingual sociopragmatic meaning (SM) understanding.
SPARROW comprises 169 datasets covering 13 task types across six primary categories (e.g., anti-social language detection, emotion recognition).
We evaluate the performance of various multilingual pretrained language models (e.g., mT5) and instruction-tuned LLMs (e.g., BLOOMZ, ChatGPT) on SPARROW through fine-tuning, zero-shot, and/or few-shot learning.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-tuned large language models (LLMs), such as ChatGPT, demonstrate
remarkable performance in a wide range of tasks. Despite numerous recent
studies that examine the performance of instruction-tuned LLMs on various NLP
benchmarks, there remains a lack of comprehensive investigation into their
ability to understand cross-lingual sociopragmatic meaning (SM), i.e., meaning
embedded within social and interactive contexts. This deficiency arises partly
from SM not being adequately represented in any of the existing benchmarks. To
address this gap, we present SPARROW, an extensive multilingual benchmark
specifically designed for SM understanding. SPARROW comprises 169 datasets
covering 13 task types across six primary categories (e.g., anti-social
language detection, emotion recognition). SPARROW datasets encompass 64
different languages originating from 12 language families representing 16
writing scripts. We evaluate the performance of various multilingual pretrained
language models (e.g., mT5) and instruction-tuned LLMs (e.g., BLOOMZ, ChatGPT)
on SPARROW through fine-tuning, zero-shot, and/or few-shot learning. Our
comprehensive analysis reveals that existing open-source instruction-tuned
LLMs still struggle to understand SM across various languages, performing
close to a random baseline in some cases. We also find that although ChatGPT
outperforms many LLMs, it still falls behind task-specific fine-tuned models,
with a gap of 12.19 points in SPARROW score. Our benchmark is available at:
https://github.com/UBC-NLP/SPARROW
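To make the zero-shot setting above concrete, here is a minimal sketch that prompts a small instruction-tuned multilingual model to label examples from a SPARROW-style classification task and scores it with macro-F1. The label set, prompt wording, and toy examples are illustrative assumptions, not the benchmark's actual format or protocol; those live in the repository linked above.

```python
# A minimal sketch of the zero-shot setting, assuming a SPARROW-style text
# classification task. The label set, prompt wording, and toy examples below
# are illustrative assumptions, NOT the benchmark's actual format.
from sklearn.metrics import f1_score
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "bigscience/mt0-small"  # a small instruction-tuned multilingual model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

LABELS = ["positive", "negative", "neutral"]  # hypothetical task labels

def classify(text: str) -> str:
    """Zero-shot: prompt the model for a label and map its reply onto LABELS."""
    prompt = (
        f"Classify the sentiment of this text as one of {', '.join(LABELS)}.\n"
        f"Text: {text}\nLabel:"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=5)
    reply = tokenizer.decode(output_ids[0], skip_special_tokens=True).lower()
    # Replies that match no label fall back to the first one, so off-label
    # generations still count against the model.
    return next((label for label in LABELS if label in reply), LABELS[0])

# Toy examples standing in for one dataset in the benchmark.
examples = [("I love this song!", "positive"), ("This is awful.", "negative")]
predictions = [classify(text) for text, _ in examples]
gold = [label for _, label in examples]
print("macro-F1:", f1_score(gold, predictions, average="macro"))
```

Averaging per-dataset scores of this kind across the benchmark is one way to arrive at a single benchmark-level number like the SPARROW score referenced above.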
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization
Cross-lingual summarization (CLS) aims to generate a summary of a source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
However, recent studies have shown that their performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
- Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions.
Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks.
Languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
- Do Large Language Models Speak All Languages Equally? A Comparative Study in Low-Resource Settings
Large language models (LLMs) have garnered significant interest in natural language processing (NLP).
Recent studies have highlighted the limitations of LLMs in low-resource languages.
We present datasets for sentiment and hate speech tasks by translating from English to Bangla, Hindi, and Urdu.
arXiv Detail & Related papers (2024-08-05T05:09:23Z)
- Understanding and Mitigating Language Confusion in LLMs
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
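The entry above proposes assembling exemplars from high-resource languages to prompt translation into English. The sketch below illustrates that idea under our own assumptions; the exemplar pairs and prompt template are invented, and the paper's actual prompt construction may differ.

```python
# A toy sketch, under our own assumptions, of prompting with exemplars drawn
# from diverse high-resource languages, as the entry above proposes. The
# exemplar pairs, template, and query are invented for illustration.

# (source sentence, English translation) pairs from high-resource languages
EXEMPLARS = [
    ("Bonjour, comment allez-vous ?", "Hello, how are you?"),             # French
    ("¿Dónde está la estación de tren?", "Where is the train station?"),  # Spanish
    ("Ich habe das Buch gestern gelesen.", "I read the book yesterday."), # German
]

def build_prompt(query: str) -> str:
    """Assemble a linguistically diverse few-shot prompt ending with the query."""
    lines = ["Translate each sentence into English."]
    for source, target in EXEMPLARS:
        lines.append(f"Sentence: {source}\nEnglish: {target}")
    lines.append(f"Sentence: {query}\nEnglish:")
    return "\n\n".join(lines)

# The completed prompt would be sent to any instruction-tuned LLM, which is
# expected to continue with the English translation of the final sentence.
print(build_prompt("Habari ya asubuhi"))  # stand-in for a low-resource-language input
```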
- LAraBench: Benchmarking Arabic AI with Large Language Models
LAraBench addresses the benchmarking gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks.
We utilize models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM to tackle 33 distinct tasks across 61 publicly available datasets.
This involved 98 experimental setups, encompassing 296K data points, 46 hours of speech, and 30 sentences for Text-to-Speech (TTS).
arXiv Detail & Related papers (2023-05-24T10:16:16Z)
- Multilingual Large Language Models Are Not (Yet) Code-Switchers
Large Language Models (LLMs) have recently shown great capabilities in a wide range of tasks.
However, code-switching, the practice of alternating languages within an utterance, remains relatively uncharted.
We argue that current "multilingualism" in LLMs does not inherently imply proficiency with code-switching texts.
arXiv Detail & Related papers (2023-05-23T16:50:48Z)