The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages
- URL: http://arxiv.org/abs/2310.14557v1
- Date: Mon, 23 Oct 2023 04:22:44 GMT
- Title: The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages
- Authors: Chiyu Zhang, Khai Duy Doan, Qisheng Liao, Muhammad Abdul-Mageed
- Abstract summary: We present SPARROW, an extensive benchmark specifically designed for cross-lingual sociopragmatic meaning (SM) understanding.
SPARROW comprises 169 datasets covering 13 task types across six primary categories (e.g., anti-social language detection, emotion recognition).
We evaluate the performance of various multilingual pretrained language models (e.g., mT5) and instruction-tuned LLMs (e.g., BLOOMZ, ChatGPT) on SPARROW through fine-tuning, zero-shot, and/or few-shot learning.
- Score: 17.055109973224265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-tuned large language models (LLMs), such as ChatGPT, demonstrate
remarkable performance in a wide range of tasks. Despite numerous recent
studies that examine the performance of instruction-tuned LLMs on various NLP
benchmarks, there remains a lack of comprehensive investigation into their
ability to understand cross-lingual sociopragmatic meaning (SM), i.e., meaning
embedded within social and interactive contexts. This deficiency arises partly
from SM not being adequately represented in any of the existing benchmarks. To
address this gap, we present SPARROW, an extensive multilingual benchmark
specifically designed for SM understanding. SPARROW comprises 169 datasets
covering 13 task types across six primary categories (e.g., anti-social
language detection, emotion recognition). SPARROW datasets encompass 64
different languages originating from 12 language families representing 16
writing scripts. We evaluate the performance of various multilingual pretrained
language models (e.g., mT5) and instruction-tuned LLMs (e.g., BLOOMZ, ChatGPT)
on SPARROW through fine-tuning, zero-shot, and/or few-shot learning. Our
comprehensive analysis reveals that existing open-source instruction-tuned LLMs
still struggle to understand SM across various languages, performing close to a
random baseline in some cases. We also find that although ChatGPT outperforms
many LLMs, it still falls behind task-specific fine-tuned models, with a gap of
12.19 points in SPARROW score. Our benchmark is available at:
https://github.com/UBC-NLP/SPARROW
Related papers
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models [59.54423478596468]
We introduce OMGEval, the first Open-source Multilingual Generative test set that can assess the capability of LLMs in different languages.
For each language, OMGEval provides 804 open-ended questions, covering a wide range of important capabilities of LLMs.
Specifically, the current version of OMGEval includes 5 languages (i.e., Zh, Ru, Fr, Es, Ar).
arXiv Detail & Related papers (2024-02-21T04:42:41Z)
- CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language.
CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances.
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [90.73517523001149]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct.
We propose different training strategies to build powerful xMR LLMs, named MathOctopus, which notably outperform conventional open-source LLMs.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)
- XSemPLR: Cross-Lingual Semantic Parsing in Multiple Natural Languages and Meaning Representations [25.50509874992198]
Cross-Lingual Semantic Parsing aims to translate queries in multiple natural languages into meaning representations.
Existing CLSP models are separately proposed and evaluated on datasets of limited tasks and applications.
We present XSemPLR, a unified benchmark for cross-lingual semantic parsing featured with 22 natural languages and 8 meaning representations.
arXiv Detail & Related papers (2023-06-07T01:09:37Z)
- LAraBench: Benchmarking Arabic AI with Large Language Models [26.249084464525044]
LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks.
We utilize models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM to tackle 33 distinct tasks across 61 publicly available datasets.
This involved 98 experimental setups, encompassing 296K data points, 46 hours of speech, and 30 sentences for Text-to-Speech (TTS).
arXiv Detail & Related papers (2023-05-24T10:16:16Z)
- Multilingual Large Language Models Are Not (Yet) Code-Switchers [41.47534626749588]
Large Language Models (LLMs) have recently shown great capabilities in a wide range of tasks.
The practice of alternating languages within an utterance remains relatively uncharted.
We argue that current "multilingualism" in LLMs does not inherently imply proficiency with code-switching texts.
arXiv Detail & Related papers (2023-05-23T16:50:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.