Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
- URL: http://arxiv.org/abs/2406.08818v3
- Date: Tue, 17 Sep 2024 05:29:50 GMT
- Title: Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
- Authors: Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, Dan Klein
- Abstract summary: A large-scale study of linguistic bias in ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world).
We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via linguistic feature annotation and native speaker evaluation.
We find that GPT-3.5 Turbo and GPT-4 can perpetuate linguistic discrimination toward speakers of non-"standard" varieties.
- Score: 29.162606891172615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to "standard" varieties of English; based on evaluation by native speakers, we also find that model responses to non-"standard" varieties consistently exhibit a range of issues: stereotyping (19% worse than for "standard" varieties), demeaning content (25% worse), lack of comprehension (9% worse), and condescending responses (15% worse). We also find that if these models are asked to imitate the writing style of prompts in non-"standard" varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but also exhibits a marked increase in stereotyping (+18%). The results indicate that GPT-3.5 Turbo and GPT-4 can perpetuate linguistic discrimination toward speakers of non-"standard" varieties.
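A minimal sketch of the prompting setup the abstract describes (illustrative only, not the authors' released code): native-speaker text from each variety is sent to GPT-3.5 Turbo and GPT-4, and the raw responses are collected for later linguistic feature annotation and native-speaker evaluation. It assumes the OpenAI Python client; the dialect labels and example prompts are placeholders.

```python
# Illustrative sketch of the study's prompting protocol; not the authors' code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder inputs standing in for the study's native-speaker prompts.
prompts_by_variety = {
    "Standard American English": "I was heading to the store yesterday when it started raining.",
    "Indian English": "I was going to the store yesterday itself when the rains came.",
}

responses = {}
for variety, text in prompts_by_variety.items():
    for model in ("gpt-3.5-turbo", "gpt-4"):
        completion = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": text}],
        )
        # Store the raw reply for downstream annotation and evaluation.
        responses[(variety, model)] = completion.choices[0].message.content
```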
Related papers
- Dutch CrowS-Pairs: Adapting a Challenge Dataset for Measuring Social Biases in Language Models for Dutch [6.522338519818378]
A Dutch version of the US-specific CrowS-Pairs dataset for measuring bias in Dutch language models is introduced. The resulting dataset consists of 1,463 sentence pairs that cover bias in nine categories, such as sexual orientation, gender, and disability. Using the English and French versions of the CrowS-Pairs dataset, bias was evaluated in English (BERT and RoBERTa) and French (FlauBERT and CamemBERT) language models.
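A minimal sketch of the masked-language-model scoring that CrowS-Pairs-style evaluations build on (simplified: the official metric conditions only on the tokens shared by both sentences, whereas this version scores every token). It assumes Hugging Face transformers; the sentence pair is a placeholder.

```python
# Simplified CrowS-Pairs-style pseudo-log-likelihood comparison.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# A placeholder sentence pair: a systematic preference for the stereotyping
# variant over the anti-stereotyping one counts as evidence of bias.
stereo = pseudo_log_likelihood("Women can't drive well.")
anti = pseudo_log_likelihood("Men can't drive well.")
print("prefers stereotyping sentence:", stereo > anti)
```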
arXiv Detail & Related papers (2025-07-22T10:38:02Z)
- Quite Good, but Not Enough: Nationality Bias in Large Language Models -- A Case Study of ChatGPT [4.998396762666333]
This study investigates nationality bias in ChatGPT (GPT-3.5), a large language model (LLM) designed for text generation.
The research covers 195 countries, 4 temperature settings, and 3 distinct prompt types, generating 4,680 discourses describing nationalities in Chinese and English.
arXiv Detail & Related papers (2024-05-11T12:11:52Z)
- ChatGPT v.s. Media Bias: A Comparative Study of GPT-3.5 and Fine-tuned Language Models [0.276240219662896]
This study asks whether ChatGPT can detect media bias, leveraging the Media Bias Identification Benchmark (MBIB).
It assesses ChatGPT's competency in distinguishing six categories of media bias, juxtaposed against fine-tuned models such as BART, ConvBERT, and GPT-2.
The findings present a dichotomy: ChatGPT performs on par with fine-tuned models in detecting hate speech and text-level context bias, yet struggles with subtler forms of bias.
arXiv Detail & Related papers (2024-03-29T13:12:09Z)
- Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs [45.906366638174624]
This paper reports the first study of the behavior of large language models with respect to conversion (zero-derivation).
We design a task for testing the degree to which models can generalize over words in a construction with a non-prototypical part of speech.
We find that GPT-4 performs best on the task, followed by GPT-3.5, but that open-source language models are also able to perform it.
arXiv Detail & Related papers (2024-03-26T16:45:27Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Multilingual large language models leak human stereotypes across language boundaries [25.903732543380528]
We study how training a model multilingually may lead to stereotypes expressed in one language showing up in its behaviour in another.
We propose a measurement framework for stereotype leakage and investigate its effect across English, Russian, Chinese, and Hindi.
We find that GPT-3.5 exhibits the most stereotype leakage, and Hindi is the most susceptible to leakage effects.
arXiv Detail & Related papers (2023-12-12T10:24:17Z)
- Shepherd: A Critic for Language Model Generation [72.24142023628694]
We introduce Shepherd, a language model specifically tuned to critique responses and suggest refinements.
At the core of our approach is a high quality feedback dataset, which we curate from community feedback and human annotations.
In human evaluation, Shepherd strictly outperforms other models and on average closely ties with ChatGPT.
arXiv Detail & Related papers (2023-08-08T21:23:23Z)
- GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP [21.6253870440136]
This study conducts a large-scale automated and human evaluation of ChatGPT, encompassing 44 distinct language understanding and generation tasks.
Our findings indicate that, despite its remarkable performance in English, ChatGPT is consistently surpassed by smaller models that have undergone finetuning on Arabic.
arXiv Detail & Related papers (2023-05-24T10:12:39Z)
- Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine [97.8609714773255]
We evaluate ChatGPT for machine translation, including translation prompt, multilingual translation, and translation robustness.
ChatGPT performs competitively with commercial translation products but lags significantly behind on low-resource or linguistically distant languages.
With the launch of the GPT-4 engine, the translation performance of ChatGPT is significantly boosted.
arXiv Detail & Related papers (2023-01-20T08:51:36Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
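A minimal sketch of the cross-lingual in-context learning described above (illustrative; the task, demonstrations, and test input are placeholders): labelled examples in one language are concatenated into a prompt, followed by a test input in another language, and the whole string is fed to an autoregressive model.

```python
# Illustrative few-shot prompt assembly for cross-lingual in-context learning.
demonstrations = [
    ("The movie was wonderful.", "positive"),
    ("The food was awful.", "negative"),
]
test_input = "La película fue fantástica."  # Spanish test example

prompt = "".join(
    f"Review: {text}\nSentiment: {label}\n\n" for text, label in demonstrations
)
prompt += f"Review: {test_input}\nSentiment:"
print(prompt)  # pass this string to the language model and read off the label
```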
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.