Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
- URL: http://arxiv.org/abs/2406.08818v1
- Date: Thu, 13 Jun 2024 05:20:42 GMT
- Title: Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
- Authors: Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, Dan Klein
- Abstract summary: A large-scale study of linguistic bias in ChatGPT covers ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world).
We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via linguistic feature annotation and native speaker evaluation.
We find that GPT-3.5 Turbo and GPT-4 exhibit linguistic discrimination in ways that can exacerbate harms for speakers of non-"standard" varieties.
- Score: 29.162606891172615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard British English, and eight widely spoken non-"standard" varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation. We find that the models default to "standard" varieties of English; based on evaluation by native speakers, we also find that model responses to non-"standard" varieties consistently exhibit a range of issues: lack of comprehension (10% worse compared to "standard" varieties), stereotyping (16% worse), demeaning content (22% worse), and condescending responses (12% worse). We also find that if these models are asked to imitate the writing style of prompts in non-"standard" varieties, they produce text that exhibits lower comprehension of the input and is especially prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but it also results in a marked increase in stereotyping (+17%). The results suggest that GPT-3.5 Turbo and GPT-4 exhibit linguistic discrimination in ways that can exacerbate harms for speakers of non-"standard" varieties.
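The evaluation pipeline the abstract describes can be sketched as follows. This is a minimal illustration under assumptions of ours, not the authors' code: the variety labels, example sentences, and helper names are all hypothetical, and the model call is left as a pluggable callable (in practice it would wrap a chat-completions API request to GPT-3.5 Turbo or GPT-4).

```python
# Sketch: prompt a chat model with text by a native speaker of each English
# variety, collect the response, and queue it for native-speaker annotation.

# Hypothetical prompts, one per variety; the study used text by native speakers.
PROMPTS = {
    "Standard American English": "I'm about to head to the store. Do you need anything?",
    "Singaporean English": "I go buy things later, you want anything or not?",
}

# Features drawn from the abstract's evaluation criteria.
ANNOTATION_FEATURES = ["comprehension", "stereotyping", "demeaning", "condescension"]

def build_messages(speaker_text: str) -> list[dict]:
    """Wrap the speaker's text as a bare user turn, with no system prompt,
    so the model's default register is what gets evaluated."""
    return [{"role": "user", "content": speaker_text}]

def make_annotation_task(variety: str, response: str) -> dict:
    """Package a model response for native-speaker rating on each feature."""
    return {
        "variety": variety,
        "response": response,
        "ratings": {feature: None for feature in ANNOTATION_FEATURES},
    }

def run_study(query_model) -> list[dict]:
    """query_model is any callable taking a messages list and returning a
    response string; in practice it would wrap an API call to the model
    under test."""
    return [
        make_annotation_task(variety, query_model(build_messages(text)))
        for variety, text in PROMPTS.items()
    ]
```

Keeping the model behind a plain callable makes the same harness reusable across GPT-3.5 Turbo, GPT-4, or any other model, which matches the paper's cross-model comparison.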
Related papers
- GPT-3.5 for Grammatical Error Correction [0.4757470449749875]
This paper investigates the application of GPT-3.5 for Grammatical Error Correction (GEC) in multiple languages.
We conduct automatic evaluations of the corrections proposed by GPT-3.5 using several methods.
For English, GPT-3.5 demonstrates high recall, generates fluent corrections, and generally preserves sentence semantics.
However, human evaluation for both English and Russian reveals that, despite its strong error-detection capabilities, GPT-3.5 struggles with several error types.
arXiv Detail & Related papers (2024-05-14T09:51:09Z)
- Quite Good, but Not Enough: Nationality Bias in Large Language Models -- A Case Study of ChatGPT [4.998396762666333]
This study investigates nationality bias in ChatGPT (GPT-3.5), a large language model (LLM) designed for text generation.
The research covers 195 countries, 4 temperature settings, and 3 distinct prompt types, generating 4,680 discourses about nationality descriptions in Chinese and English.
arXiv Detail & Related papers (2024-05-11T12:11:52Z)
- ChatGPT v.s. Media Bias: A Comparative Study of GPT-3.5 and Fine-tuned Language Models [0.276240219662896]
This study seeks to answer this question by leveraging the Media Bias Identification Benchmark (MBIB)
It assesses ChatGPT's competency in distinguishing six categories of media bias, juxtaposed against fine-tuned models such as BART, ConvBERT, and GPT-2.
The findings present a dichotomy: ChatGPT performs on par with fine-tuned models in detecting hate speech and text-level context bias, yet struggles with subtler forms of bias.
arXiv Detail & Related papers (2024-03-29T13:12:09Z)
- Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs [45.906366638174624]
This paper reports the first study on the behavior of large language models with reference to conversion.
We design a task for testing the degree to which models can generalize over words in a construction with a non-prototypical part of speech.
We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open source language models are also able to perform it.
arXiv Detail & Related papers (2024-03-26T16:45:27Z)
- What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
- Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties [0.0]
We aim to address the issue of bias at its root - the data itself.
We curate a dataset of tweets from countries with high proportions of underserved English variety speakers.
Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries.
arXiv Detail & Related papers (2024-01-21T13:18:20Z)
- Shepherd: A Critic for Language Model Generation [72.24142023628694]
We introduce Shepherd, a language model specifically tuned to critique responses and suggest refinements.
At the core of our approach is a high quality feedback dataset, which we curate from community feedback and human annotations.
In human evaluation, Shepherd strictly outperforms other models and on average closely ties with ChatGPT.
arXiv Detail & Related papers (2023-08-08T21:23:23Z)
- Is ChatGPT A Good Translator? Yes With GPT-4 As The Engine [97.8609714773255]
We evaluate ChatGPT for machine translation, including translation prompt, multilingual translation, and translation robustness.
ChatGPT performs competitively with commercial translation products but lags behind significantly on low-resource or distant languages.
With the launch of the GPT-4 engine, the translation performance of ChatGPT is significantly boosted.
arXiv Detail & Related papers (2023-01-20T08:51:36Z)
- NormSAGE: Multi-Lingual Multi-Cultural Norm Discovery from Conversations On-the-Fly [61.77957329364812]
We introduce a framework for addressing the novel task of conversation-grounded multi-lingual, multi-cultural norm discovery.
NormSAGE elicits knowledge about norms through directed questions representing the norm discovery task and conversation context.
It further addresses the risk of language model hallucination with a self-verification mechanism ensuring that the norms discovered are correct.
arXiv Detail & Related papers (2022-10-16T18:30:05Z)
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.