An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case
- URL: http://arxiv.org/abs/2507.19156v1
- Date: Fri, 25 Jul 2025 10:57:29 GMT
- Title: An Empirical Investigation of Gender Stereotype Representation in Large Language Models: The Italian Case
- Authors: Gioele Giachino, Marco Rondina, Antonio Vetrò, Riccardo Coppola, Juan Carlos De Martin
- Abstract summary: This study examines how Large Language Models shape responses to ungendered prompts, contributing to biased outputs. The results highlight how content generated by LLMs can perpetuate stereotypes. The presence of bias in AI-generated text can have significant implications in many fields, such as workplaces and job selection.
- Score: 0.41942958779358674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing use of Large Language Models (LLMs) in a wide variety of domains has sparked worries about how easily they can perpetuate stereotypes and contribute to the generation of biased content. With a focus on gender and professional bias, this work examines how LLMs shape responses to ungendered prompts and thereby contribute to biased outputs. The analysis uses a structured experimental method, submitting prompts that involve three professional job combinations, each characterized by a hierarchical relationship. The study uses Italian, a language with extensive grammatical gender marking, to highlight potential limitations in current LLMs' ability to generate objective text in non-English languages. Two popular LLM-based chatbots are examined, namely OpenAI ChatGPT (gpt-4o-mini) and Google Gemini (gemini-1.5-flash). Through their APIs, we collected 3600 responses. The results highlight how content generated by LLMs can perpetuate stereotypes. For example, Gemini associated 100% (ChatGPT 97%) of 'she' pronouns with the 'assistant' rather than the 'manager'. The presence of bias in AI-generated text can have significant implications in many fields, such as workplaces and job selection, raising ethical concerns about its use. Understanding these risks is pivotal to developing mitigation strategies and ensuring that AI-based systems do not increase social inequalities, but rather contribute to more equitable outcomes. Future research directions include expanding the study to additional chatbots or languages, refining prompt engineering methods, and building on a larger experimental base.
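The abstract specifies only that 3600 responses were collected through the chatbots' APIs and then analyzed for which professional role ends up expressed with feminine forms. As a rough illustration of that collection-and-tally protocol, the Python sketch below queries one of the two models and counts gendered Italian forms per role. It is not the authors' code: the prompt wording, the lexical markers, and the sample size are assumptions, and only the OpenAI side (gpt-4o-mini, named in the abstract) is shown; the Gemini side would be analogous through Google's SDK.

```python
"""Illustrative sketch only, not the authors' pipeline.

Rough shape of the protocol the abstract describes: send an ungendered
Italian prompt through an LLM API several times, then tally which role
(manager vs. assistant) is rendered with feminine or masculine forms.
Prompt text, lexical markers, and sample size are assumptions.
"""
import re
from collections import Counter

from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical ungendered prompt; the paper's three hierarchical job pairs
# and exact wording are not reproduced in the abstract.
PROMPT = (
    "Scrivi una breve storia su una persona che dirige un'azienda e su chi "
    "lavora come assistente di questa persona."
)

# Crude lexical heuristic for which grammatical gender each role receives.
ROLE_MARKERS = {
    ("manager", "feminine"): r"\b(la direttrice|la dirigente|la manager)\b",
    ("manager", "masculine"): r"\b(il direttore|il dirigente|il manager)\b",
    ("assistant", "feminine"): r"\b(la sua assistente|la segretaria)\b",
    ("assistant", "masculine"): r"\b(il suo assistente|il segretario)\b",
}


def collect(n_samples: int = 10, model: str = "gpt-4o-mini") -> Counter:
    """Query the chat API n_samples times and tally gendered forms per role."""
    tally: Counter = Counter()
    for _ in range(n_samples):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,
        )
        text = resp.choices[0].message.content or ""
        for (role, gender), pattern in ROLE_MARKERS.items():
            if re.search(pattern, text, flags=re.IGNORECASE):
                tally[(role, gender)] += 1
    return tally


if __name__ == "__main__":
    print(collect(n_samples=5))
```

In the study itself, counts of this kind over the 3600 collected responses underpin results such as the reported 100% (Gemini) and 97% (ChatGPT) association of 'she' pronouns with the 'assistant' role.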
Related papers
- Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in Large Language Models (LLMs) reasoning tasks. We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z)
- BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages [93.92804151830744]
We present BRIGHTER, a collection of multi-labeled, emotion-annotated datasets in 28 different languages. We highlight the challenges related to the data collection and annotation processes. We show that the BRIGHTER datasets represent a meaningful step towards addressing the gap in text-based emotion recognition.
arXiv Detail & Related papers (2025-02-17T15:39:50Z)
- Mind the Gap! Choice Independence in Using Multilingual LLMs for Persuasive Co-Writing Tasks in Different Languages [51.96666324242191]
We analyze whether user utilization of novel writing assistants in a charity advertisement writing task is affected by the AI's performance in a second language. We quantify the extent to which these patterns translate into the persuasiveness of generated charity advertisements.
arXiv Detail & Related papers (2025-02-13T17:49:30Z)
- Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning [23.456302461693053]
Possibility Exploration Fine-Tuning (PEFT) is a task-agnostic framework that enhances the text diversity of Large Language Models (LLMs) without increasing latency or computational cost. PEFT significantly enhances the diversity of LLM outputs, as evidenced by lower similarity between candidate responses. It can also notably reduce demographic bias in dialogue systems.
arXiv Detail & Related papers (2024-12-04T14:23:16Z)
- Large Language Models Reflect the Ideology of their Creators [71.65505524599888]
Large language models (LLMs) are trained on vast amounts of data to generate natural language. This paper shows that the ideological stance of an LLM appears to reflect the worldview of its creators.
arXiv Detail & Related papers (2024-10-24T04:02:30Z)
- Are Large Language Models Really Bias-Free? Jailbreak Prompts for Assessing Adversarial Robustness to Bias Elicitation [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, demonstrating remarkable computational power and linguistic capabilities. These models are inherently prone to various biases stemming from their training data. This study explores the presence of these biases within the responses given by the most recent LLMs, analyzing the impact on their fairness and reliability.
arXiv Detail & Related papers (2024-07-11T12:30:19Z)
- Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts [49.97673761305336]
We evaluate three large language models (LLMs) for their alignment with human narrative styles and potential gender biases.
Our findings indicate that, while these models generally produce text closely resembling human-authored content, variations in stylistic features suggest significant gender biases.
arXiv Detail & Related papers (2024-06-27T19:26:11Z)
- Hire Me or Not? Examining Language Model's Behavior with Occupation Attributes [7.718858707298602]
Large language models (LLMs) have been widely integrated into production pipelines, like recruitment and recommendation systems. This paper investigates LLMs' behavior with respect to gender stereotypes, in the context of occupation decision making.
arXiv Detail & Related papers (2024-05-06T18:09:32Z)
- White Men Lead, Black Women Help? Benchmarking and Mitigating Language Agency Social Biases in LLMs [58.27353205269664]
Social biases can manifest in language agency in Large Language Model (LLM)-generated content. We introduce the Language Agency Bias Evaluation benchmark, which comprehensively evaluates biases in LLMs. Using LABE, we unveil language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral.
arXiv Detail & Related papers (2024-04-16T12:27:54Z)
- Multilingual Text-to-Image Generation Magnifies Gender Stereotypes and Prompt Engineering May Not Help You [64.74707085021858]
We show that multilingual models suffer from significant gender biases just as monolingual models do. We propose a novel benchmark, MAGBIG, intended to foster research on gender bias in multilingual models. Our results show that not only do models exhibit strong gender biases but they also behave differently across languages.
arXiv Detail & Related papers (2024-01-29T12:02:28Z)
- Sociodemographic Prompting is Not Yet an Effective Approach for Simulating Subjective Judgments with LLMs [13.744746481528711]
Large Language Models (LLMs) are widely used to simulate human responses across diverse contexts. We evaluate nine popular LLMs on their ability to understand demographic differences in two subjective judgment tasks: politeness and offensiveness. We find that in zero-shot settings, most models' predictions for both tasks align more closely with labels from White participants than those from Asian or Black participants.
arXiv Detail & Related papers (2023-11-16T10:02:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.