Related papers: Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism

Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism

URL: http://arxiv.org/abs/2505.20500v1
Date: Mon, 26 May 2025 20:01:44 GMT
Title: Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism
Authors: Naba Rizvi, Harper Strickland, Saleha Ahmedi, Aekta Kallepalli, Isha Khirwadkar, William Wu, Imani N. S. Munyaka, Nedjma Ousidhoum,
Abstract summary: Large language models (LLMs) are increasingly used in decision-making tasks like r'esum'e screening and content moderation.<n>We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals.<n>Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations.
Score: 2.0435202333125977
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in decision-making tasks like r\'esum\'e screening and content moderation, giving them the power to amplify or suppress certain perspectives. While previous research has identified disability-related biases in LLMs, little is known about how they conceptualize ableism or detect it in text. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. We examine the gap between their understanding of relevant terminology and their effectiveness in recognizing ableist content in context. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations. Further, we conduct a qualitative comparison of human and LLM explanations. We find that LLMs tend to rely on surface-level keyword matching, leading to context misinterpretations, in contrast to human annotators who consider context, speaker identity, and potential impact. On the other hand, both LLMs and humans agree on the annotation scheme, suggesting that a binary classification is adequate for evaluating LLM performance, which is consistent with findings from prior studies involving human annotators.

Related papers

Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection [5.731621080995591]
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation.<n>Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign.<n>Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement.
arXiv Detail & Related papers (2025-12-10T14:00:48Z)
The Erosion of LLM Signatures: Can We Still Distinguish Human and LLM-Generated Scientific Ideas After Iterative Paraphrasing? [0.7162422068114824]
We evaluate the ability of state-of-the-art (SOTA) machine learning models to differentiate between human and LLM-generated ideas.<n>Our findings highlight the challenges SOTA models face in source attribution.
arXiv Detail & Related papers (2025-12-04T23:22:21Z)
How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs [13.822169295436177]
We investigate how large language models (LLMs) process the temporal meaning of linguistic aspect in narratives that were previously used in human studies.<n>Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect.<n>These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding.
arXiv Detail & Related papers (2025-07-18T18:28:35Z)
Psycholinguistic Word Features: a New Approach for the Evaluation of LLMs Alignment with Humans [2.7013338932521416]
We evaluate the alignment of a representative group of LLMs with human ratings on psycholinguistic datasets.<n>The results show that alignment is textcolorblackgenerally better in the Glasgow norms evaluated.<n>This suggests a potential limitation of current LLMs in aligning with human sensory associations for words.
arXiv Detail & Related papers (2025-05-29T20:56:48Z)
Probing Association Biases in LLM Moderation Over-Sensitivity [42.191744175730726]
Large Language Models are widely used for content moderation but often misclassify benign comments as toxic.<n>We introduce Topic Association Analysis, a semantic-level approach to quantify how LLMs associate certain topics with toxicity.<n>More advanced models (e.g., GPT-4 Turbo) demonstrate stronger topic stereotype despite lower overall false positive rates.
arXiv Detail & Related papers (2025-05-29T18:07:48Z)
How Deep is Love in LLMs' Hearts? Exploring Semantic Size in Human-like Cognition [75.11808682808065]
This study investigates whether large language models (LLMs) exhibit similar tendencies in understanding semantic size.<n>Our findings reveal that multi-modal training is crucial for LLMs to achieve more human-like understanding.<n> Lastly, we examine whether LLMs are influenced by attention-grabbing headlines with larger semantic sizes in a real-world web shopping scenario.
arXiv Detail & Related papers (2025-03-01T03:35:56Z)
Is LLM an Overconfident Judge? Unveiling the Capabilities of LLMs in Detecting Offensive Language with Annotation Disagreement [22.992484902761994]
This study systematically evaluates the performance of multiple Large Language Models (LLMs) in detecting offensive language.<n>We analyze binary classification accuracy, examine the relationship between model confidence and human disagreement, and explore how disagreement samples influence model decision-making.
arXiv Detail & Related papers (2025-02-10T07:14:26Z)
Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances [1.3597551064547497]
Humans exhibit remarkable cognitive abilities to separate semantically significant content from speech-specific noise. We investigate whether large language models (LLMs) can effectively perform analogical speech comprehension tasks.
arXiv Detail & Related papers (2024-10-07T14:55:20Z)
Hate Personified: Investigating the role of LLMs in content moderation [64.26243779985393]
For subjective tasks such as hate detection, where people perceive hate differently, the Large Language Model's (LLM) ability to represent diverse groups is unclear. By including additional context in prompts, we analyze LLM's sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected.
arXiv Detail & Related papers (2024-10-03T16:43:17Z)
A Methodology for Explainable Large Language Models with Integrated Gradients and Linguistic Analysis in Text Classification [2.556395214262035]
Neurological disorders that affect speech production, such as Alzheimer's Disease (AD), significantly impact the lives of both patients and caregivers. Recent advancements in Large Language Model (LLM) architectures have developed many tools to identify representative features of neurological disorders through spontaneous speech. This paper presents an explainable LLM method, named SLIME, capable of identifying lexical components representative of AD.
arXiv Detail & Related papers (2024-09-30T21:45:02Z)
Potential and Limitations of LLMs in Capturing Structured Semantics: A Case Study on SRL [78.80673954827773]
Large Language Models (LLMs) play a crucial role in capturing structured semantics to enhance language understanding, improve interpretability, and reduce bias. We propose using Semantic Role Labeling (SRL) as a fundamental task to explore LLMs' ability to extract structured semantics. We find interesting potential: LLMs can indeed capture semantic structures, and scaling-up doesn't always mirror potential. We are surprised to discover that significant overlap in the errors is made by both LLMs and untrained humans, accounting for almost 30% of all errors.
arXiv Detail & Related papers (2024-05-10T11:44:05Z)
The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLM) We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions. Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z)
FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks. We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding [1.3654846342364308]
Large Language Models (LLMs) are unparalleled in their ability to generate grammatically correct, fluent text. This position paper critically assesses three points recurring in critiques of LLM capacities. We outline a pragmatic perspective on the issue of real' understanding and intentionality in LLMs.
arXiv Detail & Related papers (2023-10-30T15:51:04Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.