Who Gets Left Behind? Auditing Disability Inclusivity in Large Language Models
- URL: http://arxiv.org/abs/2509.00963v1
- Date: Sun, 31 Aug 2025 19:12:01 GMT
- Title: Who Gets Left Behind? Auditing Disability Inclusivity in Large Language Models
- Authors: Deepika Dash, Yeshil Bangera, Mithil Bangera, Gouthami Vadithya, Srikant Panda,
- Abstract summary: We present a taxonomy-aligned benchmark of human-validated, general-purpose accessibility questions. Our benchmark evaluates models along three dimensions: Question-Level Coverage, Disability-Level Coverage, and Depth. Applying this framework to 17 proprietary and open-weight models reveals persistent inclusivity gaps.
- Score: 0.6931288002857499
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly used for accessibility guidance, yet many disability groups remain underserved by their advice. To address this gap, we present a taxonomy-aligned benchmark of human-validated, general-purpose accessibility questions, designed to systematically audit inclusivity across disabilities. Our benchmark evaluates models along three dimensions: Question-Level Coverage (breadth within answers), Disability-Level Coverage (balance across nine disability categories), and Depth (specificity of support). Applying this framework to 17 proprietary and open-weight models reveals persistent inclusivity gaps: Vision, Hearing, and Mobility are frequently addressed, while Speech, Genetic/Developmental, Sensory-Cognitive, and Mental Health remain underserved. Depth is similarly concentrated in a few categories but sparse elsewhere. These findings reveal who gets left behind in current LLM accessibility guidance and highlight actionable levers: taxonomy-aware prompting/training and evaluations that jointly audit breadth, balance, and depth.
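To make the three dimensions concrete, here is a minimal Python sketch of how such an audit could be scored. The annotation format, the 0-2 specificity scale, and part of the category list are illustrative assumptions, not the paper's released benchmark artifacts.

```python
# Minimal sketch of the three audit dimensions, assuming each model answer
# has been annotated with the disability categories it addresses and a
# per-category specificity rating. The nine-category list is partly inferred
# from the abstract and partly placeholder; it is NOT the paper's taxonomy.
from collections import defaultdict

CATEGORIES = [
    "Vision", "Hearing", "Mobility", "Speech", "Genetic/Developmental",
    "Sensory-Cognitive", "Mental Health",
    "Cognitive",       # placeholder: the abstract names only seven categories
    "Chronic Health",  # placeholder
]

def question_level_coverage(addressed: set) -> float:
    """Breadth within one answer: fraction of taxonomy categories it covers."""
    return len(addressed & set(CATEGORIES)) / len(CATEGORIES)

def disability_level_coverage(per_answer: list) -> dict:
    """Balance across answers: share of answers addressing each category."""
    counts = defaultdict(int)
    for addressed in per_answer:
        for cat in addressed & set(CATEGORIES):
            counts[cat] += 1
    n = max(len(per_answer), 1)
    return {cat: counts[cat] / n for cat in CATEGORIES}

def depth_by_category(ratings: list) -> dict:
    """Depth: mean specificity (0=absent, 1=generic, 2=actionable) over
    the answers that address each category."""
    totals, counts = defaultdict(float), defaultdict(int)
    for per_cat in ratings:  # one {category: rating} dict per answer
        for cat, score in per_cat.items():
            if cat in CATEGORIES and score > 0:
                totals[cat] += score
                counts[cat] += 1
    return {cat: totals[cat] / counts[cat] if counts[cat] else 0.0
            for cat in CATEGORIES}
```

Under a scheme like this, the reported gaps would surface as near-zero disability-level coverage and depth for Speech, Genetic/Developmental, Sensory-Cognitive, and Mental Health, even when question-level breadth looks reasonable.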
Related papers
- From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas [1.8594711725515678]
We introduce GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages. We propose an LLM-assisted construction and quality-control pipeline with retrieval, deduplication, evidence-grounding checks, and label validation to improve consistency at scale.
arXiv Detail & Related papers (2026-01-31T03:29:30Z) - Auditing Disability Representation in Vision-Language Models [0.6987503477818553]
We study disability-aware descriptions for person-centric images. We introduce a benchmark based on paired Neutral Prompts (NP) and Disability-Contextualised Prompts (DP). We evaluate 15 state-of-the-art open- and closed-source vision-language models in a zero-shot setting across 9 disability categories.
arXiv Detail & Related papers (2026-01-24T07:25:43Z) - AccessEval: Benchmarking Disability Bias in Large Language Models [3.160274015679566]
Large Language Models (LLMs) are increasingly deployed across diverse domains but often exhibit disparities in how they handle real-life queries. We introduce AccessEval (Accessibility Evaluation), a benchmark evaluating 21 closed- and open-source LLMs across 6 real-world domains and 9 disability types. Our analysis reveals that responses to disability-aware queries tend to have a more negative tone, increased stereotyping, and more factual errors than responses to neutral queries.
arXiv Detail & Related papers (2025-09-22T17:49:03Z) - Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs [2.722784054643991]
Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone. The role of disability cues in shaping these inferences remains largely uncharted. We present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs.
arXiv Detail & Related papers (2025-08-18T21:03:09Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision-Language Models (MedVLMs) by adjusting model structure, fine-tuning on high-quality data, or applying preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - Beyond Keywords: Evaluating Large Language Model Classification of Nuanced Ableism [2.0435202333125977]
Large language models (LLMs) are increasingly used in decision-making tasks like résumé screening and content moderation. We evaluate the ability of four LLMs to identify nuanced ableism directed at autistic individuals. Our results reveal that LLMs can identify autism-related language but often miss harmful or offensive connotations.
arXiv Detail & Related papers (2025-05-26T20:01:44Z) - Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey [49.1574468325115]
We conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations. We provide detailed overviews within each category and highlight challenges in this field. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.
arXiv Detail & Related papers (2025-05-21T19:17:29Z) - Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks [229.73714829399802]
This survey probes the core challenges that the rise of Large Language Models poses for evaluation. We identify and analyze two pivotal transitions, the first of which is from task-specific to capability-based evaluation, reorganizing benchmarks around core competencies such as knowledge, reasoning, instruction following, multi-modal understanding, and safety. We dissect the core challenges of these transitions from the perspectives of methods, datasets, evaluators, and metrics.
arXiv Detail & Related papers (2025-04-26T07:48:52Z) - Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective [5.769786334333616]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP)-based applications, including automated text generation, question answering, and others.
However, they face a significant challenge: hallucinations, where models produce plausible-sounding but factually incorrect responses.
This paper discusses these open challenges, covering state-of-the-art datasets and benchmarks as well as methods for knowledge integration and for evaluating hallucinations.
arXiv Detail & Related papers (2024-11-21T16:09:05Z) - CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs [74.36850397755572]
CATCH addresses issues related to visual defects that cause diminished fine-grained feature perception and cumulative hallucinations in open-ended scenarios.
It is applicable to various visual question-answering tasks without requiring any specific data or prior knowledge, and generalizes robustly to new tasks without additional training.
arXiv Detail & Related papers (2024-11-19T18:27:31Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - PsyEval: A Suite of Mental Health Related Tasks for Evaluating Large Language Models [34.09419351705938]
This paper presents PsyEval, the first comprehensive suite of mental health-related tasks for evaluating Large Language Models (LLMs).
This comprehensive framework is designed to thoroughly assess the unique challenges and intricacies of mental health-related tasks.
arXiv Detail & Related papers (2023-11-15T18:32:27Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)