LLMs Do Not See Age: Assessing Demographic Bias in Automated Systematic Review Synthesis
- URL: http://arxiv.org/abs/2511.06000v1
- Date: Sat, 08 Nov 2025 13:12:36 GMT
- Authors: Favour Yahdii Aghaebe, Tanefa Apekey, Elizabeth Williams, Nafise Sadat Moosavi
- Abstract summary: We evaluate how well state-of-the-art language models retain age-related information when generating abstractive summaries of biomedical studies. We construct DemogSummary, a novel age-stratified dataset of systematic review primary studies.
- Score: 10.334277776439423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clinical interventions often hinge on age: medications and procedures safe for adults may be harmful to children or ineffective for older adults. However, as language models are increasingly integrated into biomedical evidence synthesis workflows, it remains uncertain whether these systems preserve such crucial demographic distinctions. To address this gap, we evaluate how well state-of-the-art language models retain age-related information when generating abstractive summaries of biomedical studies. We construct DemogSummary, a novel age-stratified dataset of systematic review primary studies, covering child, adult, and older adult populations. We evaluate three prominent summarisation-capable LLMs, Qwen (open-source), Longformer (open-source) and GPT-4.1 Nano (proprietary), using both standard metrics and a newly proposed Demographic Salience Score (DSS), which quantifies age-related entity retention and hallucination. Our results reveal systematic disparities across models and age groups: demographic fidelity is lowest for adult-focused summaries, and under-represented populations are more prone to hallucinations. These findings highlight the limitations of current LLMs in faithful and bias-free summarisation and point to the need for fairness-aware evaluation frameworks and summarisation pipelines in biomedical NLP.
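The abstract does not give the formula behind the proposed Demographic Salience Score (DSS). A minimal sketch, assuming DSS rewards retention of age-related entities from the source study and penalises hallucinated ones (the keyword list, tokenisation, and the retention-minus-hallucination form are all illustrative assumptions, not the paper's definition):

```python
# Hypothetical sketch of a Demographic Salience Score (DSS).
# Assumption: DSS compares age-related entities in the summary against
# the source, rewarding retention and penalising hallucination.

AGE_TERMS = {
    "child", "children", "adult", "adults", "older", "elderly",
    "infant", "infants", "adolescent", "adolescents",
    "paediatric", "pediatric", "geriatric",
}

def age_entities(text: str) -> set[str]:
    """Collect age-related terms appearing in the text (lowercased)."""
    tokens = {t.strip(".,;:()").lower() for t in text.split()}
    return tokens & AGE_TERMS

def demographic_salience_score(source: str, summary: str) -> float:
    """Retention rate minus hallucination rate, clipped to [0, 1]."""
    src, summ = age_entities(source), age_entities(summary)
    if not src:
        # No age information to preserve: perfect unless invented.
        return 1.0 if not summ else 0.0
    retention = len(src & summ) / len(src)
    hallucination = len(summ - src) / max(len(summ), 1)
    return max(0.0, retention - hallucination)
```

A real implementation would presumably use a biomedical NER model rather than a keyword list; this version only illustrates the retention-versus-hallucination trade-off the abstract describes.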
Related papers
- Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) [90.45301024940329]
Language models (LMs) often struggle to generate diverse, human-like creative content. We introduce Infinity-Chat, a large-scale dataset of 26K diverse, real-world, open-ended user queries. We present a large-scale study of mode collapse in LMs, revealing a pronounced Artificial Hivemind effect.
arXiv Detail & Related papers (2025-10-27T03:16:21Z) - Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study [1.6682715542079583]
Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases. We present a comprehensive evaluation framework aimed at English texts to assess the ability of LLMs in detecting demographic-targeted social biases. We then conduct a systematic evaluation with models across scales and techniques, including prompting, in-context learning, and fine-tuning.
arXiv Detail & Related papers (2025-10-06T09:45:32Z) - Bridging the gap in FER: addressing age bias in deep learning [0.562479170374811]
We study age-related bias in deep FER models, with a particular focus on the elderly population. Using Explainable AI (XAI) techniques, we identify systematic disparities in expression recognition and attention patterns. Results show consistent improvements in recognition accuracy for elderly individuals.
arXiv Detail & Related papers (2025-07-10T11:07:13Z) - Generative AI for Synthetic Data Across Multiple Medical Modalities: A Systematic Review of Recent Developments and Challenges [2.1835659964186087]
This paper presents a systematic review of generative models used to synthesize various medical data types.
Our study encompasses a broad array of medical data modalities and explores various generative models.
arXiv Detail & Related papers (2024-06-27T14:00:11Z) - A Demographic-Conditioned Variational Autoencoder for fMRI Distribution Sampling and Removal of Confounds [49.34500499203579]
We create a variational autoencoder (VAE)-based model, DemoVAE, to decorrelate fMRI features from demographics.
We generate high-quality synthetic fMRI data based on user-supplied demographics.
arXiv Detail & Related papers (2024-05-13T17:49:20Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - The Generation Gap: Exploring Age Bias in the Value Systems of Large Language Models [26.485974783643464]
We find a general inclination of the values of Large Language Models (LLMs) towards younger demographics, especially when compared to the US population. This inclination toward younger groups, however, varies across value categories.
arXiv Detail & Related papers (2024-04-12T18:36:20Z) - Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting [64.80538055623842]
Sociodemographic prompting is a technique that steers the output of prompt-based models towards answers that humans with specific sociodemographic profiles would give.
We show that sociodemographic information affects model predictions and can be beneficial for improving zero-shot learning in subjective NLP tasks.
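The sociodemographic prompting idea above can be illustrated with a toy prompt builder; the template wording and profile fields are assumptions for illustration, not the paper's actual prompts:

```python
# Hypothetical sociodemographic prompting: prefix the task prompt with
# a persona description so predictions reflect that profile.

def sociodemographic_prompt(task: str, text: str, profile: dict) -> str:
    """Build a prompt conditioned on a sociodemographic profile."""
    persona = ", ".join(f"{k}: {v}" for k, v in profile.items())
    return (
        f"Imagine you are a person with this profile: {persona}. "
        f"{task}\nText: {text}\nAnswer:"
    )

prompt = sociodemographic_prompt(
    "Is the following text offensive? Answer yes or no.",
    "That's a ridiculous take.",
    {"age": "65+", "gender": "female"},
)
```

The resulting string would then be sent to any prompt-based model; varying the profile while holding the task fixed is what allows the paper's sensitivity and robustness comparisons.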
arXiv Detail & Related papers (2023-09-13T15:42:06Z) - Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z) - Relational Subsets Knowledge Distillation for Long-tailed Retinal Diseases Recognition [65.77962788209103]
We propose class subset learning by dividing the long-tailed data into multiple class subsets according to prior knowledge.
This forces the model to focus on learning subset-specific knowledge. The proposed framework proved effective for long-tailed retinal disease recognition.
arXiv Detail & Related papers (2021-04-22T13:39:33Z) - Age-Net: An MRI-Based Iterative Framework for Brain Biological Age Estimation [18.503467872057424]
The concept of biological age (BA) is hard to grasp mainly due to the lack of a clearly defined reference standard.
We propose a new imaging-based framework for organ-specific BA estimation.
arXiv Detail & Related papers (2020-09-22T19:04:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.