Visual Stereotypes of Autism Spectrum in Janus-Pro-7B, DALL-E, Stable Diffusion, SDXL, FLUX, and Midjourney
- URL: http://arxiv.org/abs/2407.16292v3
- Date: Thu, 16 Oct 2025 11:56:26 GMT
- Title: Visual Stereotypes of Autism Spectrum in Janus-Pro-7B, DALL-E, Stable Diffusion, SDXL, FLUX, and Midjourney
- Authors: Maciej Wodziński, Marcin Rządeczka, Anastazja Szuła, Kacper Dudzic, Marcin Moskalewicz
- Abstract summary: This study examined whether six text-to-image models perpetuate non-rational beliefs regarding autism by comparing images generated in 2024-2025 with controls. Autistic individuals were depicted with striking homogeneity in skin color (white), gender (male), and age (young), often engaged in solitary activities, interacting with objects rather than people, and exhibiting stereotypical emotional expressions such as sadness, anger, or emotional flatness. We found significant differences between the models, albeit with a moderate effect size, and no differences between baseline and follow-up summary values, with the ratio of stereotypical themes to the number of images similar across all models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Avoiding systemic discrimination of neurodiverse individuals is an ongoing challenge in training AI models, which often propagate negative stereotypes. This study examined whether six text-to-image models (Janus-Pro-7B VL2 vs. VL3, DALL-E 3 v. April 2024 vs. August 2025, Stable Diffusion v. 1.6 vs. 3.5, SDXL v. April 2024 vs. FLUX.1 Pro, and Midjourney v. 5.1 vs. 7) perpetuate non-rational beliefs regarding autism by comparing images generated in 2024-2025 with controls. Fifty-three prompts aimed at neutrally visualizing concrete objects and abstract concepts related to autism were used against 53 controls (baseline total N=302; follow-up: 280 experimental images plus 265 controls). Expert assessment of the presence of common autism-related stereotypes employed a framework of 10 deductive codes, followed by statistical analysis. Autistic individuals were depicted with striking homogeneity in skin color (white), gender (male), and age (young), often engaged in solitary activities, interacting with objects rather than people, and exhibiting stereotypical emotional expressions such as sadness, anger, or emotional flatness. In contrast, the images of neurotypical individuals were more diverse and lacked such traits. We found significant differences between the models, albeit with a moderate effect size, and no differences between baseline and follow-up summary values; the ratio of stereotypical themes to the number of images was similar across all models. The control prompts showed a significantly lower degree of stereotyping, with large effect sizes, confirming the models' hidden biases. In summary, despite improvements in the technical aspects of image generation, the level of reproduction of potentially harmful autism-related stereotypes remained largely unaffected.
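To make the comparison concrete, the study's core contrast (stereotype prevalence under autism-related versus control prompts) can be framed as a contingency-table test. The sketch below is a minimal illustration with entirely hypothetical counts, not the authors' analysis code; it assumes each image receives a binary stereotyped/non-stereotyped code and reports Cramér's V as an effect size.

```python
# Illustrative sketch (not the authors' code): chi-squared test of
# stereotype prevalence across prompt conditions, with Cramer's V as
# the effect size. All counts below are hypothetical placeholders.
import numpy as np
from scipy.stats import chi2_contingency

# rows: prompt condition; columns: images coded stereotyped vs. not
table = np.array([
    [210, 70],   # autism-related prompts (hypothetical counts)
    [40, 225],   # control prompts        (hypothetical counts)
])

chi2, p, dof, _ = chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print(f"chi2={chi2:.2f}, p={p:.4g}, dof={dof}, Cramer's V={cramers_v:.2f}")
```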
Related papers
- Toward Reliable and Explainable Nail Disease Classification: Leveraging Adversarial Training and Grad-CAM Visualization [0.0]
This paper presents a machine learning-based model for automated classification of nail diseases based on a publicly available dataset. Four well-known CNN models (InceptionV3, DenseNet201, EfficientNetV2, and ResNet50) were trained and analyzed. InceptionV3 outperformed the others with an accuracy of 95.57%, while DenseNet201 came next with 94.79%.
arXiv Detail & Related papers (2026-02-04T18:08:13Z) - Artificial Rigidities vs. Biological Noise: A Comparative Analysis of Multisensory Integration in AV-HuBERT and Human Observers [0.0]
This study evaluates AV-HuBERT's perceptual bio-fidelity by benchmarking it against human observers. Results reveal a striking quantitative isomorphism: AI and humans exhibited nearly identical auditory dominance rates.
arXiv Detail & Related papers (2026-01-22T11:18:16Z) - KidVis: Do Multimodal Large Language Models Possess the Visual Perceptual Capabilities of a 6-Year-Old? [79.27736230305516]
We introduce KidVis, a novel benchmark grounded in the theory of human visual development. Evaluating 20 state-of-the-art MLLMs against a human physiological baseline reveals a stark performance disparity. This study confirms that current MLLMs, despite their reasoning prowess, lack the essential physiological perceptual primitives required for generalized visual intelligence.
arXiv Detail & Related papers (2026-01-13T07:32:50Z) - Mechanisms of Prompt-Induced Hallucination in Vision-Language Models [58.991412160253276]
We study the failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image. We identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.
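For readers unfamiliar with head ablation, the sketch below zeroes out selected heads' context vectors in a toy self-attention layer; the layer, weights, and head indices are illustrative assumptions, not the paper's models or the specific heads it identifies.

```python
# Minimal sketch of attention-head ablation (not the paper's code):
# silence chosen heads before the output projection of a toy layer.
import torch
import torch.nn.functional as F

def self_attention(x, wq, wk, wv, wo, n_heads, ablate_heads=()):
    b, t, d = x.shape
    hd = d // n_heads
    # project and split into heads: (b, n_heads, t, hd)
    q = (x @ wq).view(b, t, n_heads, hd).transpose(1, 2)
    k = (x @ wk).view(b, t, n_heads, hd).transpose(1, 2)
    v = (x @ wv).view(b, t, n_heads, hd).transpose(1, 2)
    att = F.softmax(q @ k.transpose(-2, -1) / hd**0.5, dim=-1)
    ctx = att @ v                      # per-head context vectors
    for h in ablate_heads:             # ablation: zero out chosen heads
        ctx[:, h] = 0.0
    ctx = ctx.transpose(1, 2).reshape(b, t, d)
    return ctx @ wo

d, n_heads = 64, 8
ws = [torch.randn(d, d) / d**0.5 for _ in range(4)]  # wq, wk, wv, wo
x = torch.randn(2, 10, d)
out = self_attention(x, *ws, n_heads=n_heads, ablate_heads=(2, 5))
print(out.shape)  # torch.Size([2, 10, 64])
```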
arXiv Detail & Related papers (2026-01-08T18:23:03Z) - When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models [4.240144901142787]
We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for measuring such biases. Our approach compares visual attributes of objects generated with demographic cues to those from neutral prompts. We uncover strong associations between specific demographic groups and visual attributes, such as recurring color patterns prompted by gender or ethnicity cues.
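A minimal sketch of the comparison idea, assuming a crude hue histogram as the visual attribute (the actual SODA attributes and pipeline are not reproduced here); the image arrays are random placeholders for generated objects.

```python
# Illustrative sketch (not the SODA implementation): compare the hue
# profile of objects from demographic-cued vs. neutral prompts.
import numpy as np

def hue_histogram(images, bins=12):
    """Mean histogram of a crude hue proxy (RGB channel angle)."""
    hues = []
    for img in images:                  # img: (H, W, 3) floats in [0, 1]
        r, g, b = img[..., 0], img[..., 1], img[..., 2]
        hue = np.arctan2(np.sqrt(3) * (g - b), 2 * r - g - b)
        hues.append(np.histogram(hue, bins=bins, range=(-np.pi, np.pi),
                                 density=True)[0])
    return np.mean(hues, axis=0)

rng = np.random.default_rng(0)
cued = [rng.random((64, 64, 3)) for _ in range(20)]     # placeholder sets
neutral = [rng.random((64, 64, 3)) for _ in range(20)]

h1, h2 = hue_histogram(cued), hue_histogram(neutral)
tv_distance = 0.5 * np.abs(h1 / h1.sum() - h2 / h2.sum()).sum()
print(f"total-variation distance between hue profiles: {tv_distance:.3f}")
```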
arXiv Detail & Related papers (2025-08-05T14:15:53Z) - Hidden Bias in the Machine: Stereotypes in Text-to-Image Models [0.0]
Text-to-Image (T2I) models have transformed visual content creation, producing highly realistic images from natural language prompts. We curated a diverse set of prompts spanning thematic categories such as occupations, traits, actions, ideologies, emotions, family roles, place descriptions, spirituality, and life events. For each of the 160 unique topics, we crafted multiple prompt variations to reflect a wide range of meanings and perspectives. Our analysis reveals significant disparities in the representation of gender, race, age, somatotype, and other human-centric factors across generated images.
arXiv Detail & Related papers (2025-06-09T23:06:04Z) - Fact-or-Fair: A Checklist for Behavioral Testing of AI Models on Fairness-Related Queries [85.909363478929]
In this study, we focus on 19 real-world statistics collected from authoritative sources.
We develop a checklist comprising objective and subjective queries to analyze the behavior of large language models.
We propose metrics to assess factuality and fairness, and formally prove the inherent trade-off between these two aspects.
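The trade-off can be illustrated with toy definitions: if factuality rewards matching a skewed real-world statistic and fairness rewards uniformity across groups, no prediction maximizes both. The metrics and numbers below are hypothetical stand-ins, not the paper's formal definitions.

```python
# Toy sketch of the factuality/fairness tension (not the paper's metrics).
import numpy as np

def factuality(pred, real):
    """1 minus total-variation distance from the real-world statistic."""
    return 1.0 - 0.5 * np.abs(pred - real).sum()

def fairness(pred):
    """Normalized entropy: 1.0 for a uniform distribution over groups."""
    p = pred[pred > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(pred)))

real = np.array([0.7, 0.2, 0.1])          # hypothetical skewed statistic
for name, pred in [("match reality", real),
                   ("uniform", np.full(3, 1 / 3))]:
    print(f"{name}: factuality={factuality(pred, real):.2f}, "
          f"fairness={fairness(pred):.2f}")
```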
arXiv Detail & Related papers (2025-02-09T10:54:11Z) - Hugging Rain Man: A Novel Facial Action Units Dataset for Analyzing Atypical Facial Expressions in Children with Autism Spectrum Disorder [2.3001245059699014]
We introduce a novel dataset, Hugging Rain Man, which includes facial action units (AUs) manually annotated by FACS experts for both children with ASD and typically developing (TD) children.
The dataset comprises a rich collection of posed and spontaneous facial expressions, totaling approximately 130,000 frames, along with 22 AUs, 10 Action Descriptors (ADs) and atypicality ratings.
arXiv Detail & Related papers (2024-11-21T02:51:52Z) - Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation [58.77994391566484]
We propose W1KP, a human-calibrated measure of variability in a set of images.
Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy.
We analyze 56 linguistic features of real prompts, finding that the prompt's length, CLIP embedding norm, concreteness, and word senses influence variability most.
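As a rough proxy for such a variability measure (not the calibrated W1KP metric itself), one can average pairwise cosine distances between embeddings of images generated from the same prompt; the embeddings below are random stand-ins for, e.g., CLIP features.

```python
# Illustrative variability proxy: mean pairwise cosine distance within
# a set of image embeddings generated from one prompt.
import numpy as np

def variability(embeddings):
    """Mean pairwise cosine distance within a set of image embeddings."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    iu = np.triu_indices(len(e), k=1)   # each unordered pair once
    return float(np.mean(1.0 - sims[iu]))

rng = np.random.default_rng(0)
tight = rng.normal(size=(8, 512)) * 0.1 + rng.normal(size=512)  # near-duplicates
loose = rng.normal(size=(8, 512))                               # diverse set
print(f"low-variability set:  {variability(tight):.3f}")
print(f"high-variability set: {variability(loose):.3f}")
```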
arXiv Detail & Related papers (2024-06-12T17:59:27Z) - Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z) - Hear Me, See Me, Understand Me: Audio-Visual Autism Behavior Recognition [47.550391816383794]
We introduce a novel problem of audio-visual autism behavior recognition.
Social behavior recognition is an essential aspect previously omitted in AI-assisted autism screening research.
We will release our dataset, code, and pre-trained models.
arXiv Detail & Related papers (2024-03-22T22:52:35Z) - The Male CEO and the Female Assistant: Evaluation and Mitigation of Gender Biases in Text-To-Image Generation of Dual Subjects [58.27353205269664]
We propose the Paired Stereotype Test (PST) framework, which queries T2I models to depict two individuals assigned male-stereotyped and female-stereotyped social identities. Using PST, we evaluate two aspects of gender bias: the well-known bias in gendered occupations and a novel aspect, bias in organizational power.
arXiv Detail & Related papers (2024-02-16T21:32:27Z) - Decoding Susceptibility: Modeling Misbelief to Misinformation Through a Computational Approach [61.04606493712002]
Susceptibility to misinformation describes the degree of belief in unverifiable claims, a quantity that is not directly observable.
Existing susceptibility studies heavily rely on self-reported beliefs.
We propose a computational approach to model users' latent susceptibility levels.
arXiv Detail & Related papers (2023-11-16T07:22:56Z) - Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z) - Stable Bias: Analyzing Societal Representations in Diffusion Models [72.27121528451528]
We propose a new method for exploring the social biases in Text-to-Image (TTI) systems.
Our approach relies on characterizing the variation in generated images triggered by enumerating gender and ethnicity markers in the prompts.
We leverage this method to analyze images generated by 3 popular TTI systems and find that while all of their outputs show correlations with US labor demographics, they also consistently under-represent marginalized identities to different extents.
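A minimal sketch of the enumeration idea: build a grid of prompts varying gender and ethnicity markers for a fixed occupation, whose generations can then be compared. The marker lists and prompt template are illustrative assumptions, not the paper's exact wording.

```python
# Illustrative prompt grid enumerating identity markers (hypothetical
# lists, not the paper's exact prompts).
from itertools import product

genders = ["woman", "man", "non-binary person"]
ethnicities = ["", "Black ", "East Asian ", "Hispanic ", "White "]
occupation = "software engineer"

prompts = [f"a photo of a {e}{g} working as a {occupation}"
           for e, g in product(ethnicities, genders)]

for p in prompts[:4]:
    print(p)
print(f"... {len(prompts)} prompts total")
```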
arXiv Detail & Related papers (2023-03-20T19:32:49Z) - Improving Deep Facial Phenotyping for Ultra-rare Disorder Verification Using Model Ensembles [52.77024349608834]
We analyze the influence of replacing a DCNN with a state-of-the-art face recognition approach, iResNet with ArcFace.
Our proposed ensemble model achieves state-of-the-art performance on both seen and unseen disorders.
arXiv Detail & Related papers (2022-11-12T23:28:54Z) - Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale [61.555788332182395]
We investigate the potential for machine learning models to amplify dangerous and complex stereotypes.
We find a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupations, or objects.
arXiv Detail & Related papers (2022-11-07T18:31:07Z) - Spatio-Temporal Attention in Multi-Granular Brain Chronnectomes for Detection of Autism Spectrum Disorder [5.908259551646475]
Graph-based learning techniques have demonstrated impressive results on resting-state functional magnetic resonance imaging (rs-fMRI) data.
IMAGIN achieves a 5-fold cross-validation accuracy of 79.25%, which surpasses the current state-of-the-art by 1.5%.
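The evaluation protocol (not IMAGIN itself) is easy to sketch: stratified 5-fold cross-validation over subjects, here with synthetic features and a simple baseline classifier standing in for the graph model.

```python
# Sketch of the 5-fold cross-validation protocol on synthetic stand-ins
# for rs-fMRI connectivity features (the actual graph model is not shown).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))            # placeholder connectome features
y = rng.integers(0, 2, size=200)          # placeholder ASD/control labels

accs = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True,
                              random_state=0).split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    accs.append(accuracy_score(y[te], clf.predict(X[te])))

print(f"5-fold accuracy: {np.mean(accs):.3f} +/- {np.std(accs):.3f}")
```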
arXiv Detail & Related papers (2022-10-30T01:43:17Z) - A Two-stage Multi-modal Affect Analysis Framework for Children with Autism Spectrum Disorder [3.029434408969759]
We present an open-source two-stage multi-modal approach that leverages acoustic and visual cues to predict three main affect states of children with ASD in real-world play therapy scenarios.
This work presents a novel way to combine human expertise and machine intelligence for ASD affect recognition by proposing a two-stage schema.
arXiv Detail & Related papers (2021-06-17T01:28:53Z) - Diagnosis of Autism in Children using Facial Analysis and Deep Learning [0.0]
We introduce a deep learning model that classifies children as either healthy or potentially autistic with 94.6% accuracy.
Autistic patients struggle with social skills, repetitive behaviors, and communication, both verbal and nonverbal.
Based on this accuracy, we propose that autism can be diagnosed effectively using only a picture.
arXiv Detail & Related papers (2020-08-06T22:15:20Z) - Gaze-based Autism Detection for Adolescents and Young Adults using Prosaic Videos [35.54632105027475]
We demonstrate that by monitoring a user's gaze as they watch commonplace (i.e., not specialized, structured or coded) video, we can identify individuals with autism spectrum disorder.
We recruited 35 autistic and 25 non-autistic individuals, and captured their gaze using an off-the-shelf eye tracker connected to a laptop. Within 15 seconds, our approach was 92.5% accurate at identifying individuals with an autism diagnosis.
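A minimal sketch of this kind of pipeline, under assumed (not the authors') feature choices: summarize a 15-second gaze window with simple dispersion and motion statistics, then classify with an SVM on synthetic placeholder data.

```python
# Illustrative gaze-window classification sketch (synthetic data;
# feature choices are assumptions, not the paper's pipeline).
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def window_features(trace):
    """trace: (T, 2) gaze coordinates for one 15 s window."""
    step = np.linalg.norm(np.diff(trace, axis=0), axis=1)
    return [trace[:, 0].std(), trace[:, 1].std(),   # spatial dispersion
            step.mean(), step.max()]                # saccade-like motion

# synthetic windows at ~60 Hz for 15 s (900 samples each)
X = np.array([window_features(rng.normal(size=(900, 2)) * s)
              for s in rng.uniform(0.5, 2.0, size=120)])
y = rng.integers(0, 2, size=120)                    # placeholder labels

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = SVC().fit(Xtr, ytr)
print(f"held-out accuracy (synthetic data): {clf.score(Xte, yte):.2f}")
```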
arXiv Detail & Related papers (2020-05-26T18:14:31Z)