Related papers: Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models

Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models

URL: http://arxiv.org/abs/2407.06957v1
Date: Tue, 9 Jul 2024 15:35:43 GMT
Title: Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models
Authors: Yi-Cheng Lin, Tzu-Quan Lin, Chih-Kai Yang, Ke-Han Lu, Wei-Chih Chen, Chun-Yi Kuan, Hung-yi Lee,
Abstract summary: We evaluate gender bias in SILLMs across four semantic-related tasks. Our analysis reveals that bias levels are language-dependent and vary with different evaluation methods.
Score: 38.64792118903994
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Speech Integrated Large Language Models (SILLMs) combine large language models with speech perception to perform diverse tasks, such as emotion recognition to speaker verification, demonstrating universal audio understanding capability. However, these models may amplify biases present in training data, potentially leading to biased access to information for marginalized groups. This work introduces a curated spoken bias evaluation toolkit and corresponding dataset. We evaluate gender bias in SILLMs across four semantic-related tasks: speech-to-text translation (STT), spoken coreference resolution (SCR), spoken sentence continuation (SSC), and spoken question answering (SQA). Our analysis reveals that bias levels are language-dependent and vary with different evaluation methods. Our findings emphasize the necessity of employing multiple approaches to comprehensively assess biases in SILLMs, providing insights for developing fairer SILLM systems.

Related papers

Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations [21.050704978484784]
We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite.<n>We evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations.<n>Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases.
arXiv Detail & Related papers (2026-02-01T05:34:34Z)
Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities.<n>Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training.<n>Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z)
Audio Large Language Models Can Be Descriptive Speech Quality Evaluators [46.765203628127345]
We introduce the first natural language-based speech evaluation corpus, generated from authentic human ratings. This corpus offers detailed analysis across multiple dimensions and identifies causes of quality degradation. We propose an alignment approach with LLM distillation (ALLD) to guide the audio LLM in extracting relevant information from raw speech.
arXiv Detail & Related papers (2025-01-27T22:47:51Z)
Scaling for Fairness? Analyzing Model Size, Data Composition, and Multilinguality in Vision-Language Bias [14.632649933582648]
We investigate how dataset composition, model size, and multilingual training affect gender and racial bias in a popular VLM, CLIP, and its open source variants. To assess social perception bias, we measure the zero-shot performance on face images featuring socially charged terms.
arXiv Detail & Related papers (2025-01-22T21:08:30Z)
Self-Powered LLM Modality Expansion for Large Speech-Text Models [62.27700381806554]
Large language models (LLMs) exhibit remarkable performance across diverse tasks. This study aims to refine the use of speech datasets for LSM training by addressing the limitations of vanilla instruction tuning. We introduce a self-powered LSM that leverages augmented automatic speech recognition data generated by the model itself for more effective instruction tuning.
arXiv Detail & Related papers (2024-10-04T04:34:24Z)
Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs) By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases. The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z)
An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system. We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
The Balancing Act: Unmasking and Alleviating ASR Biases in Portuguese [5.308321515594125]
This study is dedicated to a comprehensive exploration of the Whisper and MMS systems. Our investigation encompasses various categories, including gender, age, skin tone color, and geo-location. We empirically show that oversampling techniques alleviate such stereotypical biases.
arXiv Detail & Related papers (2024-02-12T09:35:13Z)
Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes. With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech. We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions. Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.