Overlapped speech and gender detection with WavLM pre-trained features
- URL: http://arxiv.org/abs/2209.04167v1
- Date: Fri, 9 Sep 2022 08:00:47 GMT
- Title: Overlapped speech and gender detection with WavLM pre-trained features
- Authors: Martin Lebourdais, Marie Tahon, Antoine Laurent, Sylvain Meignier
- Abstract summary: This article focuses on overlapped speech and gender detection in order to study interactions between women and men in French audiovisual media.
We propose to use the WavLM model, which has the advantage of being pre-trained on a huge amount of speech data.
A neural gender detection (GD) system, trained with WavLM inputs on a gender-balanced subset of the French broadcast news ALLIES data, obtains an accuracy of 97.9%.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This article focuses on overlapped speech and gender detection in order to
study interactions between women and men in French audiovisual media (Gender
Equality Monitoring project). In this application context, we need to
automatically segment the speech signal according to speaker gender and to
identify when at least two speakers speak at the same time. We propose to use
the WavLM model, which has the advantage of being pre-trained on a huge amount
of speech data, to build overlapped speech detection (OSD) and gender
detection (GD) systems. In this study, we use two different corpora: the DIHARD
III corpus, which is well adapted to the OSD task but lacks gender information,
and the ALLIES corpus, which fits the project's application context. Our best OSD
system is a Temporal Convolutional Network (TCN) with WavLM pre-trained
features as input, which reaches a new state-of-the-art F1-score on
DIHARD. A neural GD system is trained with WavLM inputs on a gender-balanced
subset of the French broadcast news ALLIES data and obtains an accuracy of 97.9%.
This work opens new perspectives for human science researchers regarding the
differences in representation between women and men in French media.
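At its core, the OSD task reduces to frame-level binary labeling: a frame is positive when at least two speakers are active simultaneously. A minimal sketch of deriving such labels from speaker segments, independent of the paper's actual WavLM/TCN pipeline and using a hypothetical `(speaker, start, end)` segment format:

```python
def overlap_labels(segments, n_frames, frame_dur=0.02):
    """Label each frame 1 if at least two speakers are simultaneously active.

    segments: list of (speaker, start_sec, end_sec) tuples -- an illustrative
    format, not the paper's actual data schema.
    """
    labels = [0] * n_frames
    for i in range(n_frames):
        t = i * frame_dur  # frame start time in seconds
        active = {spk for spk, start, end in segments if start <= t < end}
        labels[i] = 1 if len(active) >= 2 else 0
    return labels

# Two speakers overlapping between 1.0 s and 1.5 s, over 2 s of 20 ms frames
segs = [("A", 0.0, 1.5), ("B", 1.0, 2.0)]
labels = overlap_labels(segs, n_frames=100)
```

An OSD classifier such as the TCN described above would then be trained to predict these labels from acoustic (here, WavLM) features frame by frame.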
Related papers
- Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps [25.95711246919163]
Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes.
Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets.
Our findings reveal clear gender disparities, with the advantaged group varying across languages and models.
arXiv Detail & Related papers (2024-02-28T00:24:29Z)
- How To Build Competitive Multi-gender Speech Translation Models For Controlling Speaker Gender Translation [21.125217707038356]
When translating from notional gender languages into grammatical gender languages, the generated translation requires explicit gender assignments for various words, including those referring to the speaker.
To avoid such biased and non-inclusive behaviors, the gender assignment of speaker-related expressions should be guided by externally provided metadata about the speaker's gender.
This paper aims to achieve the same results by integrating the speaker's gender metadata into a single "multi-gender" neural ST model, which is easier to maintain.
arXiv Detail & Related papers (2023-10-23T17:21:32Z)
- No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition through Pitch Manipulation [20.731375136671605]
We propose a data augmentation technique that manipulates the fundamental frequency (f0) and formants.
This technique reduces the data unbalance among genders by simulating voices of the under-represented female speakers.
Experiments on spontaneous English speech show that our technique yields a relative WER improvement of up to 9.87% for utterances by female speakers.
arXiv Detail & Related papers (2023-10-10T12:55:22Z)
- The Gender-GAP Pipeline: A Gender-Aware Polyglot Pipeline for Gender Characterisation in 55 Languages [51.2321117760104]
This paper describes the Gender-GAP Pipeline, an automatic pipeline to characterize gender representation in large-scale datasets for 55 languages.
The pipeline uses a multilingual lexicon of gendered person-nouns to quantify the gender representation in text.
We showcase it to report gender representation in WMT training data and development data for the News task, confirming that current data is skewed towards masculine representation.
arXiv Detail & Related papers (2023-08-31T17:20:50Z)
- VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution [80.57383975987676]
VisoGender is a novel dataset for benchmarking gender bias in vision-language models.
We focus on occupation-related biases within a hegemonic system of binary gender, inspired by Winograd and Winogender schemas.
We benchmark several state-of-the-art vision-language models and find that they demonstrate bias in resolving binary gender in complex scenes.
arXiv Detail & Related papers (2023-06-21T17:59:51Z)
- AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation [55.1650189699753]
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another, and has demonstrated significant progress to date.
Current S2ST models still suffer from distinct degradation in noisy environments and fail to translate visual speech.
We present AV-TranSpeech, the first audio-visual speech-to-speech model without relying on intermediate text.
arXiv Detail & Related papers (2023-05-24T17:59:03Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that: (1) multi-lingual models with more data outperform monolingual ones, but, when keeping the amount of data fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems [25.468558523679363]
We train and compare gender-specific wav2vec 2.0 models against models containing different degrees of gender balance in pre-training data.
We observe lower overall performance using gender-specific pre-training before fine-tuning an end-to-end ASR system.
arXiv Detail & Related papers (2022-04-04T11:28:19Z)
- NeuraGen-A Low-Resource Neural Network based approach for Gender Classification [0.0]
We have used speech recordings collected from the ELSDSR and limited TIMIT datasets.
We extracted 8 speech features, which were pre-processed and then fed into NeuraGen to identify the gender.
NeuraGen achieves an accuracy of 90.7407% and an F1 score of 91.227% on the training set and under 20-fold cross-validation.
arXiv Detail & Related papers (2022-03-29T05:57:24Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
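Several of the listed works, such as the Gender-GAP Pipeline, quantify gender representation by matching tokens against a lexicon of gendered person-nouns. A toy sketch of that counting step, with a made-up English mini-lexicon (the actual pipeline covers 55 languages with a far larger multilingual lexicon):

```python
# Toy gendered-person-noun lexicon; these entries are illustrative only and
# not taken from the real Gender-GAP resource.
LEXICON = {
    "he": "masculine", "him": "masculine", "man": "masculine", "father": "masculine",
    "she": "feminine", "her": "feminine", "woman": "feminine", "mother": "feminine",
}

def gender_counts(text):
    """Count lexicon matches per gender class in a whitespace-tokenized text."""
    counts = {"masculine": 0, "feminine": 0}
    for token in text.lower().split():
        word = token.strip(".,;:!?")  # drop trailing punctuation
        if word in LEXICON:
            counts[LEXICON[word]] += 1
    return counts

counts = gender_counts("He told his mother that the man saw her.")
```

Dataset-level skew, such as the masculine bias reported for WMT training data, would then be a matter of aggregating these counts over the whole corpus.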
This list is automatically generated from the titles and abstracts of the papers in this site.