Cross-domain Voice Activity Detection with Self-Supervised
Representations
- URL: http://arxiv.org/abs/2209.11061v1
- Date: Thu, 22 Sep 2022 14:53:44 GMT
- Title: Cross-domain Voice Activity Detection with Self-Supervised
Representations
- Authors: Sina Alisamir, Fabien Ringeval, Francois Portet
- Abstract summary: Voice Activity Detection (VAD) aims at detecting speech segments in an audio signal.
Current state-of-the-art methods focus on training a neural network on features extracted directly from the acoustic signal.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
- Score: 9.02236667251654
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Voice Activity Detection (VAD) aims at detecting speech segments in an audio
signal, a necessary first step for many of today's speech-based applications.
Current state-of-the-art methods focus on training a neural network on
features extracted directly from the acoustic signal, such as Mel Filter
Banks (MFBs). Such methods therefore require an extra normalisation step to
adapt to a new domain where the acoustic conditions differ, which can happen
simply because of a change of speaker, microphone, or environment. Moreover,
this normalisation step is usually rudimentary and has clear limitations,
such as being highly sensitive to the amount of data available for the new
domain. Here, we exploit the crowd-sourced Common Voice (CV) corpus to show
that representations based on Self-Supervised Learning (SSL) adapt well to
different domains, because they are computed from contextualised
representations of speech across multiple domains. SSL representations also
achieve better results than systems based on hand-crafted features (MFBs) and
off-the-shelf VADs, with significant improvements in cross-domain settings.
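As a minimal illustration of the approach, the sketch below scores each frame of an utterance with a small VAD head on top of frozen SSL features. The wav2vec 2.0 backbone and the linear classifier are assumed stand-ins, not necessarily the paper's exact setup.

```python
# Sketch: frame-level VAD on top of a pretrained SSL encoder (assumption:
# wav2vec 2.0 base as the backbone; the paper's encoder/head may differ).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_encoder = bundle.get_model().eval()   # pretrained, kept frozen here
vad_head = torch.nn.Linear(768, 1)        # 768 = wav2vec 2.0 base hidden size

def vad_posteriors(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (batch, samples) at bundle.sample_rate (16 kHz)."""
    with torch.no_grad():
        layer_outputs, _ = ssl_encoder.extract_features(waveform)
    frames = layer_outputs[-1]              # (batch, frames, 768), last layer
    return torch.sigmoid(vad_head(frames))  # per-frame speech probability
```

In a full system the head (and optionally the encoder) would be trained with a binary cross-entropy loss against frame-level speech/non-speech labels.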
Related papers
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) that devotes the main training parameters to multiple cross-modal attention layers; a sketch of one such layer follows below.
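A hedged sketch of what an audio-guided fusion layer could look like, with audio frames as queries attending over visual (lip) features; the dimensions and residual wiring are illustrative assumptions, not the paper's exact CMFE.

```python
import torch

class AudioGuidedFusion(torch.nn.Module):
    """Audio queries attend over visual keys/values (illustrative only)."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = torch.nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio: (batch, T_audio, dim), visual: (batch, T_visual, dim)
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)  # residual keeps the audio stream dominant
```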
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality
Alignment [57.15449072423539]
We propose a training system, Open-modality Speech Recognition (OpenSR).
OpenSR enables modality transfer from one modality to any other in three different settings.
It achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
arXiv Detail & Related papers (2023-06-10T11:04:10Z) - Automatic Data Augmentation for Domain Adapted Fine-Tuning of
Self-Supervised Speech Representations [21.423349835589793]
Self-Supervised Learning (SSL) has allowed leveraging large amounts of unlabeled speech data to improve the performance of speech recognition models.
Despite this, speech SSL representations may fail when facing an acoustic mismatch between the pretraining and target datasets.
We propose a novel supervised domain adaptation method, designed for cases exhibiting such a mismatch in acoustic domains.
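The general recipe can be illustrated as below: make labelled source audio sound more like the target domain before fine-tuning the SSL model. The specific augmentation (mixing in target-domain noise at a chosen SNR) is an assumed example; the paper selects augmentations automatically.

```python
import torch

def augment_toward_target(wave: torch.Tensor, noise: torch.Tensor,
                          snr_db: float = 10.0) -> torch.Tensor:
    """Mix target-domain background noise into a source utterance.

    Assumes 1-D tensors, with noise at least as long as wave.
    """
    noise = noise[: wave.numel()]
    speech_power = wave.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-8)
    # Scale the noise so the mixture reaches the requested signal-to-noise ratio.
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise
```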
arXiv Detail & Related papers (2023-06-01T09:30:49Z) - Unsupervised Improvement of Audio-Text Cross-Modal Representations [19.960695758478153]
We study unsupervised approaches to improving the learning of such audio-text representations using unpaired text and audio.
We show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance.
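A minimal sketch of a soft-labeled contrastive loss, assuming soft targets over the in-batch audio-text similarity matrix; how those targets are derived is left abstract here.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor,
                          soft_targets: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # audio_emb, text_emb: (batch, dim), L2-normalised.
    # soft_targets: (batch, batch), rows summing to 1 (replaces one-hot labels).
    logits = audio_emb @ text_emb.t() / temperature
    loss_a2t = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    # Reusing the transpose for the text-to-audio direction is a simplification.
    loss_t2a = -(soft_targets.t() * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_a2t + loss_t2a)
```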
arXiv Detail & Related papers (2023-05-03T02:30:46Z) - Multi-source Domain Adaptation for Text-independent Forensic Speaker
Recognition [36.83842373791537]
Adapting speaker recognition systems to new environments is a widely used technique to improve a well-performing model.
Previous studies focus on single domain adaptation, which neglects a more practical scenario where training data are collected from multiple acoustic domains.
Three novel adaptation methods are proposed to further promote adaptation performance across multiple acoustic domains.
arXiv Detail & Related papers (2022-11-17T22:11:25Z) - Audio-Adaptive Activity Recognition Across Video Domains [112.46638682143065]
We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
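One plausible mechanism for such audio-driven adjustment, shown purely as an assumed illustration rather than the paper's exact encoder, is FiLM-style conditioning: an audio embedding predicts a scale and shift for the visual features.

```python
import torch

class AudioConditionedVisual(torch.nn.Module):
    """FiLM-style modulation of visual features by an audio embedding."""
    def __init__(self, audio_dim: int = 128, visual_dim: int = 512):
        super().__init__()
        self.to_scale_shift = torch.nn.Linear(audio_dim, 2 * visual_dim)

    def forward(self, visual_feat: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
        # visual_feat: (batch, visual_dim), audio_emb: (batch, audio_dim)
        scale, shift = self.to_scale_shift(audio_emb).chunk(2, dim=-1)
        return visual_feat * (1 + scale) + shift
```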
arXiv Detail & Related papers (2022-03-27T08:15:20Z) - Unsupervised Domain Adaptation in Speech Recognition using Phonetic
Features [6.872447420442981]
We propose a technique to perform unsupervised gender-based domain adaptation in speech recognition using phonetic features.
Experiments are performed on the TIMIT dataset and there is a considerable decrease in the phoneme error rate using the proposed approach.
arXiv Detail & Related papers (2021-08-04T06:22:12Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
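Dynamic stream weights amount to a per-frame, per-region convex combination of the two modalities' position posteriors; a minimal sketch follows, with the weight-prediction model left abstract.

```python
import torch

def fuse_posteriors(p_audio: torch.Tensor, p_video: torch.Tensor,
                    weights: torch.Tensor) -> torch.Tensor:
    # p_audio, p_video, weights: (frames, regions), with weights in [0, 1].
    fused = weights * p_audio + (1.0 - weights) * p_video
    return fused / fused.sum(dim=1, keepdim=True)  # renormalise over regions
```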
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - DEAAN: Disentangled Embedding and Adversarial Adaptation Network for
Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
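One standard way to realise the adversarial part of such a framework, shown here as an assumed sketch rather than DEAAN's full design, is a gradient reversal layer: a domain classifier is trained on the speaker embedding, while the reversed gradient pushes the embedding toward domain invariance.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x: torch.Tensor, lambd: float):
        ctx.lambd = lambd
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)

# usage: domain_logits = domain_classifier(grad_reverse(speaker_embedding))
```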
arXiv Detail & Related papers (2020-12-12T19:46:56Z) - Cross-domain Adaptation with Discrepancy Minimization for
Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network on the VoxCeleb data, then fine-tune part of the high-level network layers with clean speech from CRSS-Forensics.
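A minimal sketch of that fine-tuning recipe: freeze the lower layers of the pretrained network and update only the high-level ones on the new data. The layer-name prefixes below are hypothetical placeholders.

```python
import torch

def prepare_for_partial_finetune(model: torch.nn.Module,
                                 trainable_prefixes: tuple) -> list:
    """Freeze all parameters except those whose names match a given prefix."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    return [p for p in model.parameters() if p.requires_grad]

# e.g. keep only the last block and the embedding/classifier head trainable:
# params = prepare_for_partial_finetune(cnn, ("layer4", "embedding", "fc"))
# optimiser = torch.optim.Adam(params, lr=1e-4)
```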
arXiv Detail & Related papers (2020-09-05T02:54:33Z)