Spatial HuBERT: Self-supervised Spatial Speech Representation Learning
for a Single Talker from Multi-channel Audio
- URL: http://arxiv.org/abs/2310.10922v1
- Date: Tue, 17 Oct 2023 01:31:59 GMT
- Title: Spatial HuBERT: Self-supervised Spatial Speech Representation Learning
for a Single Talker from Multi-channel Audio
- Authors: Antoni Dimitriadis, Siqi Pan, Vidhyasaharan Sethu, Beena Ahmed
- Abstract summary: This paper presents Spatial HuBERT, a self-supervised speech representation model.
It learns both acoustic and spatial information pertaining to a single speaker in a potentially noisy environment.
It learns representations that outperform state-of-the-art single-channel speech representations on a variety of spatial downstream tasks.
- Score: 7.808211269929968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning has been used to leverage unlabelled data, improving
accuracy and generalisation of speech systems through the training of
representation models. While many recent works have sought to produce effective
representations across a variety of acoustic domains, languages, modalities and
even simultaneous speakers, these studies have all been limited to
single-channel audio recordings. This paper presents Spatial HuBERT, a
self-supervised speech representation model that learns both acoustic and
spatial information pertaining to a single speaker in a potentially noisy
environment by using multi-channel audio inputs. Spatial HuBERT learns
representations that outperform state-of-the-art single-channel speech
representations on a variety of spatial downstream tasks, particularly in
reverberant and noisy environments. We also demonstrate the utility of the
representations learned by Spatial HuBERT on a speech localisation downstream
task. Along with this paper, we publicly release a new dataset of 100 000
simulated first-order ambisonics room impulse responses.
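To make the data pipeline concrete, below is a minimal sketch (not the authors' released code) of how a mono utterance might be spatialised with one of the simulated first-order ambisonics (FOA) room impulse responses to produce the 4-channel input a model like Spatial HuBERT consumes. The stand-in signals, sample rate and array shapes are assumptions for illustration only.

```python
# Hypothetical sketch: spatialise a mono utterance with a 4-channel FOA RIR.
# The speech and RIR here are random stand-ins, not real data.
import numpy as np
from scipy.signal import fftconvolve


def spatialise(mono_speech: np.ndarray, foa_rir: np.ndarray) -> np.ndarray:
    """Convolve a mono signal with a (4, rir_len) FOA RIR (W, X, Y, Z channels)."""
    assert foa_rir.ndim == 2 and foa_rir.shape[0] == 4
    # One convolution per ambisonic channel -> shape (4, n_samples + rir_len - 1).
    return np.stack([fftconvolve(mono_speech, h) for h in foa_rir])


if __name__ == "__main__":
    sr = 16000
    speech = np.random.randn(2 * sr).astype(np.float32)   # stand-in for a 2 s utterance
    rir = np.random.randn(4, sr // 2).astype(np.float32)  # stand-in for a simulated FOA RIR
    multichannel = spatialise(speech, rir)
    print(multichannel.shape)  # (4, 39999)
```

The four FOA channels (W, X, Y, Z) jointly encode direction-of-arrival information, which is what allows a representation model trained on such inputs to support spatial downstream tasks such as localisation.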
Related papers
- SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic
Spaces [10.895310812568084]
We train a CLIP-based model with the aim of learning shared representations of phonetic and acoustic spaces.
Results show that the proposed model is sensitive to phonetic changes.
We provide empirical evidence showing that the resulting embeddings are useful for a variety of downstream applications.
arXiv Detail & Related papers (2023-07-23T22:18:47Z)
- Representation Learning With Hidden Unit Clustering For Low Resource
Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework categorizes the representations into a small number of phoneme-like units, which are used to train the model to learn semantically rich speech representations (a minimal sketch of this discrete-target idea follows after this list).
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio
Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger models on downstream tasks.
In probing experiments, we find that the latent representations encode richer information about both phonemes and speakers than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis
Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
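Several of the entries above (HuBERT-style pretraining in Spatial HuBERT and WavLM, and the HUC framework) share a common discrete-target idea: frame-level features are clustered offline, and the resulting cluster indices act as pseudo-labels for self-supervised prediction. The sketch below illustrates only that clustering step under assumed shapes; the random features stand in for real frame-level features such as MFCCs, and none of this is the authors' code.

```python
# Hypothetical sketch: derive discrete "phoneme-like" unit targets by k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Placeholder frame-level features, e.g. MFCCs: (n_frames, feature_dim).
features = rng.standard_normal((5000, 39)).astype(np.float32)

# Cluster the frames into a small inventory of units.
n_units = 100
kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)

# Each frame receives a discrete unit id; these ids serve as prediction
# targets (e.g. for masked prediction) during self-supervised training.
pseudo_labels = kmeans.predict(features)  # shape (5000,), values in [0, n_units)
print(np.bincount(pseudo_labels, minlength=n_units)[:10])  # unit usage counts
```

In HuBERT-style training, this clustering is typically re-run on intermediate model representations after a first training iteration, refining the unit inventory used as targets.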