Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge
- URL: http://arxiv.org/abs/2210.15759v1
- Date: Thu, 27 Oct 2022 20:32:41 GMT
- Title: Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge
- Authors: Ewan Dunbar, Nicolas Hamilakis and Emmanuel Dupoux
- Abstract summary: Self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio.
The contribution of the Zero Resource Speech Challenge series since 2015 has been to break this long-term objective down into four well-defined tasks.
We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent progress in self-supervised or unsupervised machine learning has
opened the possibility of building a full speech processing system from raw
audio without using any textual representations or expert labels such as
phonemes, dictionaries or parse trees. The contribution of the Zero Resource
Speech Challenge series since 2015 has been to break down this long-term
objective into four well-defined tasks -- Acoustic Unit Discovery, Spoken Term
Discovery, Discrete Resynthesis, and Spoken Language Modeling -- and introduce
associated metrics and benchmarks enabling model comparison and cumulative
progress. We present an overview of the six editions of this challenge series
since 2015, discuss the lessons learned, and outline the areas which need more
work or give puzzling results.
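As an illustration of the metrics the challenge series introduced, below is a minimal sketch of an ABX discriminability score, the standard evaluation for the Acoustic Unit Discovery task. This is an assumption-laden simplification: the official benchmark computes DTW-aligned frame-wise distances over triphone instances, whereas here each token is a single embedding vector, the distance is cosine, and the data are synthetic toys.

```python
# Simplified sketch of the ABX discriminability metric used in the
# ZeroSpeech challenges to evaluate Acoustic Unit Discovery.
# Assumption: each token is one fixed-length embedding (the real metric
# uses DTW over frame sequences of triphone instances).
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Distance in [0, 2]; smaller means more similar."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def abx_error(A: np.ndarray, B: np.ndarray, X: np.ndarray) -> float:
    """Fraction of (a, b, x) triples in which x, which belongs to the same
    category as A, is nevertheless closer to b than to a.
    0.0 = perfect discrimination, 0.5 = chance."""
    errors, total = 0, 0
    for a in A:
        for b in B:
            for x in X:
                total += 1
                if cosine_distance(x, b) < cosine_distance(x, a):
                    errors += 1
    return errors / total

# Toy example: two "phone categories" as Gaussian clusters in feature space.
rng = np.random.default_rng(0)
cat1 = rng.normal(scale=0.3, size=(10, 16)) + 1.0   # category of A and X
cat2 = rng.normal(scale=0.3, size=(10, 16)) - 1.0   # category of B
A, X = cat1[:5], cat1[5:]
B = cat2
print(f"ABX error: {abx_error(A, B, X):.3f}")  # near 0.0 when well separated
```

On well-separated categories the error approaches 0.0; representations that confuse the two phone categories drift toward the 0.5 chance level.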
Related papers
- Roadmap towards Superhuman Speech Understanding using Large Language Models [60.57947401837938]
Large language models (LLMs) are increasingly being extended to integrate speech and audio data.
Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs.
We propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models.
arXiv Detail & Related papers (2024-10-17T06:44:06Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas (a sketch of a contrastive objective follows this entry).
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
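The review above distinguishes generative, contrastive, and predictive families of methods. As a concrete illustration of the contrastive family, here is a minimal NumPy sketch of the InfoNCE objective behind Contrastive Predictive Coding (CPC); the shapes, the bilinear scorer, and the toy data are illustrative assumptions, not the review's own code.

```python
# Minimal NumPy sketch of the InfoNCE objective used by CPC, the canonical
# "contrastive" method. In a real model, z and c come from an encoder and
# an autoregressive network, and W_k is a learned per-step prediction matrix;
# here all tensors are random stand-ins.
import numpy as np

def info_nce(c_t: np.ndarray, z_pos: np.ndarray, z_negs: np.ndarray,
             W_k: np.ndarray) -> float:
    """InfoNCE loss for one (context, future-frame) pair.

    c_t:    (d_c,)   context vector at time t
    z_pos:  (d_z,)   true latent at time t+k (positive)
    z_negs: (n, d_z) latents drawn from other times/utterances (negatives)
    W_k:    (d_z, d_c) step-specific bilinear prediction matrix
    """
    pred = W_k @ c_t                                  # predicted future latent
    logits = np.concatenate(([z_pos @ pred], z_negs @ pred))
    logits -= logits.max()                            # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]                            # -log p(positive)

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
d_c, d_z, n_neg = 32, 16, 10
c_t = rng.normal(size=d_c)
z_pos = rng.normal(size=d_z)
z_negs = rng.normal(size=(n_neg, d_z))
W_k = rng.normal(size=(d_z, d_c)) * 0.1
print(f"InfoNCE loss: {info_nce(c_t, z_pos, z_negs, W_k):.3f}")
```

In a full CPC model this loss is summed over several prediction steps k and minimized jointly with the encoder; scoring the true future frame against sampled negatives is what makes the method "contrastive".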
- Automated Audio Captioning: an Overview of Recent Progress and New Challenges [56.98522404673527]
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips.
We present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
arXiv Detail & Related papers (2022-05-12T08:36:35Z)
- The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling [23.517751578968344]
We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels.
We present the results and analyses of a composite baseline made of self-supervised contrastive representation learning (CPC), clustering (k-means) and language modeling (LSTM or BERT).
This simple pipeline performs better than chance on all four metrics, demonstrating the feasibility of spoken language modeling from raw speech (a sketch of the pipeline follows this entry).
arXiv Detail & Related papers (2020-11-23T18:01:37Z)
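The composite baseline in the entry above chains three stages: self-supervised features (CPC), k-means quantization into discrete pseudo-units, and a language model over the resulting unit sequences. Below is a hedged, self-contained sketch of that pipeline shape; the random "features", the tiny k-means, and the bigram counter are stand-ins for the CPC encoder and the LSTM/BERT models actually used.

```python
# Hedged sketch of the ZeroSpeech 2021 baseline pipeline shape:
# (1) self-supervised features (CPC in the paper; random stand-ins here),
# (2) k-means quantization into discrete pseudo-units,
# (3) a language model over unit sequences (LSTM/BERT in the paper;
#     a bigram counter here). Illustrative scaffolding, not the released code.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# (1) Stand-in for CPC features: (n_frames, feature_dim) per utterance.
utterances = [rng.normal(size=(rng.integers(50, 100), 16)) for _ in range(20)]

# (2) k-means over all frames -> a codebook of K discrete units.
# (A tiny Lloyd's-algorithm loop; a real system would use a library
# implementation such as scikit-learn.)
def kmeans(frames: np.ndarray, k: int = 8, iters: int = 20) -> np.ndarray:
    centers = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        dists = ((frames[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = frames[assign == j].mean(0)
    return centers

codebook = kmeans(np.concatenate(utterances))

def quantize(feats: np.ndarray) -> np.ndarray:
    """Map each frame to the id of its nearest codebook unit."""
    d = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

# (3) Stand-in language model: bigram statistics over pseudo-unit streams.
bigrams = Counter()
for u in utterances:
    units = quantize(u)
    bigrams.update(zip(units[:-1], units[1:]))
print("Most common pseudo-unit bigrams:", bigrams.most_common(3))
```

The point of the sketch is the pipeline shape: once speech is discretized into pseudo-units, standard language-modeling machinery can be applied to the unit streams.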
- The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units [40.41406551797358]
The Zero Resource Speech Challenge 2020 aims at learning speech representations from raw audio signals without any labels.
We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.
arXiv Detail & Related papers (2020-10-12T18:56:48Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.