SCORE: Self-supervised Correspondence Fine-tuning for Improved Content
Representations
- URL: http://arxiv.org/abs/2403.06260v1
- Date: Sun, 10 Mar 2024 16:57:51 GMT
- Title: SCORE: Self-supervised Correspondence Fine-tuning for Improved Content
Representations
- Authors: Amit Meghanani and Thomas Hain
- Abstract summary: This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks.
SCORE outperforms vanilla HuBERT on the SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU for automatic speech recognition, phoneme recognition, and query-by-example tasks.
- Score: 23.56580783289533
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is a growing interest in cost-effective self-supervised fine-tuning
(SSFT) of self-supervised learning (SSL)-based speech models to obtain
task-specific representations. These task-specific representations are then
fine-tuned on labelled data to achieve robust performance on various downstream
tasks. This work presents a cost-effective SSFT method named Self-supervised
Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for
content-related tasks. The proposed method uses a correspondence training
strategy, aiming to learn similar representations from perturbed speech and
original speech. Commonly used data augmentation techniques for content-related
tasks (ASR) are applied to obtain perturbed speech. SCORE fine-tuned HuBERT
outperforms vanilla HuBERT on the SUPERB benchmark with only a few hours of
fine-tuning (< 5 hrs) on a single GPU for automatic speech recognition, phoneme
recognition, and query-by-example tasks, with relative improvements of 1.09%,
3.58%, and 12.65%, respectively. SCORE provides competitive results with the
recently proposed SSFT method SPIN, using only 1/3 of the processed speech
compared to SPIN.
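As an illustration of the correspondence idea, here is a minimal sketch, assuming a cosine-similarity objective between frame-level features of original and perturbed speech and using torchaudio's HuBERT pipeline; the paper's actual loss, augmentations, and training schedule may differ, and the additive-noise perturbation below is only a stand-in for the ASR-style augmentations it mentions.

```python
# Minimal sketch of correspondence fine-tuning (SCORE-style); the exact
# objective and augmentations are assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F
import torchaudio

def correspondence_loss(h_clean, h_aug):
    # h_clean, h_aug: (batch, frames, dim); pull paired frames together.
    return 1.0 - F.cosine_similarity(h_clean, h_aug, dim=-1).mean()

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().train()            # pre-trained HuBERT to adapt
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

waveform = torch.randn(1, 16000)              # stand-in for a real utterance
perturbed = waveform + 0.01 * torch.randn_like(waveform)  # toy perturbation

feats_clean, _ = model.extract_features(waveform)
feats_aug, _ = model.extract_features(perturbed)
loss = correspondence_loss(feats_clean[-1], feats_aug[-1])  # last layer
loss.backward()
optimizer.step()
```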
Related papers
- LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks [19.94790551312789]
A cost-effective self-supervised fine-tuning (SSFT) method named "LASER: Learning by Aligning Self-supervised Representations" is presented.
Experiments are conducted with HuBERT and WavLM models and evaluated on the SUPERB benchmark for two content-related tasks: automatic speech recognition (ASR) and phoneme recognition (PR).
Relative improvements of 3.7% and 8.2% for HuBERT, and 4.1% and 11.7% for WavLM, are observed for the ASR and PR tasks respectively, with only 3 hours of fine-tuning on a single GPU.
arXiv Detail & Related papers (2024-06-13T14:17:47Z)
- Towards Selection of Text-to-speech Data to Augment ASR Training [20.115236045164355]
We train a neural network to measure the similarity of synthetic data to real speech.
We find that incorporating synthetic samples with considerable dissimilarity to real speech is crucial for boosting recognition performance.
arXiv Detail & Related papers (2023-05-30T17:24:28Z)
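A hedged sketch of the similarity idea in the entry above: train a small classifier to separate real from synthetic speech embeddings, then read its real-class probability as a similarity score. The feature choice and architecture are illustrative assumptions, not the paper's setup.

```python
# Toy real-vs-synthetic scorer; features and architecture are assumptions.
import torch
import torch.nn as nn

class SimilarityScorer(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, feats):
        return self.net(feats).squeeze(-1)   # logit: higher => more real-like

scorer = SimilarityScorer()
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 768)       # embeddings of real utterances (stand-ins)
synth = torch.randn(32, 768)      # embeddings of TTS utterances (stand-ins)
x = torch.cat([real, synth])
y = torch.cat([torch.ones(32), torch.zeros(32)])

loss = bce(scorer(x), y)
loss.backward()
opt.step()

# Similarity score of a synthetic candidate to real speech.
score = torch.sigmoid(scorer(synth)).detach()
```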
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
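A minimal sketch of speaker-invariant clustering in the spirit of the Spin entry above, assuming a swapped-prediction objective over a learnable codebook; codebook size, temperature, and the (roughly length-preserving) speaker perturbation are assumptions.

```python
# Swapped cluster prediction between an utterance and its speaker-perturbed
# copy; hyperparameters and the perturbation itself are assumptions.
import torch
import torch.nn.functional as F

num_codes, dim, temp = 256, 768, 0.1
codebook = torch.nn.Parameter(torch.randn(num_codes, dim))

def assignments(h):
    # Soft cluster logits for frame features h: (frames, dim).
    logits = F.normalize(h, dim=-1) @ F.normalize(codebook, dim=-1).T
    return logits / temp

h_orig = torch.randn(100, dim)   # encoder output, original utterance (stand-in)
h_pert = torch.randn(100, dim)   # encoder output, speaker-perturbed copy (stand-in)

p_orig = assignments(h_orig)
p_pert = assignments(h_pert)

# Each view predicts the other's (detached) assignments, so what survives
# in the clusters is content rather than speaker identity.
loss = F.cross_entropy(p_orig, F.softmax(p_pert, dim=-1).detach()) \
     + F.cross_entropy(p_pert, F.softmax(p_orig, dim=-1).detach())
```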
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
Compared to lower-level tasks such as speech and speaker recognition, there are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
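A hedged sketch of the unified masked-token prediction objective mentioned above: hidden states at masked positions predict discrete token ids shared across modalities. The tokenizer, masking ratio, and dimensions here are assumptions.

```python
# Masked prediction of discrete "unified token" ids; shapes are illustrative.
import torch
import torch.nn as nn

vocab, dim, frames = 1000, 768, 200
head = nn.Linear(dim, vocab)                   # hidden states -> token logits

hidden = torch.randn(1, frames, dim)           # backbone output (stand-in)
tokens = torch.randint(0, vocab, (1, frames))  # unified token ids (stand-in)

mask = torch.rand(1, frames) < 0.15            # mask ~15% of positions
logits = head(hidden[mask])                    # predict only at masked frames
loss = nn.functional.cross_entropy(logits, tokens[mask])
```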
- Representative Subset Selection for Efficient Fine-Tuning in Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z)
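A hedged sketch of coverage-based subset selection in the spirit of COWERAGE: bucket utterances by an early-training word error rate and sample evenly across buckets, so easy and hard examples are both represented. Bucket count and budget are illustrative assumptions.

```python
# WER-coverage subset selection; binning scheme is an assumption.
import random

def coverage_subset(examples, wers, budget, n_buckets=10, seed=0):
    """examples: list of ids; wers: WER per example; budget: subset size."""
    rng = random.Random(seed)
    lo, hi = min(wers), max(wers)
    width = (hi - lo) / n_buckets or 1.0       # guard against hi == lo
    buckets = [[] for _ in range(n_buckets)]
    for ex, w in zip(examples, wers):
        idx = min(int((w - lo) / width), n_buckets - 1)
        buckets[idx].append(ex)
    subset, per_bucket = [], budget // n_buckets
    for b in buckets:                          # sample evenly across WER range
        rng.shuffle(b)
        subset.extend(b[:per_bucket])
    return subset

ids = list(range(1000))
wers = [random.random() for _ in ids]          # stand-in WERs
print(len(coverage_subset(ids, wers, budget=100)))
```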
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
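A minimal sketch of the overlapped-speech simulation behind WavLM's masked speech denoising: mix a secondary utterance into part of the primary one at a random energy, keeping the primary speaker's pseudo-labels as targets. The mixing span and SNR range here are assumptions.

```python
# Toy utterance mixing for overlapped-speech simulation; parameters assumed.
import torch

def mix_utterances(primary, secondary, snr_db_range=(0.0, 10.0)):
    """Mix `secondary` over a random span of `primary` (both 1-D waveforms)."""
    span = torch.randint(1, primary.numel() // 2, (1,)).item()
    start = torch.randint(0, primary.numel() - span, (1,)).item()
    seg = secondary[:span]
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    # Scale the interfering segment so the primary/secondary power ratio
    # matches the sampled SNR.
    gain = (primary[start:start + span].pow(2).mean()
            / (seg.pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
    mixed = primary.clone()
    mixed[start:start + span] += gain * seg
    return mixed

mixed = mix_utterances(torch.randn(16000), torch.randn(16000))
```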
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvement.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- An Effective Contextual Language Modeling Framework for Speech Summarization with Augmented Features [13.97006782398121]
The Bidirectional Encoder Representations from Transformers (BERT) model has achieved record-breaking success on many natural language processing tasks.
We explore incorporating confidence scores into sentence representations to alleviate the negative effects caused by imperfect automatic speech recognition.
We validate the effectiveness of our proposed method on a benchmark dataset.
arXiv Detail & Related papers (2020-06-01T18:27:48Z)
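A hedged sketch of one way to fold ASR confidence scores into a sentence representation, as the entry above describes: pool token embeddings with confidence weights so unreliable words contribute less. The paper's actual fusion mechanism may differ.

```python
# Confidence-weighted pooling of token embeddings; weighting scheme assumed.
import torch

def confidence_pooled(embeddings, conf):
    """embeddings: (tokens, dim), e.g. from BERT; conf: (tokens,) in [0, 1]."""
    weights = conf / conf.sum().clamp_min(1e-8)
    return (weights.unsqueeze(-1) * embeddings).sum(dim=0)

emb = torch.randn(12, 768)                   # BERT token embeddings (stand-ins)
conf = torch.rand(12)                        # ASR word confidences (stand-ins)
sentence_vec = confidence_pooled(emb, conf)  # (768,)
```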
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
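A minimal sketch of the APR setup above using a generic pre-trained seq2seq model (T5 as a stand-in; the paper compares several pre-trained models): the noisy ASR hypothesis is the input and the readable text is the target.

```python
# Fine-tuning a seq2seq model for ASR post-processing for readability (APR);
# T5 and the example pair are stand-ins, not the paper's exact setup.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

asr_output = "um so the meeting is uh moved to tuesday three pm"
readable = "The meeting is moved to Tuesday, 3 p.m."

inputs = tokenizer(asr_output, return_tensors="pt")
labels = tokenizer(readable, return_tensors="pt").input_ids

loss = model(**inputs, labels=labels).loss   # standard seq2seq cross-entropy
loss.backward()
optimizer.step()
```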