MASR: Multi-label Aware Speech Representation
- URL: http://arxiv.org/abs/2307.10982v2
- Date: Mon, 25 Sep 2023 12:49:00 GMT
- Title: MASR: Multi-label Aware Speech Representation
- Authors: Anjali Raj, Shikhar Bharadwaj, Sriram Ganapathy, Min Ma, Shikhar Vashishth
- Abstract summary: We propose MASR, a Multi-label Aware Speech Representation learning framework.
MASR enables the inclusion of multiple external knowledge sources to enhance the utilization of metadata information.
We show significant performance improvements for MASR over established benchmarks.
- Score: 36.2978180342839
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, speech representation learning has been framed
primarily as a self-supervised learning (SSL) task, using the raw audio signal
alone while ignoring the side-information that is often available for a given
speech recording. In this paper, we propose MASR, a Multi-label Aware Speech
Representation learning framework, which addresses this limitation. MASR
enables the inclusion of multiple external knowledge sources to enhance the
utilization of metadata information. The external knowledge sources are
incorporated in the form of sample-level pair-wise similarity matrices that are
used in a hard-mining loss. A key advantage of the MASR framework is that it
can be combined with any choice of SSL method. Using MASR representations, we
perform evaluations on several downstream tasks such as language
identification and speech recognition, as well as non-semantic tasks such as
speaker and emotion recognition. In these experiments, we demonstrate
significant performance improvements for MASR over established benchmarks. We
perform a detailed analysis of the language identification task to provide
insights into how the proposed loss function enables the representations to
separate closely related languages.
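To make the abstract's mechanism concrete, below is a minimal sketch of how a sample-level pair-wise similarity matrix derived from metadata labels (e.g., language IDs) could drive a hard-mining loss. All names are illustrative; the binary similarity and the hinge-style hard-mining formulation are assumptions for the sketch, not the paper's exact implementation, and graded similarities from richer knowledge sources would drop in the same way.

```python
import torch
import torch.nn.functional as F

def metadata_similarity(labels: torch.Tensor) -> torch.Tensor:
    """Sample-level pair-wise similarity from categorical metadata
    (e.g., language IDs): S[i, j] = 1 if samples i and j share the
    label, else 0. Binary here for simplicity; external knowledge
    sources could supply graded values instead."""
    return (labels.unsqueeze(0) == labels.unsqueeze(1)).float()

def hard_mining_loss(embeddings: torch.Tensor,
                     sim: torch.Tensor,
                     margin: float = 0.5) -> torch.Tensor:
    """Hinge-style hard-mining loss: pull each anchor's hardest
    (least similar) positive closer and push its hardest (most
    similar) negative past a margin. One common hard-mining
    formulation, not necessarily the paper's exact loss. Assumes
    every anchor has at least one positive and one negative in
    the batch."""
    z = F.normalize(embeddings, dim=-1)
    cos = z @ z.t()  # pairwise cosine similarities
    eye = torch.eye(len(sim), dtype=torch.bool)
    pos_mask = sim.bool() & ~eye
    neg_mask = ~sim.bool()
    hard_pos = cos.masked_fill(~pos_mask, float('inf')).min(dim=1).values
    hard_neg = cos.masked_fill(~neg_mask, float('-inf')).max(dim=1).values
    return F.relu(hard_neg - hard_pos + margin).mean()

# Example: a batch of 8 embeddings with language-ID metadata.
emb = torch.randn(8, 256, requires_grad=True)
lang = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
loss = hard_mining_loss(emb, metadata_similarity(lang))
loss.backward()
```

Because the loss consumes only embeddings and a similarity matrix, it can sit on top of any SSL backbone, consistent with the paper's claim that MASR combines with any choice of SSL method.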
Related papers
- Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data [30.966072545451183]
We propose a MultiLingual MultiTask (MLMT) model, integrating multilingual speech generation and recognition tasks within a single LLM.
We develop an effective data construction approach that splits and combines words from different languages to equip the model with code-switched (CS) synthesis ability without relying on CS data.
arXiv Detail & Related papers (2024-09-17T08:11:07Z)
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
- SSHR: Leveraging Self-supervised Hierarchical Representations for Multilingual Automatic Speech Recognition [9.853451215277346]
We propose a novel method that leverages self-supervised hierarchical representations (SSHR) to fine-tune the MMS model.
We evaluate SSHR on two multilingual datasets, Common Voice and ML-SUPERB, and the experimental results demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-09-29T02:35:36Z)
- Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework that combines self-supervised representation learning with language label information for the pre-training task.
This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet-based objective function to incorporate language labels along with the self-supervised loss function; a hedged sketch of such a triplet objective appears after this list.
arXiv Detail & Related papers (2023-06-07T12:14:16Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Unsupervised Automatic Speech Recognition: A Review [2.6212127510234797]
We review the research literature to identify models and ideas that could lead to fully unsupervised ASR.
The objective of the study is to identify the limitations of what can be learned from speech data alone and to understand the minimum requirements for speech recognition.
arXiv Detail & Related papers (2021-06-09T08:33:20Z)
- LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech [63.84741259993937]
Self-Supervised Learning (SSL) using huge unlabeled data has been successfully explored for image and natural language processing.
Recent works also investigated SSL from speech.
We propose LeBenchmark: a reproducible framework for assessing SSL from speech.
arXiv Detail & Related papers (2021-04-23T08:27:09Z)
- General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework [114.63823178097402]
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
arXiv Detail & Related papers (2021-02-03T08:13:21Z)
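As referenced in the LASR entry above, here is a minimal sketch of a triplet objective that injects language labels alongside an SSL loss. The function name, margin value, and weighting coefficient are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def language_triplet_loss(anchor: torch.Tensor,
                          positive: torch.Tensor,
                          negative: torch.Tensor,
                          margin: float = 0.3) -> torch.Tensor:
    """Triplet objective over L2-normalized embeddings: `positive`
    shares the anchor's language label, `negative` does not.
    Illustrative sketch; LASR's exact formulation may differ."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(negative, dim=-1)
    d_ap = (a - p).pow(2).sum(dim=-1)  # anchor-positive distance
    d_an = (a - n).pow(2).sum(dim=-1)  # anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

# Usage: embeddings from any SSL encoder, grouped by language label.
a, p, n = (torch.randn(4, 256) for _ in range(3))
# A hypothetical joint objective would be: ssl_loss + lam * triplet term.
loss = language_triplet_loss(a, p, n)
```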