Unsupervised Automatic Speech Recognition: A Review
- URL: http://arxiv.org/abs/2106.04897v1
- Date: Wed, 9 Jun 2021 08:33:20 GMT
- Title: Unsupervised Automatic Speech Recognition: A Review
- Authors: Hanan Aldarmaki, Asad Ullah, Nazar Zaki
- Abstract summary: We review the research literature to identify models and ideas that could lead to fully unsupervised ASR.
The objective of the study is to identify the limitations of what can be learned from speech data alone and to understand the minimum requirements for speech recognition.
- Score: 2.6212127510234797
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems can be trained to achieve
remarkable performance given large amounts of manually transcribed speech, but
large labeled data sets can be difficult or expensive to acquire for all
languages of interest. In this paper, we review the research literature to
identify models and ideas that could lead to fully unsupervised ASR, including
unsupervised segmentation of the speech signal, unsupervised mapping from
speech segments to text, and semi-supervised models with nominal amounts of
labeled examples. The objective of the study is to identify the limitations of
what can be learned from speech data alone and to understand the minimum
requirements for speech recognition. Identifying these limitations would help
optimize the resources and efforts in ASR development for low-resource
languages.
Related papers
- Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
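To make the DSU idea above concrete, here is a minimal sketch: continuous frame-level encoder outputs are clustered (e.g., with k-means) and each frame is replaced by its cluster index, giving a discrete unit sequence. The feature dimensions, cluster count, and de-duplication step are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: turning continuous speech-encoder frames into discrete speech units (DSU).
# Assumptions: frame features are already extracted; 100 clusters is an arbitrary choice.
import numpy as np
from sklearn.cluster import KMeans

def to_discrete_units(frames: np.ndarray, n_units: int = 100) -> list[int]:
    """frames: (num_frames, feature_dim) array of encoder outputs."""
    km = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(frames)
    unit_ids = km.predict(frames)
    # Collapse consecutive repeats so each unit marks a change in acoustic content.
    return [int(u) for i, u in enumerate(unit_ids) if i == 0 or u != unit_ids[i - 1]]

# Example with random features standing in for real encoder outputs.
units = to_discrete_units(np.random.rand(500, 256))
```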
arXiv Detail & Related papers (2024-06-13T17:28:13Z) - Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon when developing unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
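As a hedged illustration of the masked token-infilling objective mentioned above, applied to a sequence of discrete tokens (the masking rate and mask symbol are assumptions, not the paper's setup):

```python
# Sketch: randomly mask tokens in a sequence; a model is then trained to infill the masked positions.
import random

MASK = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # the model learns to predict these from context
        else:
            masked.append(tok)
    return masked, targets
```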
arXiv Detail & Related papers (2024-06-12T16:30:58Z) - Label Aware Speech Representation Learning For Language Identification [49.197215416945596]
We propose a novel framework of combining self-supervised representation learning with the language label information for the pre-training task.
This framework, termed Label Aware Speech Representation (LASR) learning, uses a triplet-based objective function to incorporate language labels alongside the self-supervised loss function.
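A rough sketch of combining a label-aware triplet objective with a self-supervised loss follows; the weighting and the form of the self-supervised term are assumptions, not the LASR paper's exact formulation.

```python
# Sketch: total loss = self-supervised loss + weighted triplet loss over language labels.
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Pull same-language embeddings together, push different-language ones apart."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def combined_loss(ssl_loss, anchor, positive, negative, alpha=0.1):
    """ssl_loss: scalar from the underlying self-supervised objective (e.g. contrastive or predictive)."""
    return ssl_loss + alpha * triplet_loss(anchor, positive, negative)
```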
arXiv Detail & Related papers (2023-06-07T12:14:16Z) - SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding
Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Unsupervised Pattern Discovery from Thematic Speech Archives Based on
Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
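A minimal sketch of the second stage (mining recurring patterns from decoded acoustic-unit sequences) might count frequent n-grams of unit IDs; the n-gram range and frequency threshold below are illustrative assumptions.

```python
# Sketch: find recurring unit-ID n-grams across utterances as candidate topic-related patterns.
from collections import Counter

def mine_patterns(unit_sequences, n_min=3, n_max=6, min_count=5):
    counts = Counter()
    for seq in unit_sequences:
        for n in range(n_min, n_max + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return [(pat, c) for pat, c in counts.most_common() if c >= min_count]
```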
arXiv Detail & Related papers (2020-11-03T20:06:48Z) - Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for
Low-Resource Languages [14.297371692669545]
This work addresses unsupervised acoustic modeling (UAM) for automatic speech recognition (ASR) in the zero-resource scenario.
The first problem concerns unsupervised discovery of basic (subword-level) speech units in a given language.
The second problem is referred to as unsupervised subword modeling.
arXiv Detail & Related papers (2020-07-29T19:45:17Z)