Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for
Low-Resource Languages
- URL: http://arxiv.org/abs/2007.15074v1
- Date: Wed, 29 Jul 2020 19:45:17 GMT
- Title: Exploiting Cross-Lingual Knowledge in Unsupervised Acoustic Modeling for
Low-Resource Languages
- Authors: Siyuan Feng
- Abstract summary: This thesis investigates unsupervised acoustic modeling (UAM) for automatic speech recognition (ASR) in the zero-resource scenario.
The first problem concerns unsupervised discovery of basic (subword level) speech units in a given language.
The second problem is referred to as unsupervised subword modeling.
- Score: 14.297371692669545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: (Short version of Abstract) This thesis describes an investigation on
unsupervised acoustic modeling (UAM) for automatic speech recognition (ASR) in
the zero-resource scenario, where only untranscribed speech data is assumed to
be available. UAM is not only important in addressing the general problem of
data scarcity in ASR technology development but also essential to many
non-mainstream applications, for example, language protection, language
acquisition and pathological speech assessment. The present study is focused on
two research problems. The first problem concerns unsupervised discovery of
basic (subword level) speech units in a given language. Under the zero-resource
condition, the speech units could be inferred only from the acoustic signals,
without requiring or involving any linguistic direction and/or constraints. The
second problem is referred to as unsupervised subword modeling. In essence,
a frame-level feature representation needs to be learned from untranscribed
speech. The learned feature representation is the basis of subword unit
discovery. It should be linguistically discriminative and robust to
non-linguistic factors. In particular, extensive use of cross-lingual knowledge
in subword unit discovery and modeling is a focus of this research.
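The first problem can be made concrete with a toy example. The sketch below is purely illustrative and is not the method of this thesis: it "discovers" pseudo subword units by clustering frame-level acoustic features with plain k-means, where the feature values and the number of units are assumptions.

```python
# Toy unit discovery: cluster frame-level features (e.g. MFCC frames) into
# k pseudo subword units; the per-frame cluster labels act as unit labels.
# Purely illustrative -- real systems use far richer features and models.

def dist2(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def farthest_first(frames, k):
    # Deterministic farthest-first initialization of k centroids.
    centroids = [frames[0]]
    while len(centroids) < k:
        nxt = max(frames, key=lambda f: min(dist2(f, c) for c in centroids))
        centroids.append(nxt)
    return centroids

def discover_units(frames, k, iters=10):
    # Plain k-means: assign frames to the nearest centroid, then
    # re-estimate each centroid as the mean of its assigned frames.
    centroids = farthest_first(frames, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(f, centroids[j]))
                  for f in frames]
        for j in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels

# Example: six 2-D frames forming two well-separated groups.
# discover_units([[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]], k=2)
# → [0, 0, 0, 1, 1, 1]
```

Each group of acoustically similar frames receives one shared pseudo unit label, which is the intuition behind unit discovery from acoustic signals alone.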
Related papers
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose the removal of reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Multilingual acoustic word embeddings for zero-resource languages [1.5229257192293204]
It specifically uses acoustic word embedding (AWE) -- fixed-dimensional representations of variable-duration speech segments.
The study introduces a new neural network that outperforms existing AWE models on zero-resource languages.
AWEs are applied to a keyword-spotting system for hate speech detection in Swahili radio broadcasts.
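The AWE idea above (a fixed-dimensional vector from a variable-duration speech segment) can be sketched with a mean-pooling baseline; the paper's actual models are learned neural networks, so this pooling function is only an illustrative assumption.

```python
# Toy acoustic word embedding: map a variable-length sequence of feature
# frames to a single fixed-dimensional vector by mean pooling. Real AWE
# models learn this mapping; mean pooling is a simple illustrative baseline.

def embed(frames):
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

# Segments of different durations map to vectors of the same dimension:
# embed([[1.0, 2.0], [3.0, 4.0]])              → [2.0, 3.0]
# embed([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  → [3.0, 4.0]
```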
arXiv Detail & Related papers (2024-01-19T08:02:37Z)
- Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
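De-duplication, one of the compression methods mentioned, can be sketched as collapsing consecutive repeats of the same discrete unit; the unit IDs below are made up for illustration.

```python
# Collapse consecutive repeated discrete speech units (run-length style),
# shortening the unit sequence without changing the order of distinct units.
from itertools import groupby

def deduplicate(units):
    return [u for u, _ in groupby(units)]

# Example with made-up unit IDs:
# deduplicate([3, 3, 3, 7, 7, 3, 5, 5]) → [3, 7, 3, 5]
```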
arXiv Detail & Related papers (2023-09-27T17:21:13Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
- Unsupervised Automatic Speech Recognition: A Review [2.6212127510234797]
We review the research literature to identify models and ideas that could lead to fully unsupervised ASR.
The objective of the study is to identify the limitations of what can be learned from speech data alone and to understand the minimum requirements for speech recognition.
arXiv Detail & Related papers (2021-06-09T08:33:20Z)
- Unsupervised Pattern Discovery from Thematic Speech Archives Based on Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
arXiv Detail & Related papers (2020-11-03T20:06:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.