Unsupervised Acoustic Unit Discovery by Leveraging a
Language-Independent Subword Discriminative Feature Representation
- URL: http://arxiv.org/abs/2104.00994v1
- Date: Fri, 2 Apr 2021 11:43:07 GMT
- Title: Unsupervised Acoustic Unit Discovery by Leveraging a
Language-Independent Subword Discriminative Feature Representation
- Authors: Siyuan Feng and Piotr Żelasko and Laureano Moro-Velázquez and
Odette Scharenborg
- Abstract summary: This paper tackles acoustic unit discovery (AUD), i.e. automatically discovering phone-like units from unlabeled speech data.
We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units.
- Score: 31.87235700253597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles acoustic unit discovery (AUD), i.e.
automatically discovering phone-like units from unlabeled speech data. Past studies usually proposed single-step
approaches. We propose a two-stage approach: the first stage learns a
subword-discriminative feature representation and the second stage applies
clustering to the learned representation and obtains phone-like clusters as the
discovered acoustic units. In the first stage, a recently proposed method in
the task of unsupervised subword modeling is improved by replacing a
monolingual out-of-domain (OOD) ASR system with a multilingual one to create a
subword-discriminative representation that is more language-independent. In the
second stage, segment-level k-means is adopted, and two methods to represent
the variable-length speech segments as fixed-dimension feature vectors are
compared. Experiments on a very low-resource Mboshi language corpus show that
our approach outperforms state-of-the-art AUD in both normalized mutual
information (NMI) and F-score. The multilingual ASR improved upon the
monolingual ASR in providing OOD phone labels and in estimating the phone
boundaries. A comparison of our systems with and without knowing the
ground-truth phone boundaries showed a 16% NMI performance gap, suggesting that
the current approach can significantly benefit from improved phone boundary
estimation.
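
Since the second stage is described procedurally, here is a minimal sketch of how it could look, assuming mean pooling as the fixed-dimension segment representation (one plausible choice; the paper compares two such methods) and scikit-learn for k-means and NMI. The data, shapes, and function names are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of the second stage: pool variable-length segment
# features to fixed-dimension vectors, run segment-level k-means, and
# evaluate the discovered units against reference phone labels with NMI.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

def pool_segment(frames: np.ndarray) -> np.ndarray:
    """Mean-pool a variable-length (T, D) segment to a fixed (D,) vector."""
    return frames.mean(axis=0)

def discover_units(segments: list, n_units: int = 50) -> np.ndarray:
    """Assign one cluster id (a discovered acoustic unit) per segment."""
    X = np.stack([pool_segment(s) for s in segments])
    return KMeans(n_clusters=n_units, n_init=10, random_state=0).fit_predict(X)

# Toy usage: 200 random "segments" of 5-30 frames of 40-dim features,
# scored against equally random reference phone labels.
rng = np.random.default_rng(0)
segments = [rng.normal(size=(rng.integers(5, 30), 40)) for _ in range(200)]
ref_phones = rng.integers(0, 30, size=200)
units = discover_units(segments)
print("NMI:", normalized_mutual_info_score(ref_phones, units))
```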
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic
Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z) - Representation Learning With Hidden Unit Clustering For Low Resource
Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework categorizes the representations into a small number of phoneme-like units and is used to train the model to learn semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z) - Learning Phone Recognition from Unpaired Audio and Phone Sequences Based
on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences.
In the second stage, an HMM is introduced and trained on the generator's output, which boosts performance.
arXiv Detail & Related papers (2022-07-29T09:29:28Z) - A Multi-level Supervised Contrastive Learning Framework for Low-Resource
Natural Language Inference [54.678516076366506]
Natural Language Inference (NLI) is an increasingly essential task in natural language understanding.
Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference.
arXiv Detail & Related papers (2022-05-31T05:54:18Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - A Hierarchical Model for Spoken Language Recognition [29.948719321162883]
Spoken language recognition (SLR) refers to the automatic process used to determine the language present in a speech sample.
We propose a novel hierarchical approach where two PLDA models are trained: one to generate scores for clusters of highly related languages and a second one to generate scores conditional to each cluster.
We show that this hierarchical approach consistently outperforms the non-hierarchical one for detection of highly related languages.
arXiv Detail & Related papers (2022-01-04T22:10:36Z) - The effectiveness of unsupervised subword modeling with autoregressive
and cross-lingual phone-aware networks [36.24509775775634]
We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer.
Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive or superior to state-of-the-art studies.
arXiv Detail & Related papers (2020-12-17T12:33:49Z) - Acoustics Based Intent Recognition Using Discovered Phonetic Units for
Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic and Romance, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Streaming Language Identification using Combination of Acoustic
Representations and ASR Hypotheses [13.976935216584298]
A common approach to solve multilingual speech recognition is to run multiple monolingual ASR systems in parallel.
We propose an approach that learns and combines acoustic level representations with embeddings estimated on ASR hypotheses.
To reduce the processing cost and latency, we exploit a streaming architecture to identify the spoken language early.
arXiv Detail & Related papers (2020-06-01T04:08:55Z) - Phoneme Boundary Detection using Learnable Segmental Features [31.203969460341817]
Phoneme boundary detection plays an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
arXiv Detail & Related papers (2020-02-11T14:03:08Z)