The effectiveness of unsupervised subword modeling with autoregressive
and cross-lingual phone-aware networks
- URL: http://arxiv.org/abs/2012.09544v2
- Date: Wed, 28 Apr 2021 09:50:15 GMT
- Title: The effectiveness of unsupervised subword modeling with autoregressive
and cross-lingual phone-aware networks
- Authors: Siyuan Feng, Odette Scharenborg
- Abstract summary: We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer.
Experiments on the ABX subword discriminability task conducted with the Libri-light and ZeroSpeech 2017 databases showed that our approach is competitive with or superior to state-of-the-art studies.
- Score: 36.24509775775634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study addresses unsupervised subword modeling, i.e., learning acoustic
feature representations that can distinguish between subword units of a
language. We propose a two-stage learning framework that combines
self-supervised learning and cross-lingual knowledge transfer. The framework
consists of autoregressive predictive coding (APC) as the front-end and a
cross-lingual deep neural network (DNN) as the back-end. Experiments on the ABX
subword discriminability task conducted with the Libri-light and ZeroSpeech
2017 databases showed that our approach is competitive with or superior to
state-of-the-art studies. Comprehensive and systematic analyses at the phoneme
and articulatory feature (AF) levels showed that our approach captured
diphthong information better than monophthong vowel information, and that the
amount of information captured differed across consonant types. Moreover, a
positive correlation was found between the effectiveness
of the back-end in capturing a phoneme's information and the quality of the
cross-lingual phone labels assigned to the phoneme. The AF-level analysis
together with t-SNE visualization results showed that the proposed approach is
better than MFCC and APC features in capturing manner and place of articulation,
and vowel height and backness information. Taken together, the
analyses showed that the two stages in our approach are both effective in
capturing phoneme and AF information. Nevertheless, monophthong vowel
information is less well captured than consonant information, which suggests
that future research should focus on improving the capture of monophthong vowel
information.
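
The two-stage framework described in the abstract pairs an autoregressive predictive coding (APC) front-end, pre-trained on unlabelled target-language speech to predict future frames, with a DNN back-end trained on phone labels transferred from a resource-rich language. Below is a minimal PyTorch-style sketch of that idea; the layer sizes, prediction shift, and number of cross-lingual phone classes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class APCFrontEnd(nn.Module):
    """Stage 1: predict the acoustic frame `shift` steps ahead from past frames."""
    def __init__(self, feat_dim=80, hidden_dim=512, num_layers=3, shift=3):
        super().__init__()
        self.shift = shift
        self.rnn = nn.GRU(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, frames):                      # frames: (B, T, feat_dim)
        hidden, _ = self.rnn(frames)
        return self.proj(hidden), hidden

    def loss(self, frames):
        pred, _ = self.forward(frames[:, :-self.shift])
        target = frames[:, self.shift:]             # future frames, shifted by `shift`
        return nn.functional.l1_loss(pred, target)  # APC uses an L1 regression loss

class CrossLingualBackEnd(nn.Module):
    """Stage 2: DNN trained on APC features with cross-lingual phone labels;
    a hidden (bottleneck) layer serves as the subword-discriminative representation."""
    def __init__(self, in_dim=512, bottleneck=40, num_phones=100):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, bottleneck), nn.ReLU(),
        )
        self.classifier = nn.Linear(bottleneck, num_phones)

    def forward(self, apc_features):
        bottleneck = self.encoder(apc_features)
        return self.classifier(bottleneck), bottleneck  # bottleneck = evaluated feature
```

The ABX subword discriminability task used for evaluation asks, for triplets (A, B, X) where A and X are tokens of the same subword category and B is of a different one, whether the learned representation places X closer to A than to B. The sketch below mean-pools each token to a single vector for brevity; the actual ZeroSpeech and Libri-light evaluations align frame sequences (e.g., with DTW) before comparing distances.

```python
import numpy as np

def abx_error_rate(triplets):
    """triplets: iterable of (a, b, x) arrays of shape (T, D);
    a and x share a subword category, b does not."""
    errors, total = 0, 0
    for a, b, x in triplets:
        a_vec, b_vec, x_vec = a.mean(0), b.mean(0), x.mean(0)
        d_ax = np.linalg.norm(x_vec - a_vec)
        d_bx = np.linalg.norm(x_vec - b_vec)
        errors += int(d_ax >= d_bx)   # wrong if X is not closer to its own category
        total += 1
    return errors / total
```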
Related papers
- Braille-to-Speech Generator: Audio Generation Based on Joint Fine-Tuning of CLIP and Fastspeech2 [0.0]
An image-to-speech framework, CLIP-KNN-Fastspeech2, based on the Chinese context was constructed.
The framework integrates multiple basic models and adopts the strategy of independent pre-training and joint fine-tuning.
Experimental results on multiple public datasets show that the model improves objective metrics such as BLEU4, FAD (Fréchet Audio Distance), and WER (Word Error Rate), as well as inference speed.
arXiv Detail & Related papers (2024-07-19T11:18:44Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- Unsupervised Multimodal Word Discovery based on Double Articulation Analysis with Co-occurrence cues [7.332652485849632]
Human infants acquire their verbal lexicon with minimal prior knowledge of language.
This study proposes a novel fully unsupervised learning method for discovering speech units.
The proposed method can acquire words and phonemes from speech signals using unsupervised learning.
arXiv Detail & Related papers (2022-01-18T07:31:59Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning [4.327558819000435]
We propose a novel joint textual-phonetic pre-training approach for learning spoken language representations.
Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models.
arXiv Detail & Related papers (2021-04-21T05:19:13Z)
- Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation [31.87235700253597]
This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data.
We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units.
arXiv Detail & Related papers (2021-04-02T11:43:07Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.