Phoneme Boundary Detection using Learnable Segmental Features
- URL: http://arxiv.org/abs/2002.04992v2
- Date: Sun, 16 Feb 2020 07:26:42 GMT
- Title: Phoneme Boundary Detection using Learnable Segmental Features
- Authors: Felix Kreuk, Yaniv Sheena, Joseph Keshet, and Yossi Adi
- Abstract summary: Phoneme boundary detection is an essential first step for a variety of speech processing applications.
We propose a neural architecture coupled with a parameterized structured loss function to learn segmental representations for the task of phoneme boundary detection.
- Score: 31.203969460341817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Phoneme boundary detection is an essential first step for a variety of
speech processing applications such as speaker diarization, speech science,
keyword spotting, etc. In this work, we propose a neural architecture coupled
with a parameterized structured loss function to learn segmental
representations for the task of phoneme boundary detection. First, we evaluate
our model when the spoken phonemes are not given as input. Results on the
TIMIT and Buckeye corpora suggest that the proposed model is superior to the
baseline models and reaches state-of-the-art performance in terms of F1 and
R-value. We further explore the use of phonetic transcription as additional
supervision and show this yields minor improvements in performance but
substantially better convergence rates. We additionally evaluate the model on a
Hebrew corpus and demonstrate that such phonetic supervision can be beneficial in a
multi-lingual setting.
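F1 alone can reward over-segmentation, which is why the paper also reports R-value (Räsänen et al., 2009). Below is a minimal sketch of how both metrics are typically computed for boundary detection; the greedy matching and the 20 ms tolerance window are common conventions assumed here, not details taken from the abstract.

```python
import numpy as np

def boundary_metrics(ref, hyp, tol=0.02):
    """Precision, recall, F1, and R-value for boundary detection.

    ref, hyp: boundary times in seconds; tol: matching tolerance
    (20 ms is the customary window for TIMIT/Buckeye, an assumption here).
    """
    ref, hyp = np.asarray(ref, float), np.asarray(hyp, float)
    used = np.zeros(len(ref), dtype=bool)
    hits = 0
    for b in hyp:                      # greedy one-to-one matching
        if not len(ref):
            break
        d = np.where(used, np.inf, np.abs(ref - b))
        i = int(np.argmin(d))
        if d[i] <= tol:
            used[i] = True
            hits += 1
    precision = hits / max(len(hyp), 1)
    recall = hits / max(len(ref), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    # R-value additionally penalizes over-segmentation: os_ measures
    # surplus predicted boundaries relative to the reference.
    os_ = recall / max(precision, 1e-8) - 1
    r1 = np.sqrt((1 - recall) ** 2 + os_ ** 2)
    r2 = (-os_ + recall - 1) / np.sqrt(2)
    return precision, recall, f1, 1 - (abs(r1) + abs(r2)) / 2
```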
Related papers
- Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling [39.80957479349776]
We investigate the prosody modeling capabilities of the discrete space of an RVQ-VAE model, modifying it to operate at the phoneme level.
We show that the phoneme-level discrete latent representations achieve a high degree of disentanglement, capturing fine-grained prosodic information that is robust and transferable.
arXiv Detail & Related papers (2024-09-13T09:27:05Z)
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between training a segmentation model that predicts the boundaries of the segmental structures in speech signals and training a phoneme prediction model, whose input is the speech features segmented by the segmentation model, to predict a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
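This alternation can be summarized as a short schematic loop. The sketch below only illustrates the training schedule described in the summary; every callable name is a hypothetical placeholder, not the authors' API.

```python
def reborn_training(utterances, train_segmenter, segment, train_predictor,
                    n_rounds=3):
    """Schematic REBORN-style alternation (placeholder callables throughout)."""
    segmenter, predictor = None, None
    for _ in range(n_rounds):
        # (1) Train the segmentation model; its boundary decisions are scored
        #     by a reward derived from the current phoneme predictor.
        segmenter = train_segmenter(utterances, reward_model=predictor)
        # (2) Re-segment the corpus and retrain the phoneme predictor on
        #     features pooled within the new segment boundaries.
        segments = [segment(u, segmenter) for u in utterances]
        predictor = train_predictor(segments)
    return segmenter, predictor
```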
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
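One common way to realize such a gate is a sigmoid over the concatenated streams. The sketch below is a generic gated-fusion layer illustrating the idea only; the dimensions and layout are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Let a learned sigmoid gate decide how much text evidence to admit."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio, text: (batch, time, dim), aligned to a shared time axis
        g = torch.sigmoid(self.gate(torch.cat([audio, text], dim=-1)))
        return audio + g * text  # g -> 0 suppresses irrelevant text

fused = GatedFusion(dim=256)(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
```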
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- Self-Supervised Learning for Speech Recognition with Intermediate Layer Supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
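Mechanically, this amounts to applying the same SSL objective to a set of intermediate layers and summing the losses. A minimal sketch follows; the layer choice and equal weighting are assumptions for illustration.

```python
def ils_ssl_loss(layer_outputs, targets, ssl_loss, layers=(4, 8, 12)):
    # layer_outputs: list of hidden states, one per encoder layer;
    # ssl_loss: the base self-supervised objective (e.g., masked prediction).
    final = ssl_loss(layer_outputs[-1], targets)
    intermediate = sum(ssl_loss(layer_outputs[i], targets) for i in layers)
    return final + intermediate
```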
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
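A generative LM over such discrete units can be as simple as an embedding, an LSTM, and a softmax over the unit inventory. The sketch below shows that shape only; all sizes are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    """Next-unit LSTM LM over a discrete inventory (phonemes or syllables)."""
    def __init__(self, n_units: int, dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(n_units, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, n_units)

    def forward(self, units: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(self.emb(units))
        return self.head(h)  # logits for the next unit at each step

logits = UnitLM(n_units=50)(torch.randint(0, 50, (2, 30)))
```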
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation [31.87235700253597]
This paper tackles acoustic unit discovery (AUD): automatically finding phone-like units in unlabeled speech data.
We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units.
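The second stage is plain clustering of the learned features. A sketch follows, with k-means standing in for whatever clustering the paper actually uses; the feature matrix and the number of clusters are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

feats = np.random.randn(10000, 64).astype(np.float32)  # stage-1 features (placeholder)
units = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(feats)
# units[i] is the discovered phone-like acoustic unit for frame i.
```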
arXiv Detail & Related papers (2021-04-02T11:43:07Z)
- Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition [41.92991390542083]
We present a simple, novel and competitive approach for phoneme-based neural transducer modeling.
A phonetic context size of one is shown to be sufficient for the best performance.
The overall performance of our best model is comparable to state-of-the-art (SOTA) results for the TED-LIUM Release 2 and Switchboard corpora.
arXiv Detail & Related papers (2020-10-30T16:53:29Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
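The fragment extraction can be pictured as cross-attention: source-content features query the target speaker's frames, and the output is re-assembled from target fragments. The sketch below shows that shape only; the sizes and the single attention layer are assumptions, not FragmentVC's actual architecture.

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
src_content = torch.randn(1, 120, 256)  # e.g., projected Wav2Vec 2.0 features
tgt_frames = torch.randn(1, 300, 256)   # target-speaker utterance features
fused, weights = attn(src_content, tgt_frames, tgt_frames)
# weights shows which target fragments each source frame draws from.
```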
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Catplayinginthesnow: Impact of Prior Segmentation on a Model of Visually Grounded Speech [24.187382590960254]
Children do not build their lexicon by segmenting spoken input into phonemes and then building up words from them.
This suggests that the ideal way of learning a language is by starting from full semantic units.
We present a simple way to introduce such information into an RNN-based model and investigate which type of boundary is the most efficient.
arXiv Detail & Related papers (2020-06-15T13:20:13Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance under controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)