Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition
- URL: http://arxiv.org/abs/2104.09106v1
- Date: Mon, 19 Apr 2021 07:54:15 GMT
- Title: Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition
- Authors: Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney
- Abstract summary: Subword units are commonly used for end-to-end automatic speech recognition (ASR).
We propose an acoustic data-driven subword modeling approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline.
- Score: 46.675712485821805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subword units are commonly used for end-to-end automatic speech recognition
(ASR), while a fully acoustic-oriented subword modeling approach is somewhat
missing. We propose an acoustic data-driven subword modeling (ADSM) approach
that adapts the advantages of several text-based and acoustic-based subword
methods into one pipeline. With a fully acoustic-oriented label design and
learning process, ADSM produces acoustic-structured subword units and
acoustic-matched target sequence for further ASR training. The obtained ADSM
labels are evaluated with different end-to-end ASR approaches including CTC,
RNN-transducer and attention models. Experiments on the LibriSpeech corpus show
that ADSM clearly outperforms both byte pair encoding (BPE) and
pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis
shows that ADSM achieves acoustically more logical word segmentation and more
balanced sequence length, and thus, is suitable for both time-synchronous and
label-synchronous models. We also briefly describe how to apply acoustic-based
subword regularization and unseen text segmentation using ADSM.
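For readers unfamiliar with the text-based baselines, the sketch below shows how BPE subword segmentation and text-based subword regularization (the counterparts of ADSM's acoustic-based variants) are commonly implemented with the SentencePiece library. This is a minimal illustration under stated assumptions, not the paper's method: the transcript file name and hyperparameters are invented, and ADSM's acoustic-driven label learning is not reproduced here.
```python
# Minimal sketch of the text-based BPE baseline and text-based subword
# regularization, using SentencePiece. ADSM's acoustic-based pipeline is
# NOT implemented here; file name and hyperparameters are assumptions.
import sentencepiece as spm

# Train a BPE subword model on (hypothetical) LibriSpeech transcripts.
spm.SentencePieceTrainer.train(
    input="librispeech_transcripts.txt",  # assumed transcript file
    model_prefix="bpe_5k",
    vocab_size=5000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="bpe_5k.model")

# Deterministic segmentation: one fixed subword sequence per sentence.
print(sp.encode("speech recognition with subword units", out_type=str))

# Text-based subword regularization (BPE-dropout style sampling): each call
# may return a different segmentation, used as data augmentation during
# ASR training. ADSM instead derives its segmentations from the acoustics.
for _ in range(3):
    print(sp.encode("speech recognition with subword units",
                    out_type=str, enable_sampling=True, alpha=0.1))
```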
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR [8.832255053182283]
We present a novel approach to predict the user's intent (whether the user is speaking to the device or not) directly from acoustic and textual information encoded at subword tokens.
We show that our approach is highly accurate, correctly rejecting 93.3% of unintended user audio, preventing it from invoking the smart assistant, at a 99% true positive rate.
arXiv Detail & Related papers (2022-10-21T17:45:00Z)
- Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility, as measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
- Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR [77.82653227783447]
We propose an extension of graph temporal classification (GTC) that models the posteriors of both labels and label transitions with a neural network.
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
arXiv Detail & Related papers (2022-03-01T05:02:02Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Analyzing autoencoder-based acoustic word embeddings [37.78342106714364]
Acoustic word embeddings (AWEs) are representations of words that encode their acoustic features.
We analyze basic properties of AWE spaces learned by a sequence-to-sequence encoder-decoder model in six typologically diverse languages.
AWEs exhibit a word onset bias, similar to patterns reported in various studies on human speech processing and lexical access.
arXiv Detail & Related papers (2020-04-03T16:11:57Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)