End-to-End Speech Recognition and Disfluency Removal with Acoustic
Language Model Pretraining
- URL: http://arxiv.org/abs/2309.04516v1
- Date: Fri, 8 Sep 2023 17:12:14 GMT
- Authors: Saksham Bassi, Giulio Duregon, Siddhartha Jalagam, David Roth
- Abstract summary: We revisit the performance comparison between two-stage and end-to-end models.
We find that audio-based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The SOTA in transcription of disfluent and conversational speech has in
recent years favored two-stage models, with separate transcription and cleaning
stages. We believe that previous attempts at end-to-end disfluency removal have
fallen short because of the representational advantage that large-scale
language model pretraining has given to lexical models. Until recently, the
high dimensionality and limited availability of large audio datasets inhibited
the development of large-scale self-supervised pretraining objectives for
learning effective audio representations, giving a relative advantage to the
two-stage approach, which utilizes pretrained representations for lexical
tokens. In light of recent successes in large-scale audio pretraining, we
revisit the performance comparison between two-stage and end-to-end models and
find that audio-based language models pretrained using weak self-supervised
objectives match or exceed the performance of similarly trained two-stage
models, and further, that the choice of pretraining objective substantially
affects a model's ability to be adapted to the disfluency removal task.
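To make the two-stage pipeline concrete, its second (cleaning) stage can be caricatured as a rule-based filter over an ASR transcript. This is a toy sketch for illustration only: the systems the abstract discusses use learned cleaning models, and the function name and filler list below are hypothetical.

```python
# Toy second-stage "cleaner" for a two-stage pipeline: the first stage is
# assumed to have produced a token list, and this stage strips common fillers
# and immediate word repetitions (two frequent disfluency types).
FILLERS = {"uh", "um", "erm", "hmm"}

def clean_transcript(tokens):
    out = []
    for tok in tokens:
        if tok.lower() in FILLERS:
            continue  # drop filled pauses ("um", "uh")
        if out and tok.lower() == out[-1].lower():
            continue  # drop immediate repetitions ("I I went" -> "I went")
        out.append(tok)
    return out
```

An end-to-end model instead maps audio directly to the cleaned token sequence, which is why it needs pretrained audio representations to compete with the lexical pretraining available to the cleaning stage above.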
Related papers
- Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis [33.909582975045545]
We propose a phonetic enhanced language modeling method to improve the performance of TTS models.
We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model.
arXiv Detail & Related papers (2024-06-04T06:43:34Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state-of-the-art performance by a wide margin both when training separate models for each language, and with a single model which processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
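The noise-stability idea in this summary can be sketched as a regularizer that compares a layer's output on clean versus noise-perturbed inputs. This is a minimal NumPy sketch under one reading of the summary; `lnsr_penalty` and its arguments are hypothetical names, not the authors' API.

```python
import numpy as np

def lnsr_penalty(layer_fn, hidden, sigma=0.1, rng=None):
    """Penalize a layer's sensitivity to Gaussian noise injected into its input.

    layer_fn: callable mapping a hidden representation to the next representation.
    hidden:   clean hidden representation, shape (batch, dim).
    sigma:    standard deviation of the injected Gaussian noise.
    """
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(0.0, sigma, size=hidden.shape)
    clean_out = layer_fn(hidden)
    noisy_out = layer_fn(hidden + noise)
    # Mean squared distance between clean and perturbed outputs:
    # a small value means the layer is stable under input noise.
    return float(np.mean((clean_out - noisy_out) ** 2))
```

During fine-tuning, a penalty like this would be added to the task loss for each regularized layer, encouraging hidden representations that do not drift under small perturbations.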
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations [20.239063010740853]
We present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language.
We observe significant improvements across various tasks, such as, emotion classification, sentiment analysis, and speaker verification.
arXiv Detail & Related papers (2021-09-01T04:18:19Z)
- Streaming end-to-end speech recognition with jointly trained neural feature enhancement [20.86554979122057]
We present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers.
We introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL).
arXiv Detail & Related papers (2021-05-04T02:25:41Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.