Multimodal Semi-supervised Learning Framework for Punctuation Prediction
in Conversational Speech
- URL: http://arxiv.org/abs/2008.00702v1
- Date: Mon, 3 Aug 2020 08:13:09 GMT
- Title: Multimodal Semi-supervised Learning Framework for Punctuation Prediction
in Conversational Speech
- Authors: Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati,
Katrin Kirchhoff
- Abstract summary: We explore a multimodal semi-supervised learning approach for punctuation prediction.
We learn representations from large amounts of unlabelled audio and text data.
When trained on 1 hour of speech and text data, the proposed model achieved a 9-18% absolute improvement over the baseline model.
- Score: 17.602098162338137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we explore a multimodal semi-supervised learning approach for
punctuation prediction by learning representations from large amounts of
unlabelled audio and text data. Conventional approaches in speech processing
typically use forced alignment to encode per-frame acoustic features into word-level
features and perform multimodal fusion of the resulting acoustic and
lexical representations. As an alternative, we explore attention based
multimodal fusion and compare its performance with forced alignment based
fusion. Experiments conducted on the Fisher corpus show that our proposed
approach achieves ~6-9% and ~3-4% absolute improvement (F1 score) over the
baseline BLSTM model on reference transcripts and ASR outputs respectively. We
further improve the model robustness to ASR errors by performing data
augmentation with N-best lists which achieves up to an additional ~2-6%
improvement on ASR outputs. We also demonstrate the effectiveness of the
semi-supervised learning approach through an ablation study on various corpus
sizes. When trained on 1 hour of speech and text data, the proposed model
achieved a ~9-18% absolute improvement over the baseline model.
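The attention-based fusion the abstract contrasts with forced alignment can be sketched as cross-attention: each word-level lexical embedding queries all acoustic frames, so no explicit frame-to-word alignment is required. The following is a minimal NumPy sketch, not the authors' BLSTM-based implementation; the dimensions, random (untrained) projections, and concatenation step are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(lexical, acoustic, d_k=None):
    """Cross-attention fusion of word-level lexical embeddings with
    frame-level acoustic features (toy sketch, untrained projections).

    lexical:  (num_words, d_lex)  word-level lexical embeddings
    acoustic: (num_frames, d_ac)  per-frame acoustic features
    returns:  (num_words, d_lex + d_ac) fused representations
    """
    d_k = acoustic.shape[-1] if d_k is None else d_k
    # A real model would learn these projection matrices.
    rng = np.random.default_rng(0)
    w_q = rng.standard_normal((lexical.shape[-1], d_k)) / np.sqrt(lexical.shape[-1])
    w_k = rng.standard_normal((acoustic.shape[-1], d_k)) / np.sqrt(acoustic.shape[-1])
    q = lexical @ w_q                    # (num_words, d_k)
    k = acoustic @ w_k                   # (num_frames, d_k)
    scores = q @ k.T / np.sqrt(d_k)      # (num_words, num_frames)
    weights = softmax(scores, axis=-1)   # each word attends over all frames
    context = weights @ acoustic         # word-aligned acoustic summary
    return np.concatenate([lexical, context], axis=-1)

# 5 words attending over 40 acoustic frames.
fused = attention_fusion(np.ones((5, 16)), np.ones((40, 8)))
print(fused.shape)  # (5, 24)
```

A punctuation classifier would then operate on the fused word-level vectors; the per-word attention weights replace the hard frame-to-word mapping that forced alignment would otherwise provide.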
Related papers
- On the N-gram Approximation of Pre-trained Language Models [17.764803904135903]
Large pre-trained language models (PLMs) have shown remarkable performance across various natural language understanding (NLU) tasks.
This study investigates the potential use of PLMs for language modelling in Automatic Speech Recognition (ASR).
We compare the application of large-scale text sampling and probability conversion for approximating GPT-2 into an n-gram model.
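The text-sampling route described above can be illustrated with a toy sketch: sample sentences from the large LM, then estimate an n-gram model from the samples by counting. The function below is a hypothetical stand-in (a tiny hand-written corpus replaces actual GPT-2 samples, and no smoothing or backoff is applied).

```python
from collections import Counter, defaultdict

def ngram_model_from_samples(samples, n=2):
    """Count n-grams over sampled text and convert counts to
    conditional probabilities (toy approximation of an LM)."""
    counts = defaultdict(Counter)
    for sent in samples:
        # Pad with start symbols and close with an end symbol.
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    # Normalize counts into P(word | context).
    return {ctx: {w: c / sum(ws.values()) for w, c in ws.items()}
            for ctx, ws in counts.items()}

# Two hand-written "samples" stand in for GPT-2 output.
model = ngram_model_from_samples(["the cat sat", "the dog sat"], n=2)
print(model[("the",)])  # {'cat': 0.5, 'dog': 0.5}
```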
arXiv Detail & Related papers (2023-06-12T06:42:08Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvements.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Improving Deliberation by Text-Only and Semi-Supervised Training [42.942428288428836]
We propose incorporating text-only and semi-supervised training into an attention-based deliberation model.
We achieve 4%-12% WER reduction for various tasks compared to the baseline deliberation.
We show that the deliberation model also achieves a positive human side-by-side evaluation.
arXiv Detail & Related papers (2022-06-29T15:30:44Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-pseudo-label pairs and synthesized-audio-text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- Representative Subset Selection for Efficient Fine-Tuning in Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.