Multimodal Semi-supervised Learning Framework for Punctuation Prediction
in Conversational Speech
- URL: http://arxiv.org/abs/2008.00702v1
- Date: Mon, 3 Aug 2020 08:13:09 GMT
- Title: Multimodal Semi-supervised Learning Framework for Punctuation Prediction
in Conversational Speech
- Authors: Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati,
Katrin Kirchhoff
- Abstract summary: We explore a multimodal semi-supervised learning approach for punctuation prediction.
We learn representations from large amounts of unlabelled audio and text data.
When trained on 1 hour of speech and text data, the proposed model achieved a 9-18% absolute improvement over the baseline model.
- Score: 17.602098162338137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we explore a multimodal semi-supervised learning approach for
punctuation prediction by learning representations from large amounts of
unlabelled audio and text data. Conventional approaches in speech processing
typically use forced alignment to encode per-frame acoustic features into
word-level features and perform multimodal fusion of the resulting acoustic and
lexical representations. As an alternative, we explore attention-based
multimodal fusion and compare its performance with forced-alignment-based
fusion. Experiments conducted on the Fisher corpus show that our proposed
approach achieves ~6-9% and ~3-4% absolute improvement (F1 score) over the
baseline BLSTM model on reference transcripts and ASR outputs, respectively. We
further improve the model's robustness to ASR errors by performing data
augmentation with N-best lists, which achieves up to an additional ~2-6%
improvement on ASR outputs. We also demonstrate the effectiveness of the
semi-supervised learning approach by performing an ablation study on various
sizes of the corpus. When trained on 1 hour of speech and text data, the
proposed model achieved a ~9-18% absolute improvement over the baseline model.
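To make the attention-based fusion described in the abstract concrete, here is a minimal PyTorch sketch of the general idea: word-level lexical embeddings attend over frame-level acoustic features, so the model learns a soft alignment rather than relying on forced alignment. All module names, dimensions, and the four-class punctuation inventory are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of attention-based multimodal fusion for punctuation
# prediction. Assumes pre-computed word-level lexical embeddings and
# frame-level acoustic features; dimensions are illustrative.
import torch
import torch.nn as nn

class AttentionFusionPunctuator(nn.Module):
    def __init__(self, lex_dim=256, ac_dim=128, n_heads=4, n_classes=4):
        super().__init__()
        # Project acoustic frames into the lexical space so both
        # modalities share one dimension for attention.
        self.ac_proj = nn.Linear(ac_dim, lex_dim)
        # Each word embedding queries the acoustic frame sequence,
        # replacing a hard forced alignment with soft attention weights.
        self.cross_attn = nn.MultiheadAttention(lex_dim, n_heads, batch_first=True)
        # Classify punctuation (e.g. none / comma / period / question mark)
        # from the fused lexical+acoustic representation of each word.
        self.classifier = nn.Linear(2 * lex_dim, n_classes)

    def forward(self, word_emb, frame_feats):
        # word_emb:    (batch, n_words, lex_dim)  lexical token representations
        # frame_feats: (batch, n_frames, ac_dim)  per-frame acoustic features
        ac = self.ac_proj(frame_feats)
        # Soft alignment: every word attends over all acoustic frames.
        attended, _ = self.cross_attn(query=word_emb, key=ac, value=ac)
        fused = torch.cat([word_emb, attended], dim=-1)
        return self.classifier(fused)  # (batch, n_words, n_classes)

# Example usage with random tensors standing in for real features.
model = AttentionFusionPunctuator()
words = torch.randn(2, 20, 256)    # 20 words per utterance
frames = torch.randn(2, 500, 128)  # 500 acoustic frames per utterance
logits = model(words, frames)      # -> (2, 20, 4)
```

The design choice mirrored here is exactly what the abstract contrasts: forced alignment commits to one hard word-to-frame mapping up front, whereas attention weights let the model learn that mapping jointly with the punctuation objective.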
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs).
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO).
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
- On the N-gram Approximation of Pre-trained Language Models [17.764803904135903]
Large pre-trained language models (PLMs) have shown remarkable performance across various natural language understanding (NLU) tasks.
This study investigates the potential usage of PLMs for language modelling in Automatic Speech Recognition (ASR).
We compare two approximation routes, large-scale text sampling and probability conversion, for distilling GPT-2 into an n-gram model (a sketch of the sampling route follows this entry).
arXiv Detail & Related papers (2023-06-12T06:42:08Z)
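As a rough illustration of the sampling route mentioned above, the sketch below draws a small synthetic corpus from GPT-2 (via the Hugging Face transformers library) and estimates a trigram model from it by maximum likelihood. The seed prompt, sample counts, and trigram order are toy assumptions; the paper's exact recipe and its probability-conversion alternative are not reproduced here.

```python
# Sampling-based n-gram approximation of a pre-trained LM:
# sample text from GPT-2, then count n-grams in the samples.
from collections import Counter
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# 1) Sample from the PLM to build a synthetic corpus.
#    "The" is an arbitrary seed prompt; real runs sample far more text.
samples = model.generate(
    input_ids=tokenizer("The", return_tensors="pt").input_ids,
    do_sample=True, max_length=64, num_return_sequences=8,
    pad_token_id=tokenizer.eos_token_id,
)
corpus = [tokenizer.decode(s, skip_special_tokens=True).split() for s in samples]

# 2) Maximum-likelihood trigram estimates from the sampled corpus.
trigram_counts, bigram_counts = Counter(), Counter()
for words in corpus:
    for i in range(len(words) - 2):
        trigram_counts[tuple(words[i:i + 3])] += 1
        bigram_counts[tuple(words[i:i + 2])] += 1

def trigram_prob(w1, w2, w3):
    """P(w3 | w1, w2) estimated from the GPT-2 samples (unsmoothed)."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
```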
- Improving Deliberation by Text-Only and Semi-Supervised Training [42.942428288428836]
We propose incorporating text-only and semi-supervised training into an attention-based deliberation model.
We achieve a 4%-12% WER reduction for various tasks compared to the baseline deliberation model.
We show that the deliberation model also achieves a positive human side-by-side evaluation.
arXiv Detail & Related papers (2022-06-29T15:30:44Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech/pseudo-label pairs and synthesized-audio/text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach and brings 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- Representative Subset Selection for Efficient Fine-Tuning in Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning; a generic sketch of this objective follows this entry.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
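For readers unfamiliar with the momentum contrastive objective this entry builds on, below is a generic MoCo-style sketch in PyTorch, assuming L2-normalized embeddings and a queue of negatives from earlier batches; the paper's prototypical extension and speaker-verification specifics are omitted, and all names are illustrative.

```python
# Generic momentum contrastive learning: a slowly-updated key encoder
# provides stable targets for an InfoNCE loss over queued negatives.
import torch
import torch.nn.functional as F

def momentum_update(query_enc, key_enc, m=0.999):
    # Key encoder tracks an exponential moving average of the query encoder.
    for q_param, k_param in zip(query_enc.parameters(), key_enc.parameters()):
        k_param.data = m * k_param.data + (1.0 - m) * q_param.data

def info_nce_loss(q, k, queue, temperature=0.07):
    # q, k: (batch, dim) L2-normalized embeddings of two views of the same
    # utterance; queue: (queue_size, dim) embeddings serving as negatives.
    pos = torch.einsum("nd,nd->n", q, k).unsqueeze(1)  # (batch, 1)
    neg = torch.einsum("nd,kd->nk", q, queue)          # (batch, queue_size)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)
```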
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained on small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.