Multimodal Semi-supervised Learning Framework for Punctuation Prediction
in Conversational Speech
- URL: http://arxiv.org/abs/2008.00702v1
- Date: Mon, 3 Aug 2020 08:13:09 GMT
- Title: Multimodal Semi-supervised Learning Framework for Punctuation Prediction
in Conversational Speech
- Authors: Monica Sunkara, Srikanth Ronanki, Dhanush Bekal, Sravan Bodapati,
Katrin Kirchhoff
- Abstract summary: We explore a multimodal semi-supervised learning approach for punctuation prediction.
We learn representations from large amounts of unlabelled audio and text data.
When trained on 1 hour of speech and text data, the proposed model achieved a 9-18% absolute improvement over the baseline model.
- Score: 17.602098162338137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we explore a multimodal semi-supervised learning approach for
punctuation prediction by learning representations from large amounts of
unlabelled audio and text data. Conventional approaches in speech processing
typically use forced alignment to encode per-frame acoustic features into word-level
features and perform multimodal fusion of the resulting acoustic and
lexical representations. As an alternative, we explore attention based
multimodal fusion and compare its performance with forced alignment based
fusion. Experiments conducted on the Fisher corpus show that our proposed
approach achieves ~6-9% and ~3-4% absolute improvement (F1 score) over the
baseline BLSTM model on reference transcripts and ASR outputs respectively. We
further improve the model robustness to ASR errors by performing data
augmentation with N-best lists which achieves up to an additional ~2-6%
improvement on ASR outputs. We also demonstrate the effectiveness of the
semi-supervised learning approach through an ablation study on various corpus
sizes. When trained on 1 hour of speech and text data, the proposed model
achieved a ~9-18% absolute improvement over the baseline model.
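The attention-based fusion the abstract contrasts with forced alignment can be sketched as cross-attention: each word-level lexical embedding queries all acoustic frames, so no explicit frame-to-word alignment is required. The following is a minimal NumPy sketch, not the authors' BLSTM-based implementation; the dimensions, random (untrained) projections, and concatenation step are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(lexical, acoustic, d_k=None):
    """Cross-attention fusion of word-level lexical embeddings with
    frame-level acoustic features (toy sketch, untrained projections).

    lexical:  (num_words, d_lex)  word-level lexical embeddings
    acoustic: (num_frames, d_ac)  per-frame acoustic features
    returns:  (num_words, d_lex + d_ac) fused representations
    """
    d_k = acoustic.shape[-1] if d_k is None else d_k
    # A real model would learn these projection matrices.
    rng = np.random.default_rng(0)
    w_q = rng.standard_normal((lexical.shape[-1], d_k)) / np.sqrt(lexical.shape[-1])
    w_k = rng.standard_normal((acoustic.shape[-1], d_k)) / np.sqrt(acoustic.shape[-1])
    q = lexical @ w_q                    # (num_words, d_k)
    k = acoustic @ w_k                   # (num_frames, d_k)
    scores = q @ k.T / np.sqrt(d_k)      # (num_words, num_frames)
    weights = softmax(scores, axis=-1)   # each word attends over all frames
    context = weights @ acoustic         # word-aligned acoustic summary
    return np.concatenate([lexical, context], axis=-1)

# 5 words attending over 40 acoustic frames.
fused = attention_fusion(np.ones((5, 16)), np.ones((40, 8)))
print(fused.shape)  # (5, 24)
```

A punctuation classifier would then operate on the fused word-level vectors; the per-word attention weights replace the hard frame-to-word mapping that forced alignment would otherwise provide.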
Related papers
- On the N-gram Approximation of Pre-trained Language Models [17.764803904135903]
Large pre-trained language models (PLMs) have shown remarkable performance across various natural language understanding (NLU) tasks.
This study investigates the potential use of PLMs for language modelling in Automatic Speech Recognition (ASR).
We compare the application of large-scale text sampling and probability conversion for approximating GPT-2 into an n-gram model.
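The text-sampling route described above can be illustrated with a toy sketch: sample sentences from the large LM, then estimate an n-gram model from the samples by counting. The function below is a hypothetical stand-in (a tiny hand-written corpus replaces actual GPT-2 samples, and no smoothing or backoff is applied).

```python
from collections import Counter, defaultdict

def ngram_model_from_samples(samples, n=2):
    """Count n-grams over sampled text and convert counts to
    conditional probabilities (toy approximation of an LM)."""
    counts = defaultdict(Counter)
    for sent in samples:
        # Pad with start symbols and close with an end symbol.
        tokens = ["<s>"] * (n - 1) + sent.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            counts[context][tokens[i + n - 1]] += 1
    # Normalize counts into P(word | context).
    return {ctx: {w: c / sum(ws.values()) for w, c in ws.items()}
            for ctx, ws in counts.items()}

# Two hand-written "samples" stand in for GPT-2 output.
model = ngram_model_from_samples(["the cat sat", "the dog sat"], n=2)
print(model[("the",)])  # {'cat': 0.5, 'dog': 0.5}
```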
arXiv Detail & Related papers (2023-06-12T06:42:08Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvements.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Improving Deliberation by Text-Only and Semi-Supervised Training [42.942428288428836]
We propose incorporating text-only and semi-supervised training into an attention-based deliberation model.
We achieve 4%-12% WER reduction for various tasks compared to the baseline deliberation.
We show that the deliberation model also achieves a positive human side-by-side evaluation.
arXiv Detail & Related papers (2022-06-29T15:30:44Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-pseudo-label pairs and synthesized-audio-text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- A Comparative Study on Speaker-attributed Automatic Speech Recognition in Multi-party Meetings [53.120885867427305]
Three approaches are evaluated for speaker-attributed automatic speech recognition (SA-ASR) in a meeting scenario.
The WD-SOT approach achieves a 10.7% relative reduction in averaged speaker-dependent character error rate (SD-CER).
The TS-ASR approach also outperforms the FD-SOT approach, bringing a 16.5% relative average SD-CER reduction.
arXiv Detail & Related papers (2022-03-31T06:39:14Z)
- Representative Subset Selection for Efficient Fine-Tuning in Self-Supervised Speech Recognition [6.450618373898492]
We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
arXiv Detail & Related papers (2022-03-18T10:12:24Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.