Improving Punctuation Restoration for Speech Transcripts via External Data
- URL: http://arxiv.org/abs/2110.00560v1
- Date: Fri, 1 Oct 2021 17:40:55 GMT
- Title: Improving Punctuation Restoration for Speech Transcripts via External Data
- Authors: Xue-Yong Fu, Cheng Chen, Md Tahmid Rahman Laskar, Shashi Bhushan TN, Simon Corston-Oliver
- Abstract summary: We tackle the punctuation restoration problem specifically for noisy text.
We introduce a data sampling technique based on an n-gram language model to sample additional training data.
The proposed approach outperforms the baseline with an improvement of 1.12% in F1 score.
- Score: 1.4335946386597276
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic Speech Recognition (ASR) systems generally do not produce
punctuated transcripts. To make transcripts more readable and follow the
expected input format for downstream language models, it is necessary to add
punctuation marks. In this paper, we tackle the punctuation restoration problem
specifically for noisy text (e.g., phone conversation scenarios). To
leverage the available written text datasets, we introduce a data sampling
technique based on an n-gram language model to sample more training data that
are similar to our in-domain data. Moreover, we propose a two-stage fine-tuning
approach that utilizes the sampled external data as well as our in-domain
dataset for models based on BERT. Extensive experiments show that the proposed
approach outperforms the baseline with an improvement of 1.12% in F1 score.
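The abstract describes scoring external written-text sentences with an n-gram language model to select those most similar to the in-domain data. The paper's exact model and selection criterion are not reproduced here; below is a minimal sketch, assuming (hypothetically) that per-token perplexity under an add-one-smoothed bigram model trained on in-domain transcripts serves as the similarity score, with the lowest-perplexity external sentences kept for the first fine-tuning stage.

```python
import math
from collections import Counter

def train_bigram_lm(sentences):
    """Collect unigram and bigram counts from in-domain sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams

def perplexity(sentence, unigrams, bigrams):
    """Per-token perplexity under the add-one-smoothed bigram model."""
    toks = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(toks, toks[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(toks) - 1))

def sample_external(in_domain, external, k):
    """Keep the k external sentences most similar to the in-domain data,
    i.e. those with the lowest perplexity under the in-domain LM."""
    uni, bi = train_bigram_lm(in_domain)
    return sorted(external, key=lambda s: perplexity(s, uni, bi))[:k]
```

In a two-stage setup along the lines the abstract describes, the sentences returned by `sample_external` would feed a first fine-tuning pass of the BERT-based model, followed by a second pass on the in-domain punctuation data alone.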
Related papers
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a data augmentation framework based on deepfake audio.
An English-language dataset recorded by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Using External Off-Policy Speech-To-Text Mappings in Contextual End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods.
In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR.
Experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
arXiv Detail & Related papers (2023-01-06T22:32:50Z)
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose to employ a novel bidirectional attention mechanism (BiAM) to jointly learn both ASR encoder (bottom layers) and text encoder with a multi-modal learning method.
BiAM facilitates feature sampling-rate exchange, allowing the quality of features transformed from one modality to be measured in the other modality's space.
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data learning alone, and 9.23% WERR when additional unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
- Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo-label-based semi-supervised training strategy that applies a language model within an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR [10.261890123213622]
We propose an on-the-fly data augmentation method for automatic speech recognition (ASR).
Our method, called Aligned Data Augmentation (ADA) for ASR, replaces transcribed tokens and the speech representations in an aligned manner to generate training pairs.
arXiv Detail & Related papers (2021-04-03T13:00:00Z)
- Neural Data-to-Text Generation with LM-based Text Augmentation [27.822282190362856]
We show that a weakly supervised training paradigm is able to outperform fully supervised seq2seq models with less than 10% of the annotations.
By utilizing all annotated data, our model can boost the performance of a standard seq2seq model by over 5 BLEU points.
arXiv Detail & Related papers (2021-02-06T10:21:48Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content on request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- Leverage Unlabeled Data for Abstractive Speech Summarization with Self-Supervised Learning and Back-Summarization [6.465251961564605]
Supervised approaches for Neural Abstractive Summarization require large annotated corpora that are costly to build.
We present a French meeting summarization task where reports are predicted based on the automatic transcription of the meeting audio recordings.
We report large improvements compared to the previous baseline for both approaches on two evaluation sets.
arXiv Detail & Related papers (2020-07-30T08:22:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.