Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous
Speech Recognition
- URL: http://arxiv.org/abs/2210.07771v1
- Date: Fri, 14 Oct 2022 13:01:00 GMT
- Title: Learning to Jointly Transcribe and Subtitle for End-to-End Spontaneous
Speech Recognition
- Authors: Jakob Poncelet, Hugo Van hamme
- Abstract summary: We propose a dual-decoder Transformer model that jointly performs ASR and automatic subtitling.
The model is trained to perform both tasks jointly, and is able to effectively use subtitle data.
- Score: 15.07442641083822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: TV subtitles are a rich source of transcriptions of many types of speech,
ranging from read speech in news reports to conversational and spontaneous
speech in talk shows and soaps. However, subtitles are not verbatim (i.e.
exact) transcriptions of speech, so they cannot be used directly to improve an
Automatic Speech Recognition (ASR) model. We propose a multitask dual-decoder
Transformer model that jointly performs ASR and automatic subtitling. The ASR
decoder (possibly pre-trained) predicts the verbatim output and the subtitle
decoder generates a subtitle, while sharing the encoder. The two decoders can
be independent or connected. The model is trained to perform both tasks
jointly, and is able to effectively use subtitle data. We show improvements on
regular ASR and on spontaneous and conversational ASR by incorporating the
additional subtitle decoder. The method does not require preprocessing
(aligning, filtering, pseudo-labeling, ...) of the subtitles.
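As a rough illustration of the architecture the abstract describes (shared encoder, separate ASR and subtitle decoders, joint training), here is a minimal PyTorch sketch. All dimensions, layer counts, the shared token embedding, and the 0.5/0.5 loss weighting are illustrative assumptions, not the paper's actual configuration, and the two decoders are kept independent:

```python
import torch
import torch.nn as nn

def causal_mask(sz: int) -> torch.Tensor:
    # Upper-triangular -inf mask so each decoder position attends only to earlier tokens.
    return torch.triu(torch.full((sz, sz), float("-inf")), diagonal=1)

class DualDecoderModel(nn.Module):
    """Shared speech encoder feeding two Transformer decoders:
    one predicts the verbatim (ASR) tokens, the other the subtitle tokens.
    Sizes are toy values chosen for this sketch."""

    def __init__(self, feat_dim=80, d_model=256, vocab=5000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=6)
        self.embed = nn.Embedding(vocab, d_model)  # shared token embedding (assumption)
        self.asr_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.sub_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.asr_out = nn.Linear(d_model, vocab)
        self.sub_out = nn.Linear(d_model, vocab)

    def forward(self, feats, asr_in, sub_in):
        memory = self.encoder(self.proj(feats))  # shared encoder states
        asr_h = self.asr_decoder(self.embed(asr_in), memory,
                                 tgt_mask=causal_mask(asr_in.size(1)))
        sub_h = self.sub_decoder(self.embed(sub_in), memory,
                                 tgt_mask=causal_mask(sub_in.size(1)))
        return self.asr_out(asr_h), self.sub_out(sub_h)

# Joint training step: weighted sum of the two cross-entropy losses.
model = DualDecoderModel()
feats = torch.randn(2, 200, 80)            # dummy filterbank features
asr_tgt = torch.randint(0, 5000, (2, 30))  # dummy verbatim token ids
sub_tgt = torch.randint(0, 5000, (2, 25))  # dummy subtitle token ids
asr_logits, sub_logits = model(feats, asr_tgt[:, :-1], sub_tgt[:, :-1])
ce = nn.CrossEntropyLoss()
loss = 0.5 * ce(asr_logits.reshape(-1, 5000), asr_tgt[:, 1:].reshape(-1)) \
     + 0.5 * ce(sub_logits.reshape(-1, 5000), sub_tgt[:, 1:].reshape(-1))
loss.backward()
```

Because both decoders attend to the same encoder memory, subtitle-only data can still update the shared encoder, which is the mechanism the abstract credits for the ASR gains.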
Related papers
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [77.02631712558251]
We propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
Our evaluation shows that the resulting captions significantly improve performance across many different benchmark datasets for text-video retrieval.
arXiv Detail & Related papers (2023-10-07T19:32:55Z) - TokenSplit: Using Discrete Speech Representations for Direct, Refined,
and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
arXiv Detail & Related papers (2023-08-21T01:52:01Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - Direct Speech Translation for Automatic Subtitling [17.095483965591267]
We propose the first direct ST model for automatic subtitling that generates subtitles in the target language along with their timestamps with a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system under the same data conditions.
arXiv Detail & Related papers (2022-09-27T06:47:42Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training; a toy sketch of the pseudo-language induction step appears after this list.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z) - Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not.
arXiv Detail & Related papers (2021-05-06T17:59:36Z)
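As referenced in the Wav2Seq entry above, the following is a minimal Python sketch of how a pseudo language can be induced from speech features: cluster frame-level features into discrete units, collapse repeats, then shorten the sequence with a BPE-style merge. It assumes scikit-learn for k-means and uses random vectors as stand-ins for real self-supervised features (e.g. HuBERT hidden states); the cluster and merge counts are arbitrary, and this is not the authors' exact pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

# 1) Discretize frame-level speech features into cluster ids.
#    Random vectors here stand in for real self-supervised features.
feats = np.random.randn(5000, 64)  # (frames, feature_dim), placeholder
units = KMeans(n_clusters=25, n_init=4, random_state=0).fit_predict(feats).tolist()

# 2) Collapse runs of identical units (repeated frames -> one token).
dedup = [units[0]] + [u for prev, u in zip(units, units[1:]) if u != prev]

# 3) Toy BPE-style pass: repeatedly merge the most frequent adjacent pair
#    into a new token id, yielding a shorter "pseudo subword" sequence.
def bpe_merge(seq, num_merges=10):
    next_id = max(seq) + 1
    for _ in range(num_merges):
        counts = {}
        for pair in zip(seq, seq[1:]):
            counts[pair] = counts.get(pair, 0) + 1
        if not counts:
            break
        a, b = max(counts, key=counts.get)
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq, next_id = merged, next_id + 1
    return seq

pseudo_tokens = bpe_merge(dedup)
print(f"{len(dedup)} units -> {len(pseudo_tokens)} pseudo subword tokens")
```

The resulting pseudo tokens can then serve as decoder targets for a self-supervised "pseudo speech recognition" pre-training task, as the entry describes.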
This list is automatically generated from the titles and abstracts of the papers on this site.