CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions
- URL: http://arxiv.org/abs/2408.16589v1
- Date: Thu, 29 Aug 2024 14:52:42 GMT
- Title: CrisperWhisper: Accurate Timestamps on Verbatim Speech Transcriptions
- Authors: Laurin Wagner, Bernhard Thallinger, Mario Zusag
- Abstract summary: We fine-tune the model to produce more verbatim speech transcriptions.
We employ several techniques to increase robustness against multiple speakers and background noise.
- Score: 0.5120567378386615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We demonstrate that carefully adjusting the tokenizer of the Whisper speech recognition model significantly improves the precision of word-level timestamps when applying dynamic time warping to the decoder's cross-attention scores. We fine-tune the model to produce more verbatim speech transcriptions and employ several techniques to increase robustness against multiple speakers and background noise. These adjustments achieve state-of-the-art performance on benchmarks for verbatim speech transcription, word segmentation, and the timed detection of filler events, and can further mitigate transcription hallucinations. The code is openly available at https://github.com/nyrahealth/CrisperWhisper.
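To make the timestamp mechanism concrete, here is a toy sketch (the shapes, names, and random attention matrix are illustrative assumptions, not the authors' code): dynamic time warping over a token-by-frame cross-attention matrix yields a monotonic alignment, from which each token's start and end times follow.

```python
import numpy as np

def dtw_path(cost: np.ndarray) -> list[tuple[int, int]]:
    """Monotonic alignment through a (tokens x frames) cost matrix via DTW."""
    T, F = cost.shape
    acc = np.full((T + 1, F + 1), np.inf)
    acc[0, 0] = 0.0
    for t in range(1, T + 1):
        for f in range(1, F + 1):
            acc[t, f] = cost[t - 1, f - 1] + min(
                acc[t - 1, f - 1],  # advance token and frame
                acc[t - 1, f],      # advance token only
                acc[t, f - 1],      # advance frame only
            )
    # Backtrack from the bottom-right corner to recover the alignment.
    path, t, f = [], T, F
    while t > 0 and f > 0:
        path.append((t - 1, f - 1))
        step = np.argmin([acc[t - 1, f - 1], acc[t - 1, f], acc[t, f - 1]])
        if step == 0:
            t, f = t - 1, f - 1
        elif step == 1:
            t -= 1
        else:
            f -= 1
    return path[::-1]

# Toy example: 4 text tokens attending over 10 audio frames of 20 ms each.
rng = np.random.default_rng(0)
attention = rng.random((4, 10))   # stand-in for decoder cross-attention scores
path = dtw_path(-attention)       # high attention = low cost
frame_dur = 0.02
for tok in range(4):
    frames = [f for t, f in path if t == tok]
    print(f"token {tok}: {min(frames) * frame_dur:.2f}s to {max(frames) * frame_dur:.2f}s")
```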
Related papers
- Continuous Speech Tokenizer in Text To Speech [27.057221389827735]
We propose a simple yet effective continuous speech tokenizer and a text-to-speech model based on continuous speech tokens.
Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MOS).
This enhancement is attributed to the continuous speech tokenizer's better information preservation rate across both low and high frequencies.
arXiv Detail & Related papers (2024-10-22T15:02:37Z)
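To see the discrete-versus-continuous contrast concretely, a toy sketch (the codebook and frame tensors are invented stand-ins, not the paper's model): a discrete tokenizer snaps each encoder frame to its nearest codebook entry and loses detail, while a continuous tokenizer passes the frame embeddings through unchanged.

```python
import torch

codebook = torch.randn(512, 256)                 # 512 discrete code vectors

def discrete_tokenize(frames: torch.Tensor) -> torch.Tensor:
    """Snap each frame to its nearest codebook entry (information is lost)."""
    dists = torch.cdist(frames, codebook)        # (frames, codes)
    return codebook[dists.argmin(dim=-1)]

def continuous_tokenize(frames: torch.Tensor) -> torch.Tensor:
    """Pass encoder frames through as continuous tokens (nothing is quantized)."""
    return frames

frames = torch.randn(100, 256)                   # stand-in encoder output
err = (discrete_tokenize(frames) - frames).norm() / frames.norm()
print(f"relative quantization error: {err:.2f}")  # the continuous path would give 0
```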
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- DASB -- Discrete Audio and Speech Benchmark [12.02056212008393]
We release the Discrete Audio and Speech Benchmark (DASB), a leaderboard for benchmarking discrete audio tokens across a range of tasks.
Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks.
However, the performance gap between semantic tokens and standard continuous representations remains substantial.
arXiv Detail & Related papers (2024-06-20T13:23:27Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28-fold decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters [71.02335065794384]
We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences.
We show that our model improves translation quality and isochrony compared to previous work.
arXiv Detail & Related papers (2023-05-22T16:36:04Z)
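A minimal sketch of the target-factor idea above (module names and sizes are invented for illustration, not the paper's code): one shared decoder state feeds two heads that jointly predict the target phoneme and a quantized duration, so content and timing are learned together.

```python
import torch
import torch.nn as nn

class FactoredHead(nn.Module):
    """Joint prediction of a target phoneme and its duration from one decoder state."""
    def __init__(self, d_model=512, n_phonemes=100, n_duration_bins=32):
        super().__init__()
        self.phoneme_head = nn.Linear(d_model, n_phonemes)
        self.duration_head = nn.Linear(d_model, n_duration_bins)  # quantized durations

    def forward(self, decoder_state):
        return self.phoneme_head(decoder_state), self.duration_head(decoder_state)

head = FactoredHead()
state = torch.randn(2, 7, 512)                    # (batch, target length, d_model)
phoneme_logits, duration_logits = head(state)
# Training would sum two cross-entropy losses so timing and content share one model.
```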
- Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model.
Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus.
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
arXiv Detail & Related papers (2023-02-27T10:40:41Z)
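As a rough illustration of the BERT-based pause insertion described above (a sketch under assumed labels, not the authors' implementation): a classifier head over BERT token states predicts a pause class at each position, which a TTS front end can then consume.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
pause_head = nn.Linear(bert.config.hidden_size, 3)  # no pause / short / long

text = "so I went there yesterday and it was closed"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state       # (1, tokens, hidden)
pause_logits = pause_head(hidden)                   # per-token pause class scores
print(pause_logits.argmax(-1))                      # untrained head, so labels are random
```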
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
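The pseudo-language construction can be approximated in a few lines (a sketch; the random features stand in for outputs of any pretrained speech encoder): quantize frames with k-means, then collapse consecutive repeats into a compact discrete target for self-supervised pseudo speech recognition.

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

# Stand-in for frame-level features from a pretrained speech encoder.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 39))            # 200 frames x 39 dims

# Quantize each frame into one of K pseudo-units.
units = KMeans(n_clusters=25, n_init=10, random_state=0).fit_predict(features)

# Collapse consecutive repeats into a compact pseudo-token sequence,
# which serves as the target of a self-supervised pseudo ASR task.
pseudo_tokens = [u for u, _ in groupby(units)]
print(len(units), "frames ->", len(pseudo_tokens), "pseudo tokens")
```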
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Topic Model Robustness to Automatic Speech Recognition Errors in Podcast Transcripts [4.526933031343007]
In this work, we explore the robustness of a Latent Dirichlet Allocation topic model when applied to transcripts created by an automatic speech recognition engine.
First, we observe a baseline of cosine similarity scores between topic embeddings from automatic transcriptions and the descriptions of the podcasts written by the podcast creators.
We then observe how the cosine similarities decrease as transcription noise increases and conclude that even when automatic speech recognition transcripts are erroneous, it is still possible to obtain high-quality topic embeddings from the transcriptions.
arXiv Detail & Related papers (2021-09-25T07:59:31Z)
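The robustness measurement above reduces to a cosine similarity between two topic embeddings; a minimal version follows (the embedding vectors are placeholders for whatever topic vectors the pipeline produces).

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder topic embeddings for a clean description vs. a noisy ASR transcript.
clean_topics = np.array([0.8, 0.1, 0.05, 0.05])
noisy_topics = np.array([0.7, 0.2, 0.05, 0.05])
print(f"cosine similarity: {cosine_similarity(clean_topics, noisy_topics):.3f}")
```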
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language directly into text in another language.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances [16.639133822656458]
We decode audio from the British National Corpus with an attentional encoder-decoder model trained solely on the LibriSpeech corpus.
We observe that there are many 5-second recordings that produce more than 500 characters of decoding output.
A frame-synchronous hybrid (DNN-HMM) model trained on the same data does not produce these unusually long transcripts.
arXiv Detail & Related papers (2020-02-12T18:53:56Z)
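The reported failure mode suggests a simple sanity check; the sketch below flags outputs that are implausibly long for their audio duration (the default threshold mirrors the paper's 500-characters-in-5-seconds observation but is otherwise an arbitrary choice, not something the paper proposes).

```python
def looks_hallucinated(transcript: str, audio_seconds: float,
                       max_chars_per_second: float = 100.0) -> bool:
    """Flag transcripts that are implausibly long for the audio duration.

    The default threshold mirrors the observation that 5-second recordings
    producing more than 500 characters of output are suspect.
    """
    return len(transcript) > max_chars_per_second * audio_seconds

print(looks_hallucinated("a" * 600, 5.0))  # True: 600 chars from 5 s of audio
```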