Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection
- URL: http://arxiv.org/abs/2505.23627v1
- Date: Thu, 29 May 2025 16:34:47 GMT
- Title: Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection
- Authors: Griffin Dietz Smith, Dianna Yee, Jennifer King Chen, Leah Findlater
- Abstract summary: We propose an end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. We conducted two case studies -- children's read-aloud and adult atypical speech -- and found that our proposed strategies improve verbatim transcription and miscue detection compared to current state-of-the-art.
- Score: 7.650371454756065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when ASR inaccurately transcribes verbatim speech. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. Our contributions include: first, demonstrating that incorporating reading text through prompting benefits verbatim transcription performance over fine-tuning, and second, showing that it is feasible to augment speech recognition tasks for end-to-end miscue detection. We conducted two case studies -- children's read-aloud and adult atypical speech -- and found that our proposed strategies improve verbatim transcription and miscue detection compared to current state-of-the-art.
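The post-hoc baseline that the abstract contrasts against can be sketched as a word-level alignment between the target reading text and the ASR transcript, with unaligned words labeled as miscues. The sketch below uses Python's standard-library `difflib`; the function name `detect_miscues` and the miscue labels are illustrative, not taken from the paper:

```python
import difflib

def detect_miscues(target_text: str, asr_transcript: str):
    """Post-hoc miscue detection: align the ASR transcript against the
    target reading text and label word-level differences as miscues."""
    ref = target_text.lower().split()
    hyp = asr_transcript.lower().split()
    miscues = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if op == "replace":    # reader produced a different word
            miscues.append(("substitution", ref[i1:i2], hyp[j1:j2]))
        elif op == "delete":   # word in the target text was skipped
            miscues.append(("omission", ref[i1:i2], []))
        elif op == "insert":   # extra word not in the target text
            miscues.append(("insertion", [], hyp[j1:j2]))
    return miscues

print(detect_miscues("the quick brown fox jumps",
                     "the quick brown box jumps over"))
# → [('substitution', ['fox'], ['box']), ('insertion', [], ['over'])]
```

This illustrates why the post-hoc approach degrades when ASR is not verbatim: any transcription error is indistinguishable from a real reading miscue, which motivates the paper's end-to-end design.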
Related papers
- Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation [12.571782794778182]
Chain-of-Thought (CoT) prompting has been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. We find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution.
arXiv Detail & Related papers (2025-10-03T15:42:38Z)
- Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR [15.311893064721858]
This study proposes a novel approach to using TV subtitles within a weakly supervised (WS) Automatic Speech Recognition (ASR) framework. Rather than using subtitles as direct supervision signals, our method reimagines them as context-rich prompts. Generated pseudo transcripts become the primary targets, with subtitles acting as guiding cues for iterative refinement.
arXiv Detail & Related papers (2025-09-01T11:43:07Z)
- Detecting the Undetectable: Assessing the Efficacy of Current Spoof Detection Methods Against Seamless Speech Edits [82.8859060022651]
We introduce the Speech INfilling Edit (SINE) dataset, created with Voicebox. Subjective evaluations confirm that speech edited using this novel technique is more challenging to detect than conventional cut-and-paste methods. Despite human difficulty, experimental results demonstrate that self-supervised-based detectors can achieve remarkable performance in detection, localization, and generalization.
arXiv Detail & Related papers (2025-01-07T14:17:47Z)
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
In this article, we tackle the challenge of developing ASR systems without paired speech and text corpora. We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling. This innovative model surpasses the performance of previous unsupervised ASR models under the lexicon-free setting.
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
- Can Authorship Attribution Models Distinguish Speakers in Speech Transcripts? [4.148732457277201]
Authorship verification is the task of determining if two distinct writing samples share the same author. In this paper, we explore the attribution of transcribed speech, which poses novel challenges. We propose a new benchmark for speaker attribution focused on human-transcribed conversational speech transcripts.
arXiv Detail & Related papers (2023-11-13T18:54:17Z)
- Boosting Punctuation Restoration with Data Generation and Reinforcement Learning [70.26450819702728]
Punctuation restoration is an important task in automatic speech recognition (ASR).
The discrepancy between written punctuated texts and ASR texts limits the usability of written texts in training punctuation restoration systems for ASR texts.
This paper proposes a reinforcement learning method to exploit in-topic written texts and recent advances in large pre-trained generative language models to bridge this gap.
arXiv Detail & Related papers (2023-07-24T17:22:04Z)
- Looking and Listening: Audio Guided Text Recognition [62.98768236858089]
Text recognition in the wild is a long-standing problem in computer vision.
Recent studies suggest vision and language processing are effective for scene text recognition.
Yet, correcting edit errors such as insertions, deletions, and substitutions remains the main challenge for existing approaches.
We propose the AudioOCR, a simple yet effective probabilistic audio decoder for mel spectrogram sequence prediction.
arXiv Detail & Related papers (2023-06-06T08:08:18Z)
- Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling [10.283092375534311]
We propose a simple and effective modification of alignment graph construction using weighted Finite State Transducers.
The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment.
Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements.
arXiv Detail & Related papers (2023-05-30T09:57:36Z)
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to the relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Spoken Term Detection Methods for Sparse Transcription in Very Low-resource Settings [20.410074074340447]
Experiments on two oral languages show that a pretrained universal phone recognizer, fine-tuned with only a few minutes of target language speech, can be used for spoken term detection.
We show that representing phoneme recognition ambiguity in a graph structure can further boost the recall while maintaining high precision in the low resource spoken term detection task.
arXiv Detail & Related papers (2021-06-11T04:09:54Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into text that is readable for humans and downstream tasks while preserving the speaker's intended meaning.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.