Improving Medical Speech-to-Text Accuracy with Vision-Language
Pre-training Model
- URL: http://arxiv.org/abs/2303.00091v1
- Date: Mon, 27 Feb 2023 08:06:04 GMT
- Title: Improving Medical Speech-to-Text Accuracy with Vision-Language
Pre-training Model
- Authors: Jaeyoung Huh, Sangjoon Park, Jeong Eun Lee, Jong Chul Ye
- Abstract summary: Speech-To-Text (STT) has the potential to significantly reduce the workload of clinicians who rely on typists to transcribe their voice recordings.
We propose a medical-domain text correction method that modifies the output text of a general STT system.
Our experiments demonstrate that the proposed method offers quantitatively and clinically significant improvements in STT performance in the medical field.
- Score: 36.9873998348851
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic Speech Recognition (ASR) is a technology that converts spoken words
into text, facilitating interaction between humans and machines. One of the
most common applications of ASR is Speech-To-Text (STT) technology, which
simplifies user workflows by transcribing spoken words into text. In the
medical field, STT has the potential to significantly reduce the workload of
clinicians who rely on typists to transcribe their voice recordings. However,
developing an STT model for the medical domain is challenging due to the lack
of sufficient speech and text datasets. To address this issue, we propose a
medical-domain text correction method that modifies the output text of a
general STT system using the Vision Language Pre-training (VLP) method. VLP
combines textual and visual information to correct text based on image
knowledge. Our extensive experiments demonstrate that the proposed method
offers quantitatively and clinically significant improvements in STT
performance in the medical field. We further show that multi-modal
understanding of image and text information outperforms single-modal
understanding using only text information.
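As a concrete illustration, a minimal PyTorch sketch of such image-conditioned text correction is given below. This is not the authors' implementation; the module names, feature dimensions, and the greedy decoding step are all illustrative assumptions.

```python
# A minimal sketch: a seq2seq-style corrector that rewrites general-domain
# STT output while cross-attending to image features, so visual context can
# help fix misrecognized medical terms. All sizes are illustrative.
import torch
import torch.nn as nn

class VLPTextCorrector(nn.Module):
    def __init__(self, vocab_size=30522, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(2048, d_model)  # e.g. regional CNN features
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, stt_tokens, image_feats):
        # stt_tokens: (B, T) token ids from a general-domain STT system
        # image_feats: (B, N, 2048) visual features of the paired image
        text = self.token_emb(stt_tokens)
        visual = self.image_proj(image_feats)
        # Text queries attend to visual keys/values, so image knowledge
        # steers the corrected transcription.
        fused = self.decoder(tgt=text, memory=visual)
        return self.lm_head(fused)  # (B, T, vocab) corrected-token logits

corrector = VLPTextCorrector()
logits = corrector(torch.randint(0, 30522, (2, 16)), torch.randn(2, 36, 2048))
corrected_ids = logits.argmax(-1)  # greedy decoding, for illustration only
```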
Related papers
- SpeechCT-CLIP: Distilling Text-Image Knowledge to Speech for Voice-Native Multimodal CT Analysis [33.90335501244261]
We train a contrastive model that aligns speech and 3D CT volumes in a shared representation space.
Experiments demonstrate improved zero-shot classification F1 from 0.623 to 0.705, recovering 88% of the performance difference.
These findings highlight speech as a practical alternative to text in multimodal pretraining and open the door to voice-driven diagnostic support tools in clinical practice.
arXiv Detail & Related papers (2025-09-24T15:17:21Z)
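A hedged sketch of the contrastive alignment this entry describes, using the standard CLIP-style symmetric InfoNCE objective; the encoders that produce the embeddings are assumed, and the code is not from the paper.

```python
# CLIP-style symmetric contrastive loss between paired speech and 3D CT
# embeddings; encoders are stand-ins supplied by the caller.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(speech_emb, ct_emb, temperature=0.07):
    # speech_emb, ct_emb: (B, D) embeddings of paired speech / CT volumes
    speech_emb = F.normalize(speech_emb, dim=-1)
    ct_emb = F.normalize(ct_emb, dim=-1)
    logits = speech_emb @ ct_emb.t() / temperature  # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Each speech clip should match its own CT volume, and vice versa.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```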
- Grammar Induction from Visual, Speech and Text [91.98797120799227]
This work introduces a novel visual-audio-text grammar induction task (VAT-GI).
Inspired by the fact that language grammar exists beyond text, we argue that text need not be the predominant modality in grammar induction.
We propose a visual-audio-text inside-outside autoencoder (VaTiora) framework, which leverages rich modality-specific and complementary features for effective grammar parsing.
arXiv Detail & Related papers (2024-10-01T02:24:18Z)
- An efficient text augmentation approach for contextualized Mandarin speech recognition [4.600045052545344]
Our study proposes to leverage extensive text-only datasets to contextualize pre-trained ASR models.
To contextualize a pre-trained CIF-based ASR model, we construct a codebook using limited speech-text data.
Our experiments on diverse Mandarin test sets demonstrate that our text augmentation (TA) approach significantly boosts recognition performance.
arXiv Detail & Related papers (2024-06-14T11:53:14Z)
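A loose sketch of codebook-style contextual biasing along these lines; the paper's CIF-based integration is more involved, and the phrase embeddings, temperature, and mixing weight here are illustrative assumptions.

```python
# Bias decoder states toward a codebook built from text-only context
# phrases; no paired speech is needed to build the codebook.
import torch
import torch.nn.functional as F

class ContextCodebook:
    def __init__(self, phrase_embs):
        # phrase_embs: (K, D) embeddings of context phrases (drug names,
        # report templates, ...) produced by any text encoder.
        self.codes = F.normalize(phrase_embs, dim=-1)

    def bias(self, states, weight=0.3, temp=0.1):
        # states: (T, D) per-step acoustic/decoder states for one utterance.
        attn = torch.softmax(
            F.normalize(states, dim=-1) @ self.codes.t() / temp, dim=-1)
        # Mix retrieved context back into the states (a soft copy mechanism).
        return states + weight * attn @ self.codes

book = ContextCodebook(torch.randn(100, 256))  # 100 dummy phrase embeddings
biased = book.bias(torch.randn(50, 256))       # (50, 256) biased states
```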
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose VATLM (Visual-Audio-Text Language Model), a unified cross-modal representation learning framework.
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
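A simplified sketch of masked prediction over unified tokens: it assumes inputs from any modality have already been mapped to one shared discrete vocabulary, masks a fraction of positions, and trains a shared backbone to recover them. Hyperparameters are illustrative.

```python
# Masked prediction over a shared discrete token space; the backbone is
# modality-agnostic, as the entry above describes.
import torch
import torch.nn as nn

class MaskedUnifiedTokenModel(nn.Module):
    def __init__(self, n_tokens=1024, d_model=512, n_layers=6, mask_id=0):
        super().__init__()
        self.mask_id = mask_id  # id 0 doubles as [MASK] here, for brevity
        self.emb = nn.Embedding(n_tokens, d_model)
        layer = nn.TransformerEncoderLayer(d_model, 8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # shared
        self.head = nn.Linear(d_model, n_tokens)

    def loss(self, tokens, mask_prob=0.15):
        # tokens: (B, T) unified token ids from speech, text, or vision.
        mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob
        corrupted = tokens.masked_fill(mask, self.mask_id)
        logits = self.head(self.backbone(self.emb(corrupted)))
        # Only masked positions contribute to the objective.
        return nn.functional.cross_entropy(logits[mask], tokens[mask])
```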
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
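A rough structural sketch of the bridging idea, with a speech encoder feeding a shared unit encoder whose output conditions a text decoder; the module choices and sizes are assumptions, not the paper's architecture.

```python
# Speech features pass through a shared "unit" encoder before the text
# decoder, so the hidden-unit space ties the two modalities together.
import torch
import torch.nn as nn

d = 512
speech_encoder = nn.GRU(80, d, batch_first=True)       # log-mel -> features
unit_layer = nn.TransformerEncoderLayer(d, 8, batch_first=True)
unit_encoder = nn.TransformerEncoder(unit_layer, 4)    # shared across modalities
dec_layer = nn.TransformerDecoderLayer(d, 8, batch_first=True)
text_decoder = nn.TransformerDecoder(dec_layer, 4)
token_emb, lm_head = nn.Embedding(10000, d), nn.Linear(d, 10000)

mels = torch.randn(2, 200, 80)                 # (B, T, 80) input speech
prev_tokens = torch.randint(0, 10000, (2, 20)) # teacher-forced text prefix
speech_feats, _ = speech_encoder(mels)
shared = unit_encoder(speech_feats)            # hidden-unit space
logits = lm_head(text_decoder(token_emb(prev_tokens), memory=shared))
```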
- Text-Aware End-to-end Mispronunciation Detection and Diagnosis [17.286013739453796]
Mispronunciation detection and diagnosis (MDD) technology is a key component of computer-assisted pronunciation training (CAPT) systems.
In this paper, we present a gating strategy that assigns more importance to relevant audio features while suppressing irrelevant text information.
arXiv Detail & Related papers (2022-06-15T04:08:10Z)
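One plausible form of such a gate, sketched below: a learned sigmoid gate computed from both streams scales the text features, so the model can fall back on audio evidence when the text is unreliable. The paper's exact gating may differ.

```python
# A learned sigmoid gate that modulates text features with audio context.
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    def __init__(self, d_audio=256, d_text=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(d_audio + d_text, d_text),
                                  nn.Sigmoid())

    def forward(self, audio_feats, text_feats):
        # audio_feats: (B, T, d_audio), text_feats: (B, T, d_text), aligned.
        g = self.gate(torch.cat([audio_feats, text_feats], dim=-1))
        # g -> 0 suppresses irrelevant text; audio always passes through.
        return torch.cat([audio_feats, g * text_feats], dim=-1)

fused = ModalityGate()(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
```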
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework that generates natural and coherent target speech without any training data from the target speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
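A bare-bones sketch of the decoding step described here: a transformer decoder autoregressively emits mel-spectrogram frames conditioned on an encoded text/context memory. The dimensions and the fixed-length stopping rule are illustrative only.

```python
# Autoregressive mel-frame generation with a transformer decoder.
import torch
import torch.nn as nn

d_model, n_mels = 512, 80
layer = nn.TransformerDecoderLayer(d_model, 8, batch_first=True)
decoder = nn.TransformerDecoder(layer, 6)
mel_in = nn.Linear(n_mels, d_model)   # previous frames -> model space
mel_out = nn.Linear(d_model, n_mels)  # model space -> next mel frame

memory = torch.randn(1, 50, d_model)  # encoded text + surrounding context
frames = torch.zeros(1, 1, n_mels)    # start from an all-zero frame
for _ in range(200):                  # fixed length instead of a stop token
    h = decoder(mel_in(frames), memory)
    nxt = mel_out(h[:, -1:])          # predict the next frame only
    frames = torch.cat([frames, nxt], dim=1)
mel = frames[:, 1:]                   # (1, 200, 80) generated spectrogram
```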
This list is automatically generated from the titles and abstracts of the papers on this site.