OCR-Enhanced Multimodal ASR Can Read While Listening
- URL: http://arxiv.org/abs/2601.18393v1
- Date: Mon, 26 Jan 2026 11:51:08 GMT
- Title: OCR-Enhanced Multimodal ASR Can Read While Listening
- Authors: Junli Chen, Changli Tang, Yixuan Li, Guangzhi Sun, Chao Zhang,
- Abstract summary: Donut-Whisper is an audio-visual ASR model with dual encoder to leverage visual information to improve speech recognition performance in both English and Chinese.<n>We propose a new multilingual audio-visual speech recognition dataset based on movie clips containing both Chinese and English partitions.<n> Donut-Whisper achieved significantly better performance on both English and Chinese partition of the dataset compared to both Donut and Whisper large V3 baselines.
- Score: 30.485215851341874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, an audio-visual ASR model with dual encoder to leverage visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantage of the linear and the Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. Meanwhile, we propose a lightweight knowledge distillation scheme showcasing the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we propose a new multilingual audio-visual speech recognition dataset based on movie clips containing both Chinese and English partitions. As a result, Donut-Whisper achieved significantly better performance on both English and Chinese partition of the dataset compared to both Donut and Whisper large V3 baselines. In particular, an absolute 5.75% WER reduction and a 16.5% absolute CER reduction were achieved on the English and Chinese sets respectively compared to the Whisper ASR baseline.
Related papers
- mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition [31.824116269818564]
We propose mWhisper-Flamingo for multilingual Audio-Visual Speech Recognition.<n>It combines the strengths of a pre-trained audio model (Whisper) and video model (AV-HuBERT)<n>Audio-visual mWhisper-Flamingo consistently outperforms audio-only Whisper on all languages in noisy conditions.
arXiv Detail & Related papers (2025-02-03T17:29:52Z) - Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation [45.29184681700463]
Speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder.
We propose Whisper-Flamingo which integrates visual features into the Whisper speech recognition and translation model with gated cross attention.
Our models achieve state-of-the-art ASR WER (0.68%) and AVSR WER (0.76%) on LRS3, and state-of-the-art ASR WER (1.3%) and AVSR WER (1.4%) on LRS2.
arXiv Detail & Related papers (2024-06-14T14:36:54Z) - XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation [58.72068260933836]
The input and output of the system are multimodal (i.e., audio and visual speech)
We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages.
In contrast to Speech-to-Speech Translation (A2A), which solely translates between audio modalities, the proposed AV2AV directly translates between audio-visual speech.
arXiv Detail & Related papers (2023-12-05T05:36:44Z) - AKVSR: Audio Knowledge Empowered Visual Speech Recognition by
Compressing Audio Knowledge of a Pretrained Model [53.492751392755636]
We propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality.
We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used LRS3 dataset.
arXiv Detail & Related papers (2023-08-15T06:38:38Z) - Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot
AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction
and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.