SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2401.09759v2
- Date: Tue, 2 Jul 2024 13:43:59 GMT
- Title: SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
- Authors: Hao Wang, Shuhei Kurita, Shuichiro Shimizu, Daisuke Kawahara
- Abstract summary: Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR).
In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos.
- Score: 20.476882754923047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) that uses video as a complement to audio. In AVSR, considerable effort has been directed at datasets for facial features such as lip reading, but these often fall short in evaluating image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset built from scientific paper explanation videos. SlideAVSR provides a new benchmark in which models transcribe speech utterances with the help of the text on slides in presentation recordings. Because technical terms, which are frequent in paper explanations, are notoriously difficult to transcribe without reference text, SlideAVSR spotlights a new aspect of the AVSR problem. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.
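To make the DocWhisper idea concrete, here is a minimal sketch that feeds OCR'd slide keywords to Whisper via the `initial_prompt` argument of the openai-whisper package. The audio path and keyword list are placeholders, and the authors' actual pipeline (including how slide text is extracted and selected) may differ.

```python
# Minimal sketch (not the authors' code): bias Whisper toward slide
# terminology by passing OCR'd keywords as a decoding prompt.
import whisper

# Hypothetical keywords, e.g. obtained by running an OCR tool over the slides.
slide_keywords = ["audio-visual speech recognition", "Transformer", "beam search"]

model = whisper.load_model("base")
result = model.transcribe(
    "paper_explanation.wav",                   # placeholder audio path
    initial_prompt=", ".join(slide_keywords),  # slide text as reference context
)
print(result["text"])
```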
Related papers
- Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides [12.148223089382816]
We release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription.
We develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks.
Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively.
arXiv Detail & Related papers (2025-04-21T12:51:54Z)
- Listening and Seeing Again: Generative Error Correction for Audio-Visual Speech Recognition [39.206005299985605]
We propose a novel generative error correction (GER) paradigm for AVSR, termed AVGER, that follows the concept of "listening and seeing again".
The proposed AVGER can reduce Word Error Rate (WER) by 24% compared to current mainstream AVSR systems.
arXiv Detail & Related papers (2025-01-03T10:51:14Z)
- VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability.
VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, which leverages a mixture-of-experts for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection (sketched below).
Experiments show our model achieves state-of-the-art results on three benchmarks.
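As a rough illustration of the lightweight visual-to-speech projection mentioned above, the sketch below maps a sequence of visual tokens into a speech embedding space with a single linear layer; the dimensions and shapes are assumptions for illustration, not EVA's actual configuration.

```python
# Minimal sketch of mapping visual tokens into a speech embedding space
# with a lightweight linear projection (illustrative; not EVA's code).
import torch
import torch.nn as nn

class VisualToSpeechProjector(nn.Module):
    def __init__(self, visual_dim: int = 768, speech_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(visual_dim, speech_dim)  # lightweight projection

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, visual_dim)
        return self.proj(visual_tokens)  # (batch, num_tokens, speech_dim)

# Hypothetical shapes: 16 video frames encoded into 768-dim visual tokens.
visual_tokens = torch.randn(2, 16, 768)
speech_space_tokens = VisualToSpeechProjector()(visual_tokens)
print(speech_space_tokens.shape)  # torch.Size([2, 16, 1024])
```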
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models [21.85677682584916]
We introduce speculative speech recognition (SSR) and propose a model that performs SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM).
arXiv Detail & Related papers (2024-07-05T16:52:55Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving a WER of 26%.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment [57.15449072423539]
We propose a training system, Open-modality Speech Recognition (OpenSR).
OpenSR enables modality transfer from one modality to any other in three different settings.
It achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
arXiv Detail & Related papers (2023-06-10T11:04:10Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while performing lightweight domain adaptation.
We show that these components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
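The sketch below illustrates the general masked-prediction recipe over a unified token sequence (mask a fraction of tokens and predict them with a Transformer). The vocabulary size, masking ratio, and backbone are placeholders rather than VATLM's actual setup.

```python
# Minimal sketch of masked prediction over a unified token sequence:
# mask some tokens and train a model to predict the originals.
# (Illustrative only; VATLM's tokenizer and backbone are more involved.)
import torch
import torch.nn as nn

vocab_size, dim, mask_id = 500, 64, 0
tokens = torch.randint(1, vocab_size, (2, 20))   # unified audio/visual/text tokens
mask = torch.rand(tokens.shape) < 0.15           # mask ~15% of positions
inputs = tokens.masked_fill(mask, mask_id)

backbone = nn.Sequential(
    nn.Embedding(vocab_size, dim),
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
)
logits = nn.Linear(dim, vocab_size)(backbone(inputs))

# The loss is computed only on the masked positions.
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
print(loss.item())
```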
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- Weakly Supervised Construction of ASR Systems with Massive Video Data [18.5050375783871]
We present a weakly supervised framework for constructing ASR systems with massive video data.
We propose an effective approach to extract high-quality audio aligned with transcripts from videos, based on Optical Character Recognition (OCR).
Our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
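As a rough sketch of the OCR-based extraction idea, the code below reads burned-in subtitles from sampled video frames and pairs the recognized text with timestamps, which could then be used to cut aligned audio segments. The subtitle region, sampling rate, and file name are assumptions; the paper's full pipeline is not reproduced here.

```python
# Minimal sketch: OCR burned-in subtitles from video frames and pair them
# with timestamps as weak transcripts (illustrative only).
import cv2
import pytesseract

def ocr_subtitles(video_path: str, every_n_frames: int = 30):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    pairs, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % every_n_frames == 0:
            h = frame.shape[0]
            subtitle_region = frame[int(0.8 * h):, :]  # assume subtitles in the bottom 20%
            text = pytesseract.image_to_string(subtitle_region).strip()
            if text:
                pairs.append((frame_idx / fps, text))  # (timestamp in seconds, weak transcript)
        frame_idx += 1
    cap.release()
    return pairs

# Usage with a placeholder path:
# print(ocr_subtitles("lecture_video.mp4")[:5])
```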
arXiv Detail & Related papers (2020-08-04T03:11:32Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Contextualizing ASR Lattice Rescoring with Hybrid Pointer Network Language Model [26.78064626111014]
In building automatic speech recognition systems, we can exploit the contextual information provided by video metadata.
We first use an attention-based method to extract contextual vector representations of video metadata and use these representations as part of the inputs to a neural language model.
Second, we propose a hybrid pointer network approach to explicitly interpolate the language model's word probabilities with those of words occurring in the metadata.
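A minimal sketch of this interpolation idea follows: the language model's word distribution is mixed with a copy distribution over words appearing in the metadata, controlled by a gate. The fixed gate value and toy vocabulary are assumptions for illustration.

```python
# Minimal sketch of pointer-style interpolation between an LM distribution
# and a copy distribution over metadata words (illustrative only).
import torch

vocab_size = 10
p_lm = torch.softmax(torch.randn(vocab_size), dim=-1)  # LM word probabilities

# Hypothetical attention weights over metadata tokens and their vocab ids.
metadata_token_ids = torch.tensor([3, 7, 3])            # e.g. "game", "trailer", "game"
attn = torch.softmax(torch.randn(3), dim=-1)

# Scatter the attention mass onto the vocabulary to form the pointer distribution.
p_ptr = torch.zeros(vocab_size).scatter_add(0, metadata_token_ids, attn)

gate = 0.3                                              # a learned gate in practice; fixed here
p_final = gate * p_ptr + (1 - gate) * p_lm
print(p_final.sum())  # ~1.0
```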
arXiv Detail & Related papers (2020-05-15T07:47:33Z)
- Looking Enhances Listening: Recovering Missing Speech Using Images [40.616935661628155]
We present a set of experiments where we show the utility of the visual modality under noisy conditions.
Our results show that multimodal ASR models can recover words that are masked in the input acoustic signal by grounding their transcriptions in the visual representations.
arXiv Detail & Related papers (2020-02-13T17:12:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.