Related papers: Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

URL: http://arxiv.org/abs/2504.15066v1
Date: Mon, 21 Apr 2025 12:51:54 GMT
Title: Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides
Authors: Jinghua Zhao, Yuhang Jia, Shiyao Wang, Jiaming Zhou, Hui Wang, Yong Qin,
Abstract summary: We release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription.<n>We develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks.<n>Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8% and 25%, respectively.
Score: 12.148223089382816
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or speaking contextual video, neglecting the potential of combining these different valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcription, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation slide information improve ASR performance by approximately 8\% and 25\%, respectively, with a combined performance improvement of about 35\%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/

Related papers

Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio. Our framework learns tri-modal representations in a unified self-supervised transformer. Our model pre-trained on only 0.9M data achieves improving results against state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs. XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
arXiv Detail & Related papers (2024-03-21T13:52:17Z)
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM) VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation. We show that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements.
arXiv Detail & Related papers (2024-02-23T07:21:32Z)
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. We propose a novel training strategy, processing with visual speech units. We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition [20.476882754923047]
Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR) In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos.
arXiv Detail & Related papers (2024-01-18T07:19:10Z)
TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs. TVLT attains performance comparable to its text-based counterpart, on various multimodal tasks. Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z)
Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions. We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA) Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.