Related papers: ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition

URL: http://arxiv.org/abs/2506.04635v1
Date: Thu, 05 Jun 2025 05:13:01 GMT
Title: ViCocktail: Automated Multi-Modal Data Collection for Vietnamese Audio-Visual Speech Recognition
Authors: Thai-Binh Nguyen, Thi Van Nguyen, Quoc Truong Do, Chi Mai Luong
Abstract summary: We present a practical approach to generate AVSR datasets from raw video. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese.
Score: 4.0048516930686535
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Audio-Visual Speech Recognition (AVSR) has gained significant attention recently due to its robustness against noise, which often challenges conventional speech recognition systems that rely solely on audio features. Despite this advantage, AVSR models remain limited by the scarcity of extensive datasets, especially for most languages beyond English. Automated data collection offers a promising solution. This work presents a practical approach to generate AVSR datasets from raw video, refining existing techniques for improved efficiency and accessibility. We demonstrate its broad applicability by developing a baseline AVSR model for Vietnamese. Experiments show the automatically collected dataset enables a strong baseline, achieving competitive performance with robust ASR in clean conditions and significantly outperforming them in noisy environments like cocktail parties. This efficient method provides a pathway to expand AVSR to more languages, particularly under-resourced ones.

Related papers

Cocktail-Party Audio-Visual Speech Recognition [58.222892601847924]
This study introduces a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems. We contribute a 1526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments. Our approach reduces WER by 67% relative to the state-of-the-art, reducing WER from 119% to 39.2% in extreme noise, without relying on explicit segmentation cues.
arXiv Detail & Related papers (2025-06-02T19:07:51Z)
VHASR: A Multimodal Speech Recognition System With Vision Hotwords [74.94430247036945]
VHASR is a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability. VHASR can effectively utilize key information in images to enhance the model's speech recognition ability.
arXiv Detail & Related papers (2024-10-01T16:06:02Z)
Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks. The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments. We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z)
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness. We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets. Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions. Specifically, we construct a large-scale, high-quality, audio-language dataset, named as Auto-ACD, comprising over 1.5M audio-text pairs. We employ LLM to paraphrase a congruent caption for each audio, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
Iteratively Improving Speech Recognition and Voice Conversion [10.514009693947227]
We first train an ASR model which is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively leveraging the improved ASR model to train VC model and vice-versa, we experimentally show improvement in both the models.
arXiv Detail & Related papers (2023-05-24T11:45:42Z)
AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation. We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters. We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR). It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)
Weakly Supervised Construction of ASR Systems with Massive Video Data [18.5050375783871]
We present a weakly supervised framework for constructing ASR systems with massive video data. We propose an effective approach to extract high-quality audios aligned with transcripts from videos based on Optical Character Recognition (OCR). Our framework can easily produce state-of-the-art results on six public datasets for Mandarin speech recognition.
arXiv Detail & Related papers (2020-08-04T03:11:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.