ViSpeR: Multilingual Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2406.00038v1
- Date: Mon, 27 May 2024 14:48:51 GMT
- Title: ViSpeR: Multilingual Audio-Visual Speech Recognition
- Authors: Sanath Narayan, Yasser Abdelaziz Dahou Djilali, Ankit Singh, Eustache Le Bihan, Hakim Hacid
- Abstract summary: This work presents an extensive and detailed study on Audio-Visual Speech Recognition for five widely spoken languages.
We have collected large-scale datasets for each language except English and trained supervised learning models on them.
Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language.
- Score: 9.40993779729177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work presents an extensive and detailed study on Audio-Visual Speech Recognition (AVSR) for five widely spoken languages: Chinese, Spanish, English, Arabic, and French. We have collected large-scale datasets for each language except English and trained supervised learning models on them. Our model, ViSpeR, is trained in a multi-lingual setting, resulting in competitive performance on newly established benchmarks for each language. The datasets and models are released to the community with the aim of serving as a foundation for further research and exploration on Audio-Visual Speech Recognition, an increasingly important area of research. Code is available at https://github.com/YasserdahouML/visper.
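As a rough illustration of the AVSR setup described in the abstract, below is a minimal sketch of an audio-visual model that fuses a lip-video stream with an audio stream and emits per-frame token posteriors for a CTC-style objective. This is a generic PyTorch example, not the actual ViSpeR architecture or repository API; the class and parameter names (AVSRSketch, dim, vocab_size, 88x88 mouth crops, 80-dim log-mel features) are illustrative assumptions.

```python
# Minimal audio-visual speech recognition sketch (hypothetical; not the ViSpeR implementation).
import torch
import torch.nn as nn


class AVSRSketch(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        # Visual front-end: grayscale mouth crops (B, T, 1, 88, 88) -> per-frame features.
        self.visual_frontend = nn.Sequential(
            nn.Conv3d(1, dim, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep the time axis
        )
        # Audio front-end: 80-dim log-mel frames (B, T, 80) -> same feature size.
        self.audio_frontend = nn.Linear(80, dim)
        # Shared Transformer encoder over the concatenated (fused) streams.
        encoder_layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Per-frame token posteriors for a CTC-style training objective.
        self.classifier = nn.Linear(2 * dim, vocab_size)

    def forward(self, video: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # video: (B, T, 1, 88, 88); audio: (B, T, 80), assumed time-aligned.
        v = self.visual_frontend(video.transpose(1, 2))  # (B, dim, T, 1, 1)
        v = v.flatten(2).transpose(1, 2)                 # (B, T, dim)
        a = self.audio_frontend(audio)                   # (B, T, dim)
        fused = torch.cat([v, a], dim=-1)                # (B, T, 2*dim)
        return self.classifier(self.encoder(fused)).log_softmax(-1)


# Dummy forward pass: batch of 2 clips, 16 time steps.
model = AVSRSketch(vocab_size=1000)
log_probs = model(torch.randn(2, 16, 1, 88, 88), torch.randn(2, 16, 80))
print(log_probs.shape)  # torch.Size([2, 16, 1000])
```

A multi-lingual setup like the one described in the abstract would typically share such an encoder across languages and train on the pooled data with a shared token vocabulary, though the exact design choices in ViSpeR may differ.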
Related papers
- Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings [8.660203441911554]
We propose a methodology for training language models leveraging spoken language audio data.
This leads to an improved language model for analyzing spoken transcripts while avoiding an audio processing overhead at test time.
In our experiments, the student model achieves consistent improvement over traditional language models on tasks analyzing spoken transcripts.
arXiv Detail & Related papers (2023-11-13T01:53:12Z) - CL-MASR: A Continual Learning Benchmark for Multilingual ASR [15.974765568276615]
We propose CL-MASR, a benchmark for studying multilingual automatic speech recognition in a continual learning setting.
CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics.
To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task.
arXiv Detail & Related papers (2023-10-25T18:55:40Z) - Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
This paper proposes a novel lip reading framework, especially for low-resource languages.
Because low-resource languages lack sufficient video-text paired data to train such models, developing lip reading models for them is considered challenging.
arXiv Detail & Related papers (2023-08-18T05:19:03Z) - AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z) - VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z) - Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models trained with more data outperform monolingual ones, but when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z) - Discovering Phonetic Inventories with Crosslingual Automatic Speech Recognition [71.49308685090324]
This paper investigates the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language.
We find that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
arXiv Detail & Related papers (2022-01-26T22:12:55Z) - Multilingual Jointly Trained Acoustic and Written Word Embeddings [22.63696520064212]
We extend the idea of jointly trained acoustic and written word embeddings to multiple low-resource languages.
We jointly train an acoustic word embedding (AWE) model and an acoustically grounded word embedding (AGWE) model, using phonetically transcribed data from multiple languages.
The pre-trained models can then be used for unseen zero-resource languages, or fine-tuned on data from low-resource languages.
arXiv Detail & Related papers (2020-06-24T19:16:02Z) - That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovers that even phones unique to a single language can benefit greatly from adding training data in other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)