OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
- URL: http://arxiv.org/abs/2306.06410v1
- Date: Sat, 10 Jun 2023 11:04:10 GMT
- Title: OpenSR: Open-Modality Speech Recognition via Maintaining Multi-Modality Alignment
- Authors: Xize Cheng, Tao Jin, Linjun Li, Wang Lin, Xinyu Duan and Zhou Zhao
- Abstract summary: We propose a training system, Open-modality Speech Recognition (OpenSR).
OpenSR enables modality transfer from one modality to any other in three different settings.
It achieves highly competitive zero-shot performance compared to the existing few-shot and full-shot lip-reading methods.
- Score: 57.15449072423539
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech recognition builds a bridge between multimedia streams (audio-only, visual-only or audio-visual) and the corresponding text transcription. However, training a model for a new domain often gets stuck on the lack of new-domain utterances, especially labeled visual utterances. To break through this restriction, we attempt to achieve zero-shot modality transfer by maintaining the multi-modality alignment in phoneme space learned from unlabeled multimedia utterances in the high-resource domain during pre-training (Shi et al., 2022), and propose a training system, Open-modality Speech Recognition (OpenSR), that makes models trained on a single modality (e.g., audio-only) applicable to further modalities (e.g., visual-only and audio-visual). Furthermore, we employ a cluster-based prompt tuning strategy to handle the domain shift in scenarios where the new-domain utterances contain only common words. We demonstrate that OpenSR enables modality transfer from one modality to any other in three different settings (zero-, few- and full-shot), and achieves highly competitive zero-shot performance compared to existing few-shot and full-shot lip-reading methods. To the best of our knowledge, OpenSR achieves state-of-the-art word error rates on LRS2 for audio-visual speech recognition and lip-reading, at 2.7% and 25.0% respectively. The code and demo are available at https://github.com/Exgc/OpenSR.
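To make the modality-transfer idea concrete, here is a minimal, hypothetical sketch (in PyTorch) of the recipe the abstract describes: a pre-trained, frozen encoder with per-modality front-ends and a shared backbone keeps audio and visual features in one phoneme-aligned space, a decoder is then fitted on labeled audio only, and the same decoder is reused unchanged on visual input at inference. All module names, dimensions, and the simple linear decoder are illustrative assumptions, not the authors' implementation (which builds on AV-HuBERT-style pre-training); the cluster-based prompt tuning for domain shift is omitted here.

```python
# Minimal sketch of zero-shot modality transfer (NOT the authors' implementation).
# Assumption: a pre-trained, frozen encoder with per-modality front-ends and a shared
# backbone maps audio-only or visual-only input into the same phoneme-aligned feature
# space, so a decoder trained on audio alone can be reused on video without any
# labeled visual data. All sizes and names below are illustrative.
import torch
import torch.nn as nn

FEAT_DIM, VOCAB = 256, 1000        # hypothetical feature size and output vocabulary


class SharedModalityEncoder(nn.Module):
    """Per-modality front-ends feeding one shared backbone keep the modalities aligned."""

    def __init__(self):
        super().__init__()
        self.audio_frontend = nn.Linear(80, FEAT_DIM)    # e.g. log-mel frames
        self.video_frontend = nn.Linear(512, FEAT_DIM)   # e.g. lip-ROI embeddings
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(FEAT_DIM, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, x, modality):
        front = self.audio_frontend if modality == "audio" else self.video_frontend
        return self.backbone(front(x))                   # same space for both modalities


encoder = SharedModalityEncoder().eval()                 # frozen after pre-training
for p in encoder.parameters():
    p.requires_grad_(False)

decoder = nn.Linear(FEAT_DIM, VOCAB)                     # trained on AUDIO features only

# --- supervised step on labeled audio from the high-resource domain ---
audio = torch.randn(2, 50, 80)                           # (batch, frames, mel bins)
labels = torch.randint(0, VOCAB, (2, 50))
logits = decoder(encoder(audio, "audio"))
loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
loss.backward()

# --- zero-shot inference on video: the audio-trained decoder is reused unchanged ---
video = torch.randn(2, 50, 512)                          # (batch, frames, lip features)
with torch.no_grad():
    zero_shot_logits = decoder(encoder(video, "video"))
print(zero_shot_logits.shape)                            # torch.Size([2, 50, 1000])
```

The design point this toy example illustrates is that the decoder never sees visual features during training; it transfers only because the frozen encoder places both modalities in the same phoneme-aligned space.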
Related papers
- Large Language Models Are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.
We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.
We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.81% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Audio-visual Generalized Zero-shot Learning the Easy Way [20.60905505473906]
We introduce EZ-AVGZL, which aligns audio-visual embeddings with transformed text representations.
We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks.
arXiv Detail & Related papers (2024-07-18T01:57:16Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, that is based on learning a prior model.
The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26% WER.
We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework, allowing the categorization of the representations into a small number of phoneme-like units, is used to train the model for learning semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- AVGZSLNet: Audio-Visual Generalized Zero-Shot Learning by Reconstructing Label Features from Multi-Modal Embeddings [37.3282534461213]
We propose a novel approach for generalized zero-shot learning in a multi-modal setting.
We use the semantic relatedness of text embeddings as a means for zero-shot learning by aligning audio and video embeddings with the corresponding class label text feature space.
arXiv Detail & Related papers (2020-05-27T14:58:34Z)
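As referenced in the CTAP entry above, the following is a minimal, self-contained sketch (in PyTorch) of the general idea of contrastive phoneme-acoustic alignment: two small encoders project phoneme sequences and speech frames into one joint space, and a symmetric InfoNCE-style loss pulls matching pairs together. The encoder architectures, dimensions, and pooling are illustrative assumptions and do not reproduce the CTAP paper's actual model or training setup.

```python
# Minimal sketch of contrastive phoneme-acoustic alignment in the spirit of CTAP
# (NOT code from any paper listed above; sizes and architectures are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_PHONEMES, N_MELS = 128, 70, 80


class UtteranceEncoder(nn.Module):
    """Encodes a sequence (phoneme IDs or speech frames) into one unit-norm vector."""

    def __init__(self, input_module, input_dim):
        super().__init__()
        self.input_module = input_module      # embedding for phonemes, identity for speech
        self.rnn = nn.GRU(input_dim, DIM, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(self.input_module(x))
        return F.normalize(out.mean(dim=1), dim=-1)   # mean-pool over time, L2-normalize


def contrastive_loss(phon_emb, speech_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th phoneme sequence should match the i-th utterance."""
    logits = phon_emb @ speech_emb.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2


phoneme_encoder = UtteranceEncoder(nn.Embedding(N_PHONEMES, DIM), DIM)
speech_encoder = UtteranceEncoder(nn.Identity(), N_MELS)

# toy batch: 8 paired utterances, 40 phoneme tokens and 200 speech frames each
phonemes = torch.randint(0, N_PHONEMES, (8, 40))
speech = torch.randn(8, 200, N_MELS)
loss = contrastive_loss(phoneme_encoder(phonemes), speech_encoder(speech))
loss.backward()
print(float(loss))
```

This kind of joint phoneme-speech space is the common thread behind several entries above: once text-side and acoustic-side representations are aligned, downstream decoders or classifiers can be shared across inputs.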