Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
- URL: http://arxiv.org/abs/2409.09221v1
- Date: Fri, 13 Sep 2024 22:18:45 GMT
- Title: Multi-modal Speech Transformer Decoders: When Do Multiple Modalities Improve Accuracy?
- Authors: Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill
- Abstract summary: We investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets.
Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels.
Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.
- Score: 12.662031101992968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Decoder-only discrete-token language models have recently achieved significant success in automatic speech recognition. However, systematic analyses of how different modalities impact performance in specific scenarios remain limited. In this paper, we investigate the effects of multiple modalities on recognition accuracy on both synthetic and real-world datasets. Our experiments suggest that: (1) Integrating more modalities can increase accuracy; in particular, our paper is, to the best of our knowledge, the first to show the benefit of combining audio, image context, and lip information; (2) Images as a supplementary modality for speech recognition provide the greatest benefit at moderate noise levels; moreover, they exhibit a different trend compared to inherently synchronized modalities like lip movements; (3) Performance improves on both synthetic and real-world datasets when the most relevant visual information is filtered as a preprocessing step.
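The abstract describes the recipe only at a high level; as a rough illustration, a decoder-only, discrete-token model of this kind can be sketched by prepending an image-context embedding, per-frame lip features, and discrete audio tokens to the text tokens as a single prefix. Every module choice, dimension, and vocabulary size below is an illustrative assumption, not the authors' architecture (positional embeddings and label shifting are also omitted for brevity):

```python
# Illustrative sketch (not the authors' code): a decoder-only model that consumes
# an image-context embedding, per-frame lip features, and discrete audio tokens as
# a prefix, then predicts text tokens. All sizes and modules are assumptions.
import torch
import torch.nn as nn


class MultiModalPrefixDecoder(nn.Module):
    def __init__(self, audio_vocab=1024, text_vocab=5000, d_model=256,
                 image_dim=512, lip_dim=128, n_layers=4, n_heads=4):
        super().__init__()
        self.audio_emb = nn.Embedding(audio_vocab, d_model)   # discrete audio tokens
        self.image_proj = nn.Linear(image_dim, d_model)       # pooled image-context feature
        self.lip_proj = nn.Linear(lip_dim, d_model)           # per-frame lip features
        self.text_emb = nn.Embedding(text_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)  # causal mask -> decoder-only
        self.lm_head = nn.Linear(d_model, text_vocab)

    def forward(self, audio_tokens, image_feat, lip_feats, text_tokens):
        # Prefix order: [image context] + [lip frames] + [audio tokens], then text.
        prefix = torch.cat([
            self.image_proj(image_feat).unsqueeze(1),
            self.lip_proj(lip_feats),
            self.audio_emb(audio_tokens),
        ], dim=1)
        x = torch.cat([prefix, self.text_emb(text_tokens)], dim=1)
        seq_len = x.size(1)
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        return self.lm_head(h[:, prefix.size(1):])  # logits over the text positions only


model = MultiModalPrefixDecoder()
logits = model(
    audio_tokens=torch.randint(0, 1024, (2, 50)),  # (batch, audio frames)
    image_feat=torch.randn(2, 512),                # (batch, image feature)
    lip_feats=torch.randn(2, 30, 128),             # (batch, lip frames, lip feature)
    text_tokens=torch.randint(0, 5000, (2, 12)),   # (batch, text length)
)
print(logits.shape)  # torch.Size([2, 12, 5000])
```

Treating the image as a single pooled context token while keeping lip features at frame level mirrors the distinction the abstract draws between contextual and inherently synchronized modalities; filtering the most relevant visual information would happen upstream, before the image feature is computed.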
Related papers
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Improving Code-Switching and Named Entity Recognition in ASR with Speech Editing based Data Augmentation [22.38340990398735]
We propose a novel data augmentation method by applying the text-based speech editing model.
The experimental results on code-switching and NER tasks show that our proposed method can significantly outperform the audio splicing and neural TTS based data augmentation systems.
arXiv Detail & Related papers (2023-06-14T15:50:13Z)
- Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation [38.75352529988137]
We propose a multi-modal multi-correlation learning framework targeting at the task of audio-visual speech separation.
We define two key correlations: (1) an identity correlation between timbre and facial attributes, and (2) a phonetic correlation.
For implementation, a contrastive learning or adversarial training approach is applied to maximize these two correlations.
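As a hedged illustration of the contrastive option (the paper's exact losses and encoders are not given here), a standard InfoNCE-style objective that pulls matched audio/face embedding pairs together could look as follows; the encoders producing the embeddings and the batch construction are assumptions:

```python
# Hypothetical InfoNCE-style objective for maximizing a cross-modal correlation,
# e.g. between timbre embeddings and facial-attribute embeddings of the same speaker.
# The encoders producing `audio_emb` and `face_emb` are assumed, not the paper's.
import torch
import torch.nn.functional as F


def cross_modal_infonce(audio_emb, face_emb, temperature=0.07):
    """audio_emb, face_emb: (batch, dim); row i of each comes from the same speaker."""
    a = F.normalize(audio_emb, dim=-1)
    f = F.normalize(face_emb, dim=-1)
    logits = a @ f.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Matched pairs (the diagonal) are positives; all other pairs are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


loss = cross_modal_infonce(torch.randn(8, 256, requires_grad=True),
                           torch.randn(8, 256, requires_grad=True))
loss.backward()  # gradients flow into whatever trainable encoders produced the embeddings
```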
arXiv Detail & Related papers (2022-07-04T04:53:39Z)
- Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation [32.68710772281511]
We present a self-supervised framework for audio-visual representation learning, to localize the sound source in videos.
Our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound.
This reveals that the proposed framework learns strong multi-modal representations that are beneficial to sound localisation and generalization to further applications.
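For intuition only, the invariance and equivariance constraints named in the title can be expressed as simple consistency losses on the predicted localization heatmap; the `localize` model, the choice of audio augmentation, and the use of horizontal flips are all assumptions for this sketch:

```python
# Hypothetical consistency losses: the heatmap should be invariant to audio
# augmentation and equivariant to spatial transforms of the image (here, a flip).
# `localize(image, audio)` is an assumed model returning a (B, H, W) heatmap.
import torch
import torch.nn.functional as F


def consistency_losses(localize, image, audio, audio_aug):
    heat = localize(image, audio)
    # Invariance: augmenting the audio should not move the predicted source.
    inv_loss = F.mse_loss(localize(image, audio_aug), heat.detach())
    # Equivariance: flipping the image should flip the heatmap in the same way.
    heat_from_flipped = localize(torch.flip(image, dims=[-1]), audio)
    equi_loss = F.mse_loss(heat_from_flipped, torch.flip(heat, dims=[-1]).detach())
    return inv_loss, equi_loss


# Smoke test with a stand-in "model" that just averages image channels.
dummy = lambda img, aud: img.mean(dim=1)          # (B, C, H, W) -> (B, H, W)
inv, equi = consistency_losses(dummy, torch.randn(2, 3, 16, 16),
                               torch.randn(2, 128), torch.randn(2, 128))
print(inv.item(), equi.item())
```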
arXiv Detail & Related papers (2022-06-26T03:00:02Z)
- High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities.
A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z)
- Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
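One generic way to realize self-attention fusion (a sketch under assumed feature dimensions, not one of the paper's three specific variants) is to project both modalities into a shared space, tag them with modality embeddings, and let a joint transformer encoder attend across them:

```python
# Hypothetical self-attention fusion for audiovisual emotion recognition:
# audio and visual frame features are concatenated along the time axis and mixed
# by a shared transformer encoder. Dimensions and pooling are illustrative only.
import torch
import torch.nn as nn


class SelfAttentionFusion(nn.Module):
    def __init__(self, audio_dim=40, video_dim=512, d_model=128, n_classes=7):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        self.modality_emb = nn.Embedding(2, d_model)  # 0 = audio, 1 = video
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, audio, video):
        a = self.audio_proj(audio) + self.modality_emb.weight[0]
        v = self.video_proj(video) + self.modality_emb.weight[1]
        fused = self.encoder(torch.cat([a, v], dim=1))  # joint self-attention
        return self.head(fused.mean(dim=1))             # mean-pool, then classify


model = SelfAttentionFusion()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 25, 512))
print(logits.shape)  # torch.Size([2, 7])
```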
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
- Towards Realistic Visual Dubbing with Heterogeneous Sources [22.250010330418398]
Few-shot visual dubbing involves synchronizing the lip movements with arbitrary speech input for any talking head.
We propose a simple yet efficient two-stage framework with a higher flexibility of mining heterogeneous data.
Our method makes it possible to utilize the training corpus independently for the two sub-networks of the two-stage pipeline.
arXiv Detail & Related papers (2022-01-17T07:57:24Z)
- Using growth transform dynamical systems for spatio-temporal data sonification [9.721342507747158]
Sonification, or encoding information in meaningful audio signatures, has several advantages in augmenting or replacing traditional visualization methods for human-in-the-loop decision-making.
This paper presents a novel framework for sonifying high-dimensional data using a complex growth transform dynamical system model.
Our algorithm takes as input the data and optimization parameters underlying the learning or prediction task and combines them with psychoacoustic parameters defined by the user.
arXiv Detail & Related papers (2021-08-21T16:25:59Z)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR).
We apply MixSpeech to two popular end-to-end speech recognition models, LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
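A minimal sketch of the mixup idea as applied to ASR, assuming equal-length feature pairs and a CTC-style loss as a stand-in for the paper's LAS/Transformer recipes (the function name and stand-in model below are hypothetical):

```python
# Mixup-for-ASR sketch in the spirit of MixSpeech: two feature sequences are mixed
# with weight lambda, and the recognition losses against both label sequences are
# combined with the same weight. Model and loss are assumed stand-ins (CTC here).
import torch
import torch.nn.functional as F


def mixspeech_step(model, feats_a, feats_b, labels_a, labels_b,
                   input_lens, label_lens_a, label_lens_b, alpha=0.5):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * feats_a + (1.0 - lam) * feats_b                 # same shapes assumed
    log_probs = model(mixed).log_softmax(dim=-1).transpose(0, 1)  # (T, B, V) for CTC
    loss_a = F.ctc_loss(log_probs, labels_a, input_lens, label_lens_a)
    loss_b = F.ctc_loss(log_probs, labels_b, input_lens, label_lens_b)
    return lam * loss_a + (1.0 - lam) * loss_b


dummy = torch.nn.Linear(80, 30)  # stand-in acoustic model: (B, T, 80) -> (B, T, 30)
loss = mixspeech_step(dummy,
                      torch.randn(2, 50, 80), torch.randn(2, 50, 80),
                      torch.randint(1, 30, (2, 10)), torch.randint(1, 30, (2, 10)),
                      input_lens=torch.full((2,), 50),
                      label_lens_a=torch.full((2,), 10),
                      label_lens_b=torch.full((2,), 10))
print(loss.item())
```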
arXiv Detail & Related papers (2021-02-25T03:40:43Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling speaker identity from linguistic content.
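As a purely illustrative sketch of what such a two-stream split might look like (the shared front end, the head layout, and the pooling are assumptions, and the cross-modal synchrony objective that actually drives the disentanglement is omitted):

```python
# Hypothetical two-stream layout: a shared low-level front end over the spectrogram
# feeds two heads, one intended for speaker identity (utterance-level) and one for
# linguistic content (frame-level). The training objective is not shown.
import torch
import torch.nn as nn


class TwoStreamSpeechEncoder(nn.Module):
    def __init__(self, n_mels=80, shared_dim=128, emb_dim=64):
        super().__init__()
        self.shared = nn.Sequential(            # low-level features common to both streams
            nn.Conv1d(n_mels, shared_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(shared_dim, shared_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.identity_head = nn.Linear(shared_dim, emb_dim)  # "who is speaking"
        self.content_head = nn.Linear(shared_dim, emb_dim)   # "what is being said"

    def forward(self, mel):                                  # mel: (batch, n_mels, time)
        h = self.shared(mel)                                 # (batch, shared_dim, time)
        identity = self.identity_head(h.mean(dim=-1))        # utterance-level embedding
        content = self.content_head(h.transpose(1, 2))       # per-frame embeddings
        return identity, content


enc = TwoStreamSpeechEncoder()
identity_emb, content_emb = enc(torch.randn(2, 80, 200))
print(identity_emb.shape, content_emb.shape)  # (2, 64) (2, 200, 64)
```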
arXiv Detail & Related papers (2020-02-20T14:13:12Z)