"Notic My Speech" -- Blending Speech Patterns With Multimedia
- URL: http://arxiv.org/abs/2006.08599v1
- Date: Fri, 12 Jun 2020 06:51:55 GMT
- Title: "Notic My Speech" -- Blending Speech Patterns With Multimedia
- Authors: Dhruva Sahrawat, Yaman Kumar, Shashwat Aggarwal, Yifang Yin, Rajiv
Ratn Shah and Roger Zimmermann
- Abstract summary: We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperforms existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
- Score: 65.91370924641862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech as a natural signal is composed of three parts: visemes (the visual
part of speech), phonemes (the spoken part of speech), and language (the imposed
structure). However, video, as a medium for the delivery of speech and a multimedia
construct, has mostly ignored the cognitive aspects of speech delivery. For example,
video applications such as transcoding and compression have so far ignored how speech
is delivered and heard. To close the gap between speech understanding and multimedia
video applications, in this paper we present initial experiments that model the
perception of visual speech and demonstrate its use case in video compression. In the
visual speech recognition domain, on the other hand, existing studies have mostly
modelled the task as a classification problem while ignoring the correlations between
views, phonemes, visemes, and speech perception. This results in solutions that are
far removed from how human perception works. To bridge this gap, we propose a
view-temporal attention mechanism that models both the view dependence and the
visemic importance in speech recognition and understanding. We conduct experiments on
three public visual speech recognition datasets. The experimental results show that
our proposed method outperforms existing work by 4.99% in terms of the viseme error
rate. Moreover, we show that there is a strong correlation between our model's
understanding of multi-view speech and human perception. This characteristic benefits
downstream applications such as video compression and streaming, where a significant
number of less important frames can be compressed or eliminated while maximally
preserving human speech understanding and a good user experience.
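The abstract describes a view-temporal attention mechanism whose frame-level importance can guide frame dropping for compression, and it reports results in viseme error rate. No code accompanies this listing, so the following is a minimal, self-contained sketch of those two ideas; the function names, feature shapes, dot-product attention form, and keep-ratio heuristic are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (assumed shapes and parameters, not the paper's code).
# Idea 1: attention over frames within each view and over views, whose product
#         gives a per-frame importance score usable for frame dropping.
# Idea 2: viseme error rate as Levenshtein distance over viseme sequences,
#         analogous to word error rate.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def view_temporal_attention(features, temporal_query, view_query):
    """features: (V, T, D) frame embeddings for V views and T frames.
    temporal_query, view_query: (D,) query vectors (assumed learned elsewhere).
    Returns per-frame importance (V, T) and a fused utterance embedding (D,)."""
    V, T, D = features.shape
    # Temporal attention: how much each frame matters within its view.
    temporal_w = softmax(features @ temporal_query / np.sqrt(D), axis=1)   # (V, T)
    view_summary = (temporal_w[..., None] * features).sum(axis=1)          # (V, D)
    # View attention: how much each view matters for the utterance.
    view_w = softmax(view_summary @ view_query / np.sqrt(D), axis=0)       # (V,)
    fused = (view_w[:, None] * view_summary).sum(axis=0)                   # (D,)
    frame_importance = view_w[:, None] * temporal_w                        # (V, T)
    return frame_importance, fused

def keep_mask(frame_importance, keep_ratio=0.6):
    """Mark the most important frames (aggregated over views) to keep;
    the rest become candidates for heavier compression or removal."""
    per_frame = frame_importance.sum(axis=0)                               # (T,)
    k = max(1, int(round(keep_ratio * per_frame.size)))
    keep = np.zeros_like(per_frame, dtype=bool)
    keep[np.argsort(per_frame)[-k:]] = True
    return keep

def viseme_error_rate(ref, hyp):
    """Levenshtein distance between viseme sequences, normalised by len(ref)."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(ref), len(hyp)] / max(1, len(ref))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(3, 75, 64))      # 3 views, 75 frames, 64-d features
    imp, fused = view_temporal_attention(feats, rng.normal(size=64), rng.normal(size=64))
    print("frames kept:", keep_mask(imp).sum(), "/", feats.shape[1])
    print("VER example:", viseme_error_rate(list("AABCC"), list("ABCC")))
```

In the paper's setting the importance weights would come from a trained recognition model; random features here only exercise the shapes and show how importance scores could feed a frame-selection step.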
Related papers
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data.
arXiv Detail & Related papers (2024-09-23T20:19:24Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902]
Speech-to-video generation techniques can spark interesting applications in the entertainment, customer service, and human-computer interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a lightweight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
arXiv Detail & Related papers (2021-07-10T10:27:26Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and the visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- AudioViewer: Learning to Visualize Sound [12.71759722609666]
We aim to create sound perception for hearing-impaired people, for instance, to facilitate feedback for training deaf speech.
Our design translates from audio to video by compressing both into a common latent space with shared structure.
arXiv Detail & Related papers (2020-12-22T21:52:45Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.