Joint Audio and Speech Understanding
- URL: http://arxiv.org/abs/2309.14405v3
- Date: Sun, 10 Dec 2023 22:50:21 GMT
- Title: Joint Audio and Speech Understanding
- Authors: Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, James Glass
- Abstract summary: We build a machine learning model, called LTU-AS, with universal audio perception and advanced reasoning abilities conceptually similar to those of humans.
By integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events.
- Score: 81.34673662385774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans are surrounded by audio signals that include both speech and
non-speech sounds. The recognition and understanding of speech and non-speech
audio events, along with a profound comprehension of the relationship between
them, constitute fundamental cognitive capabilities. For the first time, we
build a machine learning model, called LTU-AS, that has a conceptually similar
universal audio perception and advanced reasoning ability. Specifically, by
integrating Whisper as a perception module and LLaMA as a reasoning module,
LTU-AS can simultaneously recognize and jointly understand spoken text, speech
paralinguistics, and non-speech audio events - almost everything perceivable
from audio signals.
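To make the perception-plus-reasoning design concrete, the sketch below wires a Whisper encoder into a causal LLM in the spirit of LTU-AS, using the Hugging Face transformers API. The checkpoint names, the untrained linear projector, and the prompt format are assumptions for illustration, not the paper's released implementation.

```python
# Minimal sketch of a Whisper-perception + LLaMA-reasoning pipeline in the
# spirit of LTU-AS. Checkpoint names, the untrained projector, and the prompt
# are illustrative assumptions, not the paper's released implementation.
import soundfile as sf
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    WhisperFeatureExtractor,
    WhisperModel,
)

# Perception module: the Whisper encoder turns raw audio into acoustic features.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").eval()

# Reasoning module: a decoder-only LLM (placeholder checkpoint name).
llm_id = "meta-llama/Llama-2-7b-hf"  # assumption: any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(llm_id)
llm = AutoModelForCausalLM.from_pretrained(llm_id).eval()

# Projector mapping Whisper's hidden size to the LLM embedding size.
# In LTU-AS this component is trained; here it is a random stand-in.
projector = torch.nn.Linear(whisper.config.d_model, llm.config.hidden_size)

audio, sr = sf.read("clip.wav")  # assumed: a 16 kHz mono waveform
features = feature_extractor(audio, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # (1, frames, d_model) acoustic representation from the Whisper encoder.
    audio_states = whisper.encoder(features.input_features).last_hidden_state
    audio_embeds = projector(audio_states)

    # Prepend the projected audio embeddings to an instruction prompt so the
    # LLM can answer questions about spoken content, paralinguistics, or
    # non-speech sound events in the clip.
    prompt = "Describe the speech and any background sounds in this clip."
    text_ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_embeds = llm.get_input_embeddings()(text_ids)
    inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)

    # generate() with inputs_embeds needs a reasonably recent transformers release.
    out_ids = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=64)
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
```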
Related papers
- What Are They Doing? Joint Audio-Speech Co-Reasoning [10.957451368533302]
Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model.
We introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing.
We establish a joint audio-speech benchmark to evaluate the joint reasoning capability of popular ALLMs.
arXiv Detail & Related papers (2024-09-22T16:45:57Z)
- SALMONN: Towards Generic Hearing Abilities for Large Language Models [24.73033723114979]
We propose SALMONN, a speech audio language music open neural network.
It is built by integrating a pre-trained text-based large language model (LLM) with speech and audio encoders into a single multimodal model.
It is the first model of its type and can be regarded as a step towards AI with generic hearing abilities.
arXiv Detail & Related papers (2023-10-20T05:41:57Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z)
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.