What Are They Doing? Joint Audio-Speech Co-Reasoning
- URL: http://arxiv.org/abs/2409.14526v1
- Date: Sun, 22 Sep 2024 16:45:57 GMT
- Title: What Are They Doing? Joint Audio-Speech Co-Reasoning
- Authors: Yingzhi Wang, Pooneh Mousavi, Artem Ploujnikov, Mirco Ravanelli
- Abstract summary: Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model.
We introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing.
We establish a joint audio-speech benchmark to evaluate the joint reasoning capability of popular ALLMs.
- Score: 10.957451368533302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In audio and speech processing, tasks usually focus on either the audio or speech modality, even when both sounds and human speech are present in the same audio clip. Recent Auditory Large Language Models (ALLMs) have made it possible to process audio and speech simultaneously within a single model, leading to further considerations of joint audio-speech tasks. In this paper, we investigate how well ALLMs can perform joint audio-speech processing. Specifically, we introduce Joint Audio-Speech Co-Reasoning (JASCO), a novel task that unifies audio and speech processing, strictly requiring co-reasoning across both modalities. We release a scene-reasoning dataset called "What Are They Doing" and establish a joint audio-speech benchmark to evaluate the joint reasoning capability of popular ALLMs. Additionally, we provide deeper insights into the models' behaviors by analyzing their dependence on each modality.
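As a concrete illustration of how the modality-dependence analysis described above could be run, the sketch below compares a model's accuracy when it sees both modalities, the background sounds only, or the speech only. Everything in it is an assumption for illustration: the `SceneItem` fields, the `query_allm` callable, and the containment-based judge are placeholders, not the paper's actual benchmark format or metric.

```python
# Hypothetical sketch of a modality-dependence check for an Auditory LLM
# on a JASCO-style scene-reasoning item. All names and the protocol are
# illustrative assumptions, not the released benchmark.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SceneItem:
    mixed_clip: str        # clip containing both speech and background sounds
    audio_only_clip: str   # same scene with the speech removed
    speech_only_clip: str  # same scene with the background sounds removed
    question: str          # e.g. "What are the people in this clip doing?"
    reference: str         # gold activity label, e.g. "playing tennis"


def evaluate_dependence(
    items: List[SceneItem],
    query_allm: Callable[[str, str], str],  # (clip_path, question) -> answer
    judge: Callable[[str, str], bool],      # (answer, reference) -> correct?
) -> Dict[str, float]:
    """Accuracy with the full input vs. single-modality ablations."""
    conditions = {
        "both": lambda it: it.mixed_clip,
        "audio_only": lambda it: it.audio_only_clip,
        "speech_only": lambda it: it.speech_only_clip,
    }
    hits = {name: 0 for name in conditions}
    for item in items:
        for name, pick in conditions.items():
            answer = query_allm(pick(item), item.question)
            hits[name] += judge(answer, item.reference)
    return {name: count / len(items) for name, count in hits.items()}


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    items = [SceneItem("mix.wav", "sounds.wav", "speech.wav",
                       "What are they doing?", "playing tennis")]
    fake_model = lambda clip, question: "They are playing tennis."
    contains = lambda answer, reference: reference.lower() in answer.lower()
    print(evaluate_dependence(items, fake_model, contains))
```

Under this kind of protocol, a large accuracy drop when one modality is removed would suggest the model leans on that modality, while near-identical scores would suggest it is being ignored.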
Related papers
- AudioChatLlama: Towards General-Purpose Speech Abilities for LLMs [27.122094554340194]
We extend the instruction-tuned Llama-2 model with end-to-end general-purpose speech processing and reasoning abilities.
The resulting end-to-end model, named AudioChatLlama, can utilize audio prompts as a replacement for text and sustain a conversation.
arXiv Detail & Related papers (2023-11-12T06:56:14Z)
- Joint Audio and Speech Understanding [81.34673662385774]
We build a machine learning model, called LTU-AS, whose universal audio perception and advanced reasoning ability are conceptually similar to those of human listeners.
By integrating Whisper as a perception module and LLaMA as a reasoning module, LTU-AS can simultaneously recognize and jointly understand spoken text, speech paralinguistics, and non-speech audio events; a rough cascaded sketch of this perception-plus-reasoning pattern appears after the list below.
arXiv Detail & Related papers (2023-09-25T17:59:05Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, which is the task of synthesizing the speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z)
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
- AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied field of automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
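For the LTU-AS entry above, which pairs Whisper as a perception module with LLaMA as a reasoning module, the following is a rough cascaded sketch of that perception-plus-reasoning pattern. It is not the actual LTU-AS design, which injects Whisper encoder features into LLaMA through a learned projection rather than chaining text; `describe_sound_events` and `reason_with_llm` are hypothetical placeholders.

```python
# Rough cascaded approximation of the Whisper-as-perception / LLM-as-reasoner
# idea. Requires `pip install openai-whisper` and a real audio file.
# `describe_sound_events` and `reason_with_llm` are placeholders to swap for
# an audio-event tagger and an instruction-tuned LLM of your choice.
import whisper


def describe_sound_events(audio_path: str) -> str:
    # Placeholder: plug in an audio-event tagger (e.g. an AudioSet classifier).
    return "crowd cheering, ball bouncing"


def reason_with_llm(prompt: str) -> str:
    # Placeholder: call any instruction-tuned LLM with the composed prompt.
    return "They appear to be commentating on a live tennis match."


def joint_understanding(audio_path: str, question: str) -> str:
    asr = whisper.load_model("base")                    # speech perception
    transcript = asr.transcribe(audio_path)["text"].strip()
    sounds = describe_sound_events(audio_path)          # non-speech perception
    prompt = (
        f'Spoken content: "{transcript}"\n'
        f"Background sounds: {sounds}\n"
        f"Question: {question}\n"
        "Answer using both the speech and the sounds."
    )
    return reason_with_llm(prompt)


if __name__ == "__main__":
    print(joint_understanding("scene.wav", "What are they doing?"))
```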