OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs
- URL: http://arxiv.org/abs/2503.21480v2
- Date: Fri, 28 Mar 2025 12:34:25 GMT
- Title: OmniVox: Zero-Shot Emotion Recognition with Omni-LLMs
- Authors: John Murzaku, Owen Rambow
- Abstract summary: We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks, IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of omni-LLMs (large language models that accept any modality as input), particularly for multimodal cognitive state tasks involving speech, is understudied. We present OmniVox, the first systematic evaluation of four omni-LLMs on the zero-shot emotion recognition task. We evaluate on two widely used multimodal emotion benchmarks, IEMOCAP and MELD, and find zero-shot omni-LLMs outperform or are competitive with fine-tuned audio models. Alongside our audio-only evaluation, we also evaluate omni-LLMs on text-only and combined text-and-audio inputs. We present acoustic prompting, an audio-specific prompting strategy for omni-LLMs which focuses on acoustic feature analysis, conversation context analysis, and step-by-step reasoning. We compare our acoustic prompting to minimal prompting and full chain-of-thought prompting techniques. We perform a context window analysis on IEMOCAP and MELD, and find that using context helps, especially on IEMOCAP. We conclude with an error analysis on the generated acoustic reasoning outputs from the omni-LLMs.
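The abstract describes acoustic prompting only at a high level (acoustic feature analysis, conversation context analysis, step-by-step reasoning), so the following is a minimal sketch of what such a prompt could look like for an omni-LLM. The prompt wording, label set, the `build_acoustic_prompt` helper, and the message-payload layout are illustrative assumptions, not the authors' exact prompts or code.

```python
import base64

def build_acoustic_prompt(context_utterances, labels=("angry", "happy", "sad", "neutral")):
    """Assemble an acoustic-prompting instruction in the spirit of OmniVox:
    acoustic feature analysis, conversation context analysis, and
    step-by-step reasoning before committing to a single emotion label.
    (Hypothetical template; the paper's actual prompts may differ.)"""
    context_block = "\n".join(f"- {u}" for u in context_utterances) or "- (no prior context)"
    return (
        "You will hear one utterance from a conversation.\n"
        "Prior utterances (text transcripts):\n"
        f"{context_block}\n\n"
        "Step 1: Describe the acoustic features of the audio "
        "(pitch, energy, speaking rate, voice quality).\n"
        "Step 2: Relate those features to the conversational context above.\n"
        "Step 3: Reason step by step, then answer with exactly one label from: "
        + ", ".join(labels) + "."
    )

if __name__ == "__main__":
    prompt = build_acoustic_prompt(["A: I waited two hours and nobody called me back."])

    # Hypothetical clip from IEMOCAP/MELD; replace with a real path.
    with open("utterance.wav", "rb") as f:
        audio_b64 = base64.b64encode(f.read()).decode("ascii")

    # A generic omni-LLM message payload: one text part (the prompt) plus one
    # base64-encoded audio part. The exact schema depends on the model's API.
    messages = [{"role": "user",
                 "content": [{"type": "text", "text": prompt},
                             {"type": "input_audio",
                              "input_audio": {"data": audio_b64, "format": "wav"}}]}]
    print(prompt)
```

In the zero-shot setting evaluated by the paper, a prompt of this shape would be sent once per utterance, optionally with a window of preceding turns filled into the context block for the context-window analysis.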
Related papers
- ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems [57.806797579986075]
We introduce an open-source, user-friendly toolkit to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Using the evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies.
arXiv Detail & Related papers (2025-03-11T15:24:02Z) - VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models [32.086847480051084]
We present VoxEval, a novel SpeechQA benchmark that assesses knowledge understanding through pure speech interactions. Our benchmark 1) maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format.
arXiv Detail & Related papers (2025-01-09T04:30:12Z) - OmniSep: Unified Omni-Modality Sound Separation with Query-Mixup [50.70494796172493]
We introduce Omni-modal Sound Separation (OmniSep), a novel framework capable of isolating clean soundtracks based on omni-modal queries.
We introduce the Query-Mixup strategy, which blends query features from different modalities during training.
We further enhance this flexibility by allowing queries to influence sound separation positively or negatively, facilitating the retention or removal of specific sounds.
arXiv Detail & Related papers (2024-10-28T17:58:15Z) - OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities [124.05360767047539]
We introduce OmnixR, an evaluation suite designed to benchmark SoTA Omni-modality Language Models.
Evaluating OLMs, which integrate multiple modalities such as text, vision, and audio, presents unique challenges.
Our experiments find that all state-of-the-art OLMs struggle with OmnixR questions that require integrating information from multiple modalities to answer.
arXiv Detail & Related papers (2024-10-16T04:29:46Z) - Where are we in audio deepfake detection? A systematic analysis over generative and detection models [59.09338266364506]
SONAR is a synthetic AI-audio detection framework and benchmark. It provides a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content. It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z) - Beyond Single-Audio: Advancing Multi-Audio Processing in Audio Large Language Models [56.776580717999806]
Real-world applications often involve processing multiple audio streams simultaneously.
We propose the first multi-audio evaluation benchmark that consists of 20 datasets from 11 multi-audio tasks.
We propose a novel multi-audio-LLM (MALLM) to capture audio context among multiple similar audios.
arXiv Detail & Related papers (2024-09-27T12:06:53Z) - AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension [95.8442896569132]
We introduce AIR-Bench, the first benchmark to evaluate the ability of Large Audio-Language Models (LALMs) to understand various types of audio signals and interact with humans in the textual format.
Results demonstrate a high level of consistency between GPT-4-based evaluation and human evaluation.
arXiv Detail & Related papers (2024-02-12T15:41:22Z) - Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information [21.527784717450885]
Speech emotion recognition aims to help machines understand humans' subjective emotions from audio information alone.
We propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.
arXiv Detail & Related papers (2022-03-29T08:17:28Z) - Joint Speech Recognition and Audio Captioning [37.205642807313545]
Speech samples recorded in both indoor and outdoor environments are often contaminated with secondary audio sources.
We aim to bring together the growing field of automated audio captioning (AAC) and the thoroughly studied automatic speech recognition (ASR).
We propose several approaches for end-to-end joint modeling of ASR and AAC tasks.
arXiv Detail & Related papers (2022-02-03T04:42:43Z)