Audio-Aware Large Language Models as Judges for Speaking Styles
- URL: http://arxiv.org/abs/2506.05984v1
- Date: Fri, 06 Jun 2025 11:05:48 GMT
- Title: Audio-Aware Large Language Models as Judges for Speaking Styles
- Authors: Cheng-Han Chiang, Xiaofei Wang, Chung-Ching Lin, Kevin Lin, Linjie Li, Radu Kopetz, Yao Qian, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
- Abstract summary: We explore using audio-aware large language models (ALLMs) as an automatic judge to assess the speaking styles of speeches. We use four spoken language models (SLMs) to complete the two tasks and use humans and ALLMs to judge the SLMs' responses. Our results show that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
- Score: 123.36224336701237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-aware large language models (ALLMs) can understand the textual and non-textual information in the audio input. In this paper, we explore using ALLMs as automatic judges to assess the speaking styles of speeches. We use ALLM judges to evaluate the speeches generated by spoken language models (SLMs) on two tasks: voice style instruction following and role-playing. The speaking styles we consider include emotion, volume, speaking pace, word emphasis, pitch control, and non-verbal elements. We use four SLMs to complete the two tasks and use both humans and ALLMs to judge the SLMs' responses. We compare two ALLM judges, GPT-4o-audio and Gemini-2.5-pro, against human evaluation results and show that the agreement between Gemini and human judges is comparable to the agreement between human evaluators. These promising results show that ALLMs can be used as judges to evaluate SLMs. Our results also reveal that current SLMs, even GPT-4o-audio, still have room for improvement in controlling the speaking style and generating natural dialogues.
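To make the judge-agreement comparison concrete, below is a minimal sketch under stated assumptions: a 1-5 rating scale, a hypothetical query_allm_judge() wrapper, and Cohen's kappa as the agreement metric. The paper's actual prompts, rating scale, and agreement statistic may differ.

```python
# Minimal sketch of the judge-agreement comparison described above, assuming
# a 1-5 rating scale; query_allm_judge() is a hypothetical wrapper around an
# audio-capable chat API, not the paper's actual harness.
from sklearn.metrics import cohen_kappa_score

def query_allm_judge(audio_path: str, instruction: str) -> int:
    """Hypothetical: send a speech clip plus its style instruction to an
    ALLM judge (e.g., GPT-4o-audio or Gemini-2.5-pro) and parse a 1-5 score
    for how well the speech follows the instruction."""
    raise NotImplementedError("replace with a real ALLM API call")

# Ratings of the same SLM-generated speeches (toy numbers for illustration);
# in practice: allm = [query_allm_judge(clip, instr) for clip, instr in items]
human_a = [4, 3, 5, 2, 4]   # first human evaluator
human_b = [4, 2, 5, 3, 4]   # second human evaluator
allm    = [4, 3, 4, 2, 4]   # ALLM judge on the identical clips

# The paper's headline check: judge-human agreement should fall in the same
# range as human-human agreement.
print("human-human kappa:", cohen_kappa_score(human_a, human_b))
print("ALLM-human kappa: ", cohen_kappa_score(allm, human_a))
```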
Related papers
- Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation [17.330188045948663]
We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. This task involves a unique set of cognitive abilities that have previously received limited attention in systematic benchmarking. We leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task.
arXiv Detail & Related papers (2025-06-05T14:06:51Z)
- Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models [58.43486430996411]
Large Audio-Language Models (LALMs) have unlocked audio dialogue capabilities, where audio dialogues are a direct exchange of spoken language between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs to hold back-and-forth audio dialogues with humans. We propose an Audio Dialogue Understanding Benchmark (ADU-Bench) to evaluate the performance of LALMs in open-ended audio dialogue understanding.
arXiv Detail & Related papers (2024-12-06T16:34:15Z)
- Beyond Silent Letters: Amplifying LLMs in Emotion Recognition with Vocal Nuances [3.396456345114466]
We propose SpeechCueLLM, a method that translates speech characteristics into natural language descriptions. We evaluate SpeechCueLLM on two datasets, IEMOCAP and MELD, showing significant improvements in emotion recognition accuracy.
arXiv Detail & Related papers (2024-07-31T03:53:14Z)
- Can LLMs Understand the Implication of Emphasized Sentences in Dialogue? [64.72966061510375]
Emphasis is a crucial component of human communication, indicating the speaker's intention and implication beyond the pure text of a dialogue.
This paper introduces Emphasized-Talk, a benchmark with emphasis-annotated dialogue samples capturing the implications of emphasis.
We evaluate various Large Language Models (LLMs), both open-source and commercial, to measure their performance in understanding emphasis.
arXiv Detail & Related papers (2024-06-16T20:41:44Z)
- Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations [65.29513437838457]
Even when two dialogue turns contain the same sentence, the appropriate responses can still differ if the turns are spoken in different styles.
We propose the Spoken-LLM framework, which models both the linguistic content and the speaking styles.
We train Spoken-LLM on the StyleTalk dataset and devise a two-stage training pipeline to help it better learn the speaking styles.
arXiv Detail & Related papers (2024-02-20T07:51:43Z)
- BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue based on the ChatSEED, utterance by utterance.
We find that GPT-4 can generate human-style multi-turn dialogues with impressive quality, significantly outperforming its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z)
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on spoken language understanding (SLU) tasks; a toy interleaving sketch follows this list.
arXiv Detail & Related papers (2023-10-12T20:53:39Z)
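As a rough illustration of what "mixing speech units and text" can mean in the joint language modeling entry above, here is a toy sketch; the sentinel tokens and interleaving scheme are assumptions for illustration, not the paper's actual mixing techniques.

```python
# Toy sketch: build a single token stream that mixes discrete speech units
# with text tokens, marking each modality with a sentinel token so a joint
# LM can tell them apart. All token IDs here are made up for illustration.
from typing import List

TEXT_TAG = 1   # assumed sentinel announcing a text span
UNIT_TAG = 2   # assumed sentinel announcing a speech-unit span

def interleave(text_ids: List[int], unit_ids: List[int]) -> List[int]:
    """Concatenate a text span and a speech-unit span, each preceded by
    its modality sentinel, into one training sequence."""
    return [TEXT_TAG, *text_ids, UNIT_TAG, *unit_ids]

# Example: text tokens [10, 11, 12] followed by speech units [901, 902]
print(interleave([10, 11, 12], [901, 902]))
# -> [1, 10, 11, 12, 2, 901, 902]
```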
This list is automatically generated from the titles and abstracts of the papers on this site.