WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
- URL: http://arxiv.org/abs/2505.09558v1
- Date: Wed, 14 May 2025 16:54:15 GMT
- Title: WavReward: Spoken Dialogue Models With Generalist Reward Evaluators
- Authors: Shengpeng Ji, Tianle Liang, Yangzhuo Li, Jialong Zuo, Minghui Fang, Jinzheng He, Yifu Chen, Zhengqing Liu, Ziyue Jiang, Xize Cheng, Siqi Zheng, Jin Xu, Junyang Lin, Zhou Zhao,
- Abstract summary: WavReward is a reward feedback model that can evaluate IQ and EQ of spoken dialogue systems with speech input.<n>ChatReward-30K is a preference dataset used to train WavReward.<n>WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios.
- Score: 57.80264359636158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end spoken dialogue models such as GPT-4o-audio have recently garnered significant attention in the speech domain. However, the evaluation of spoken dialogue models' conversational performance has largely been overlooked. This is primarily due to the intelligent chatbots convey a wealth of non-textual information which cannot be easily measured using text-based language models like ChatGPT. To address this gap, we propose WavReward, a reward feedback model based on audio language models that can evaluate both the IQ and EQ of spoken dialogue systems with speech input. Specifically, 1) based on audio language models, WavReward incorporates the deep reasoning process and the nonlinear reward mechanism for post-training. By utilizing multi-sample feedback via the reinforcement learning algorithm, we construct a specialized evaluator tailored to spoken dialogue models. 2) We introduce ChatReward-30K, a preference dataset used to train WavReward. ChatReward-30K includes both comprehension and generation aspects of spoken dialogue models. These scenarios span various tasks, such as text-based chats, nine acoustic attributes of instruction chats, and implicit chats. WavReward outperforms previous state-of-the-art evaluation models across multiple spoken dialogue scenarios, achieving a substantial improvement about Qwen2.5-Omni in objective accuracy from 55.1$\%$ to 91.5$\%$. In subjective A/B testing, WavReward also leads by a margin of 83$\%$. Comprehensive ablation studies confirm the necessity of each component of WavReward. All data and code will be publicly at https://github.com/jishengpeng/WavReward after the paper is accepted.
Related papers
- ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching [22.477986192421767]
We introduce ZipVoice-Dialog, a non-autoregressive spoken dialogue generation model built upon flow matching.<n>Key designs include speaker-turn embeddings for precise speaker turn-taking.<n>We curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data.
arXiv Detail & Related papers (2025-07-12T15:18:47Z) - WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented
Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z) - GODEL: Large-Scale Pre-Training for Goal-Directed Dialog [119.1397031992088]
We introduce GODEL, a large pre-trained language model for dialog.
We show that GODEL outperforms state-of-the-art pre-trained dialog models in few-shot fine-tuning setups.
A novel feature of our evaluation methodology is the introduction of a notion of utility that assesses the usefulness of responses.
arXiv Detail & Related papers (2022-06-22T18:19:32Z) - CloneBot: Personalized Dialogue-Response Predictions [0.0]
The project task was to create a model that, given a speaker ID, chat history, and an utterance query, can predict the response utterance in a conversation.
The model is personalized for each speaker. This task can be a useful tool for building speech bots that talk in a human-like manner in a live conversation.
arXiv Detail & Related papers (2021-03-31T01:15:37Z) - The Adapter-Bot: All-In-One Controllable Conversational Model [66.48164003532484]
We propose a dialogue model that uses a fixed backbone model such as DialGPT and triggers on-demand dialogue skills via different adapters.
Depending on the skills, the model is able to process multiple knowledge types, such as text, tables, and emphatic responses.
We evaluate our model using automatic evaluation by comparing it with existing state-of-the-art conversational models.
arXiv Detail & Related papers (2020-08-28T10:59:31Z) - Speaker Sensitive Response Evaluation Model [17.381658875470638]
We propose an automatic evaluation model based on the similarity of the generated response with the conversational context.
We learn the model parameters from an unlabeled conversation corpus.
We show that our model can be applied to movie dialogues without any additional training.
arXiv Detail & Related papers (2020-06-12T08:59:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.