Related papers: The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era

The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era

URL: http://arxiv.org/abs/2601.05564v1
Date: Fri, 09 Jan 2026 06:32:30 GMT
Title: The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era
Authors: Zhixian Zhao, Shuiyuan Wang, Guojian Li, Hongfei Xue, Chengyou Wang, Shuai Wang, Longshuai Xiao, Zihan Zhang, Hui Bu, Xin Xu, Xinsheng Wang, Hexin Liu, Eng Siong Chng, Hung-yi Lee, Haizhou Li, Lei Xie,
Abstract summary: We launch the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026.<n>This paper summarizes the dataset, track configurations, and the final results.
Score: 95.35748535806744
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interactions. Achieving truly ``human-like'' communication necessitates a dual capability: emotional intelligence to perceive and resonate with users' emotional states, and robust interaction mechanisms to navigate the dynamic, natural flow of conversation, such as real-time turn-taking. Therefore, we launched the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026 to benchmark these dual capabilities. Anchored by a sizable dataset derived from authentic human conversations, this initiative establishes a fair evaluation platform across two tracks: (1) Emotional Intelligence, targeting long-term emotion understanding and empathetic generation; and (2) Full-Duplex Interaction, systematically evaluating real-time decision-making under `` listening-while-speaking'' conditions. This paper summarizes the dataset, track configurations, and the final results.

Related papers

Decoding Ambiguous Emotions with Test-Time Scaling in Audio-Language Models [18.059483722792077]
We introduce the first benchmark for ambiguous emotion recognition in speech with ALMs under test-time scaling.<n>Our evaluation systematically compares eight state-of-the-art ALMs and five TTS strategies across three prominent speech emotion datasets.<n>Our benchmark establishes a foundation for developing more robust, context-aware, and emotionally intelligent speech-based AI systems.
arXiv Detail & Related papers (2026-02-01T07:41:57Z)
A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction [50.05919688888947]
This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT)<n>IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision.<n> Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation.
arXiv Detail & Related papers (2026-01-08T14:07:30Z)
Chronological Thinking in Full-Duplex Spoken Dialogue Language Models [66.84843878538207]
Chronological Thinking aims to improve response quality in full SDLMs.<n>No additional latency: once the user stops speaking, the agent halts thinking and begins speaking without further delay.<n>Results: Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations.
arXiv Detail & Related papers (2025-10-02T10:28:11Z)
FLEXI: Benchmarking Full-duplex Human-LLM Speech Interaction [49.83226596963294]
Speech-computer human interaction enables real-time spoken dialogue systems.<n>Modelling and benchmarking these models remains a fundamental challenge.<n>We introduce FLEXI, the first benchmark for full-human spoken interaction.
arXiv Detail & Related papers (2025-09-26T11:57:42Z)
Are You Listening to Me? Fine-Tuning Chatbots for Empathetic Dialogue [0.5849783371898033]
We explore how Large Language Models (LLMs) respond when tasked with generating emotionally rich interactions.<n>We analyzed the emotional progression of the dialogues using both sentiment analysis (via VADER) and expert assessments.
arXiv Detail & Related papers (2025-07-03T11:32:41Z)
Full-Duplex-Bench: A Benchmark to Evaluate Full-duplex Spoken Dialogue Models on Turn-taking Capabilities [93.09944267871163]
FullDuplexBench is a benchmark that systematically evaluates key interactive behaviors.<n>By releasing our benchmark code we aim to advance spoken dialogue modeling and the development of more natural and engaging SDMs.
arXiv Detail & Related papers (2025-03-06T18:59:16Z)
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation [51.97224538045096]
We introduce REALTALK, a 21-day corpus of authentic messaging app dialogues.<n>We compare EI attributes and persona consistency to understand the challenges posed by real-world dialogues.<n>Our findings reveal that models struggle to simulate a user solely from dialogue history, while fine-tuning on specific user chats improves persona emulation.
arXiv Detail & Related papers (2025-02-18T20:29:01Z)
CAPE: A Chinese Dataset for Appraisal-based Emotional Generation using Large Language Models [30.40159858361768]
We introduce a two-stage automatic data generation framework to create CAPE, a Chinese dataset named Cognitive Appraisal theory-based Emotional corpus. This corpus facilitates the generation of dialogues with contextually appropriate emotional responses by accounting for diverse personal and situational factors. Our study shows the potential for advancing emotional expression in conversational agents, paving the way for more nuanced and meaningful human-computer interactions.
arXiv Detail & Related papers (2024-10-18T03:33:18Z)
Intelligent Conversational Android ERICA Applied to Attentive Listening and Job Interview [41.789773897391605]
We have developed an intelligent conversational android ERICA. We set up several social interaction tasks for ERICA, including attentive listening, job interview, and speed dating. It has been evaluated with 40 senior people, engaged in conversation of 5-7 minutes without a conversation breakdown.
arXiv Detail & Related papers (2021-05-02T06:37:23Z)
Retrieval Augmentation Reduces Hallucination in Conversation [49.35235945543833]
We explore the use of neural-retrieval-in-the-loop architectures for knowledge-grounded dialogue. We show that our best models obtain state-of-the-art performance on two knowledge-grounded conversational tasks.
arXiv Detail & Related papers (2021-04-15T16:24:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.