Related papers: AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

URL: http://arxiv.org/abs/2511.11124v1
Date: Fri, 14 Nov 2025 09:56:26 GMT
Title: AV-Dialog: Spoken Dialogue Models with Audio-Visual Input
Authors: Tuochao Chen, Bandhav Veluri, Hongyu Gong, Shyamnath Gollakota,
Abstract summary: We present AV-Dialog, the first framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses.<n>Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality.<n>These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
Score: 16.289812372606168
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for {spoken} dialogue agents that perform {robustly} in real-world, noisy environments.

Related papers

Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture.<n>Covo-Audio-Chat, a dialogue-oriented variant, demonstrates semantic strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z)
Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction [12.216811577733125]
We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns.<n>We introduce a new axis Voice Editing that tests robustness to mid-utterance speech repairs and backtracking.<n>We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline.
arXiv Detail & Related papers (2025-12-16T19:26:44Z)
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching [22.477986192421767]
We introduce ZipVoice-Dialog, a non-autoregressive spoken dialogue generation model built upon flow matching.<n>Key designs include speaker-turn embeddings for precise speaker turn-taking.<n>We curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data.
arXiv Detail & Related papers (2025-07-12T15:18:47Z)
Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on realtime conversations from user interactions.<n>We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback.<n>Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z)
CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching [78.01028753403575]
CoVoMix2 is a framework for zero-shot multi-talker dialogue generation.<n>It predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model.<n>Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed.
arXiv Detail & Related papers (2025-06-01T07:51:45Z)
Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response. We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
Multimodal Contextual Dialogue Breakdown Detection for Conversational AI Models [1.4199474167684119]
We introduce a Multimodal Contextual Dialogue Breakdown (MultConDB) model. This model significantly outperforms other known best models by achieving an F1 of 69.27.
arXiv Detail & Related papers (2024-04-11T23:09:18Z)
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations. We autoregressively output multiple possibilities of corresponding listener motion. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
A Speaker-aware Parallel Hierarchical Attentive Encoder-Decoder Model for Multi-turn Dialogue Generation [13.820298189734686]
This paper presents a novel open-domain dialogue generation model emphasizing the differentiation of speakers in multi-turn conversations. Our empirical results show that PHAED outperforms the state-of-the-art in both automatic and human evaluations.
arXiv Detail & Related papers (2021-10-13T16:08:29Z)
Enhanced Speaker-aware Multi-party Multi-turn Dialogue Comprehension [43.352833140317486]
Multi-party multi-turn dialogue comprehension brings unprecedented challenges. Most existing methods deal with dialogue contexts as plain texts. We propose an enhanced speaker-aware model with masking attention and heterogeneous graph networks.
arXiv Detail & Related papers (2021-09-09T07:12:22Z)
Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles. In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely. We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.