Related papers: OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

URL: http://arxiv.org/abs/2505.21724v2
Date: Tue, 28 Oct 2025 14:26:23 GMT
Title: OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
Authors: Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, Bernard Ghanem,
Abstract summary: Online Multimodal Conversational Response Generation (OMCRG) is a novel task designed to produce synchronized verbal and non-verbal listener feedback online.<n>We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses.<n>We offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors.
Score: 62.19092662469285
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker's multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners' facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.

Related papers

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching [79.0241611035794]
CoVoMix2 is a framework for zero-shot multi-talker dialogue generation.<n>It predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model.<n>Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed.
arXiv Detail & Related papers (2025-06-01T07:51:45Z)
Beyond Words: Multimodal LLM Knows When to Speak [25.374878759869333]
We focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text.<n>We introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams.<n>We propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate.
arXiv Detail & Related papers (2025-05-20T17:42:34Z)
OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication [19.688375369516923]
We introduce an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios.<n>Our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
arXiv Detail & Related papers (2025-04-03T09:48:13Z)
SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation [17.56310064245171]
SALMON-omni is a speech understanding and generation model capable of simultaneously listening to its own generated speech sounds while speaking.<n> SALMON-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full- conversational AI systems.
arXiv Detail & Related papers (2024-11-27T08:38:57Z)
OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
End-to-end GPT-based model OmniFlatten capable of effectively modeling complex behaviors inherent natural conversations with low latency.<n>Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full- spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z)
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoic,e an LLM designed with intrinsic real-time voice interaction capabilities. Our novelty architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences. We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
Recent Advances in Speech Language Models: A Survey [45.968078636811356]
Speech Language Models (SpeechLMs) are end-to-end models that generate speech without converting from text.<n>This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs.
arXiv Detail & Related papers (2024-10-01T21:48:12Z)
Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines independent voice activity detection and text-to-speech. We show how Moshi Moshi can provide streaming speech recognition and text-to-speech. Our resulting model is first real-time full spoken large language model modality.
arXiv Detail & Related papers (2024-09-17T17:55:39Z)
PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems [7.326036800127981]
Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. generating a spoken response requires the prior generation of a written response, and speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech.
arXiv Detail & Related papers (2024-06-18T09:23:54Z)
A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X) NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework. It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
HeterMPC: A Heterogeneous Graph Neural Network for Response Generation in Multi-Party Conversations [76.64792382097724]
We present HeterMPC, a graph-based neural network for response generation in multi-party conversations (MPCs) HeterMPC models the semantics of utterances and interlocutors simultaneously with two types of nodes in a graph. Through multi-hop updating, HeterMPC can adequately utilize the structural knowledge of conversations for response generation.
arXiv Detail & Related papers (2022-03-16T09:50:32Z)
Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to the multiple inputs. Unlike speech-driven gesture or talking head generation, we introduce more modals in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.