EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild
- URL: http://arxiv.org/abs/2502.14892v1
- Date: Mon, 17 Feb 2025 04:47:12 GMT
- Title: EgoSpeak: Learning When to Speak for Egocentric Conversational Agents in the Wild
- Authors: Junhyeok Kim, Min Soo Kim, Jiwan Chung, Jungbin Cho, Jisoo Kim, Sungwoong Kim, Gyeongbo Sim, Youngjae Yu
- Abstract summary: EgoSpeak is a novel framework for real-time speech initiation prediction in egocentric streaming video. By modeling the conversation from the speaker's first-person viewpoint, EgoSpeak is tailored for human-like interactions. EgoSpeak outperforms random and silence-based baselines in real time.
- Score: 20.84372784454967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting when to initiate speech in real-world environments remains a fundamental challenge for conversational agents. We introduce EgoSpeak, a novel framework for real-time speech initiation prediction in egocentric streaming video. By modeling the conversation from the speaker's first-person viewpoint, EgoSpeak is tailored for human-like interactions in which a conversational agent must continuously observe its environment and dynamically decide when to talk. Our approach bridges the gap between simplified experimental setups and complex natural conversations by integrating four key capabilities: (1) first-person perspective, (2) RGB processing, (3) online processing, and (4) untrimmed video processing. We also present YT-Conversation, a diverse collection of in-the-wild conversational videos from YouTube, as a resource for large-scale pretraining. Experiments on EasyCom and Ego4D demonstrate that EgoSpeak outperforms random and silence-based baselines in real time. Our results also highlight the importance of multimodal input and context length in effectively deciding when to speak.
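To make the online decision task concrete, below is a minimal illustrative sketch (not the authors' released code) of a streaming speech-initiation predictor: per-frame features from an egocentric RGB stream are pushed into a sliding context window and scored at every step for whether the agent should start speaking now. The GRU encoder, feature dimension, context length, and decision threshold are all assumptions made for illustration.
```python
# Minimal sketch of online speech-initiation prediction over a streaming
# egocentric video. The model architecture, window length, and threshold
# below are illustrative assumptions, not the paper's implementation.
from collections import deque

import torch
import torch.nn as nn


class SpeechInitiationScorer(nn.Module):
    """Scores a short window of frame features and outputs P(initiate speech)."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (1, T, feat_dim) -- the most recent T frame features
        _, h = self.encoder(window)
        return torch.sigmoid(self.head(h[-1]))  # (1, 1) probability


def run_stream(frame_features, context_len: int = 64, threshold: float = 0.5):
    """Consume an untrimmed stream of per-frame features and yield decisions."""
    model = SpeechInitiationScorer().eval()
    buffer: deque = deque(maxlen=context_len)  # sliding context window
    with torch.no_grad():
        for feat in frame_features:            # feat: (feat_dim,) per RGB frame
            buffer.append(feat)
            window = torch.stack(list(buffer)).unsqueeze(0)
            p = model(window).item()
            yield p > threshold                # True -> initiate speech now
```
The sketch covers only the visual stream; the abstract's findings on multimodal input and context length suggest that audio features and a longer window would feed the same per-frame decision loop.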
Related papers
- INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations [11.101103116878438]
We propose INFP, a novel audio-driven head generation framework for dyadic interaction. INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage. To facilitate this line of research, we introduce DyConv, a large-scale dataset of rich dyadic conversations collected from the Internet.
arXiv Detail & Related papers (2024-12-05T10:20:34Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components, such as voice activity detection and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z) - The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective [36.09288501153965]
We introduce the Ego-Exocentric Conversational Graph Prediction problem.
We propose a unified multi-modal framework -- Audio-Visual Conversational Attention (AV-CONV)
Specifically, we adopt the self-attention mechanism to model representations across time, across subjects, and across modalities.
arXiv Detail & Related papers (2023-12-20T09:34:22Z) - Interactive Conversational Head Generation [68.76774230274076]
We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation.
The capability to automatically synthesize interlocutors that can participate in long and multi-turn conversations is vital and offers benefits for various applications.
arXiv Detail & Related papers (2023-07-05T08:06:26Z) - Channel-aware Decoupling Network for Multi-turn Dialogue Comprehension [81.47133615169203]
We propose compositional learning for holistic interaction across utterances beyond the sequential contextualization from PrLMs.
We employ domain-adaptive training strategies to help the model adapt to the dialogue domains.
Experimental results show that our method substantially boosts the strong PrLM baselines in four public benchmark datasets.
arXiv Detail & Related papers (2023-01-10T13:18:25Z) - DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation [54.84137342837465]
Face-to-face conversations account for the vast majority of daily conversations.
Most existing methods focused on single-person talking head generation.
We propose a novel unified framework based on neural radiance field (NeRF).
arXiv Detail & Related papers (2022-03-15T14:16:49Z) - Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to the multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities in this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z) - Intelligent Conversational Android ERICA Applied to Attentive Listening and Job Interview [41.789773897391605]
We have developed an intelligent conversational android ERICA.
We set up several social interaction tasks for ERICA, including attentive listening, job interview, and speed dating.
ERICA has been evaluated with 40 senior participants, who engaged in conversations of 5-7 minutes without a conversational breakdown.
arXiv Detail & Related papers (2021-05-02T06:37:23Z)