Proactive Hearing Assistants that Isolate Egocentric Conversations
- URL: http://arxiv.org/abs/2511.11473v1
- Date: Fri, 14 Nov 2025 16:44:48 GMT
- Title: Proactive Hearing Assistants that Isolate Egocentric Conversations
- Authors: Guilin Hu, Malek Itani, Tuochao Chen, Shyamnath Gollakota
- Abstract summary: We introduce proactive hearing assistants that automatically identify and separate the wearer's conversation partners. Our system operates on egocentric audio and uses the wearer's self-speech as an anchor. Our work marks a step toward hearing assistants that adapt proactively to conversational dynamics and engagement.
- Score: 9.444316926459196
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce proactive hearing assistants that automatically identify and separate the wearer's conversation partners, without requiring explicit prompts. Our system operates on egocentric binaural audio and uses the wearer's self-speech as an anchor, leveraging turn-taking behavior and dialogue dynamics to infer conversational partners and suppress others. To enable real-time, on-device operation, we propose a dual-model architecture: a lightweight streaming model runs every 12.5 ms for low-latency extraction of the conversation partners, while a slower model runs less frequently to capture longer-range conversational dynamics. Results on real-world 2- and 3-speaker conversation test sets, collected with binaural egocentric hardware from 11 participants totaling 6.8 hours, show generalization in identifying and isolating conversational partners in multi-conversation settings. Our work marks a step toward hearing assistants that adapt proactively to conversational dynamics and engagement. More information can be found on our website: https://proactivehearing.cs.washington.edu/
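The dual-model architecture in the abstract can be sketched as a dual-rate processing loop. This is a minimal illustration, not the authors' code: the scalar `partner_gain`, the frame rate constants, and both update rules are placeholder assumptions standing in for learned neural models.

```python
# Sketch of a dual-rate pipeline: a fast path runs on every 12.5 ms frame for
# low-latency extraction, while a slow path refreshes the conversation-partner
# estimate from a longer context window about once per second.

FRAME_MS = 12.5       # hop size of the fast streaming model
SLOW_EVERY_N = 80     # slow model fires once per 80 frames (~1 s of audio)

def process_stream(frames):
    """Run the fast path per frame; refresh the partner estimate slowly."""
    partner_gain = 1.0          # stand-in for a learned partner embedding
    history, outputs = [], []
    for i, frame in enumerate(frames):
        history.append(frame)
        if i > 0 and i % SLOW_EVERY_N == 0:
            # Slow path: infer conversational dynamics from longer context
            # (here just a smoothed average, as a placeholder).
            recent = history[-SLOW_EVERY_N:]
            partner_gain = 0.9 * partner_gain + 0.1 * (sum(recent) / len(recent))
        # Fast path: low-latency extraction anchored on the current estimate.
        outputs.append(frame * partner_gain)
    return outputs
```

The design point this illustrates is that the expensive context model never sits on the latency-critical path: the per-frame output depends only on the most recent (possibly stale) partner estimate.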
Related papers
- Hear You in Silence: Designing for Active Listening in Human Interaction with Conversational Agents Using Context-Aware Pacing [17.874659591744486]
"Active listening" is overlooked in the design of Conversational Agents (CAs). We distill five context-aware pacing strategies: Reflective Silence, Facilitative Silence, Empathic Silence, Holding Space, and Immediate Response. This work shows how insights from human conversation, like context-aware pacing, can empower the design of more empathic human-AI communication.
arXiv Detail & Related papers (2026-02-05T19:08:06Z)
- F-Actor: Controllable Conversational Behaviour in Full-Duplex Models [70.48189107402145]
We present the first open, instruction-following full-duplex conversational speech model that can be trained efficiently under typical academic resource constraints. Our model requires just 2,000 hours of data, without relying on large-scale or multi-stage pretraining. Both the model and training code will be released to enable reproducible research on controllable full-duplex speech systems.
arXiv Detail & Related papers (2026-01-16T14:25:57Z)
- TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation [72.46711449668814]
We introduce TAVID, a unified framework that generates both interactive faces and conversational speech in a synchronized manner. We evaluate our system across four dimensions: talking face realism, listening head responsiveness, dyadic interaction, and speech quality.
arXiv Detail & Related papers (2025-12-23T12:04:23Z)
- Chronological Thinking in Full-Duplex Spoken Dialogue Language Models [66.84843878538207]
Chronological Thinking aims to improve response quality in full-duplex spoken dialogue language models (SDLMs). It adds no additional latency: once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations.
arXiv Detail & Related papers (2025-10-02T10:28:11Z)
- MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features. Our evaluation of 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z)
- Aligning Spoken Dialogue Models from User Interactions [55.192134724622235]
We propose a novel preference alignment framework to improve spoken dialogue models on real-time conversations from user interactions. We create a dataset of more than 150,000 preference pairs from raw multi-turn speech conversations annotated with AI feedback. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.
arXiv Detail & Related papers (2025-06-26T16:45:20Z)
- DualTalk: Dual-Speaker Interaction for 3D Talking Head Conversations [18.419225973482423]
Existing 3D talking head generation models focus solely on speaking or listening. We propose a new task, multi-round dual-speaker interaction for 3D talking head generation. We introduce DualTalk, a novel unified framework that integrates the dynamic behaviors of speakers and listeners.
arXiv Detail & Related papers (2025-05-23T16:49:05Z)
- LLAMAPIE: Proactive In-Ear Conversation Assistants [9.312108526830665]
We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations.
arXiv Detail & Related papers (2025-05-07T02:08:56Z)
- Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics [54.03209351287654]
We propose a novel evaluation protocol that can assess a spoken dialogue system's turn-taking capabilities. We present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events. We will open-source our evaluation platform to promote the development of advanced conversational AI systems.
arXiv Detail & Related papers (2025-03-03T04:46:04Z)
- Target conversation extraction: Source separation using turn-taking dynamics [23.189364779538757]
We introduce the novel task of target conversation extraction, where the goal is to extract the audio of a target conversation based on the speaker embedding of one of its participants.
Using neural networks, we show the feasibility of our approach on English and Mandarin conversation datasets.
In the presence of interfering speakers, our results show an 8.19 dB improvement in signal-to-noise ratio for 2-speaker conversations and a 7.92 dB improvement for 2-4-speaker conversations.
arXiv Detail & Related papers (2024-07-15T22:55:27Z)
- Interactive Conversational Head Generation [68.76774230274076]
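The dB figures reported for target conversation extraction above come from comparing signal-to-noise ratios before and after separation. A minimal sketch of that computation, using toy numbers rather than the paper's data:

```python
import math

def snr_db(signal, error):
    """Signal-to-noise ratio in dB: 10 * log10(signal power / error power)."""
    p_sig = sum(s * s for s in signal)
    p_err = sum(e * e for e in error)
    return 10.0 * math.log10(p_sig / p_err)

# Toy example: separation shrinks the interfering-speaker residual from
# amplitude 0.5 to amplitude 0.2 relative to a unit-amplitude target.
target = [1.0] * 100
snr_in = snr_db(target, [0.5] * 100)    # SNR before separation (~6.0 dB)
snr_out = snr_db(target, [0.2] * 100)   # SNR after separation (~14.0 dB)
improvement = snr_out - snr_in          # ~8 dB gain
```

An "8.19 dB improvement" in the entry above is this same output-minus-input difference, averaged over the test set.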
We introduce a new conversation head generation benchmark for synthesizing behaviors of a single interlocutor in a face-to-face conversation.
The capability to automatically synthesize interlocutors that can participate in long, multi-turn conversations is vital and offers benefits for various applications.
arXiv Detail & Related papers (2023-07-05T08:06:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.