SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality
- URL: http://arxiv.org/abs/2602.00793v1
- Date: Sat, 31 Jan 2026 16:01:32 GMT
- Title: SpeechLess: Micro-utterance with Personalized Spatial Memory-aware Assistant in Everyday Augmented Reality
- Authors: Yoonsang Kim, Devshree Jadeja, Divyansh Pradhan, Yalong Yang, Arie Kaufman
- Abstract summary: SpeechLess is a wearable AR assistant that introduces a speech-based intent control paradigm grounded in personalized spatial memory. Our results indicate that SpeechLess can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.
- Score: 6.523396381538382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaking aloud to a wearable AR assistant in public can be socially awkward, and re-articulating the same requests every day creates unnecessary effort. We present SpeechLess, a wearable AR assistant that introduces a speech-based intent granularity control paradigm grounded in personalized spatial memory. SpeechLess helps users "speak less" while still obtaining the information they need, and supports gradually making intent more explicit when more complex expression is required. SpeechLess binds prior interactions to multimodal personal context (space, time, activity, and referents) to form spatial memories, and leverages them to extrapolate missing intent dimensions from under-specified user queries. This enables users to dynamically adjust how explicitly they express their informational needs, from full-utterance to micro/zero-utterance interaction. We motivate our design through a week-long formative study using a commercial smart glasses platform, revealing discomfort with public voice use, frustration with repetitive speech, and hardware constraints. Building on these insights, we design SpeechLess and evaluate it through controlled lab and in-the-wild studies. Our results indicate that regulated speech-based interaction can improve everyday information access, reduce articulation effort, and support socially acceptable use without substantially degrading perceived usability or intent resolution accuracy across diverse everyday environments.
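To make the memory-binding and intent-extrapolation idea concrete, the sketch below is an illustrative reconstruction, not the authors' implementation; the class names, fields, and exact-match scoring heuristic are assumptions standing in for the learned multimodal matching the abstract describes.

```python
# Illustrative sketch only: all names and the matching heuristic are assumptions,
# not SpeechLess's actual code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpatialMemory:
    """One prior interaction bound to multimodal personal context."""
    place: str            # e.g. "bus stop"
    time_of_day: str      # e.g. "morning"
    activity: str         # e.g. "commuting"
    referent: str         # object or entity the query was about
    resolved_intent: str  # the fully specified request that was answered

@dataclass
class Query:
    """A possibly under-specified user utterance plus current context."""
    utterance: str        # may be empty for zero-utterance interaction
    place: str
    time_of_day: str
    activity: str

def extrapolate_intent(query: Query, memories: list[SpatialMemory]) -> Optional[str]:
    """Fill missing intent dimensions by scoring stored memories against the
    current context; a real system would use learned embeddings instead of
    exact string matches."""
    def score(m: SpatialMemory) -> int:
        s = 0
        s += m.place == query.place
        s += m.time_of_day == query.time_of_day
        s += m.activity == query.activity
        s += m.referent.lower() in query.utterance.lower()
        return s

    if not memories:
        return None
    best = max(memories, key=score)
    return best.resolved_intent if score(best) > 0 else None

# Example: a micro-utterance ("bus?") at the usual stop resolves to the full request.
memories = [SpatialMemory("bus stop", "morning", "commuting", "bus",
                          "When does the next bus to campus arrive?")]
q = Query("bus?", "bus stop", "morning", "commuting")
print(extrapolate_intent(q, memories))
```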
Related papers
- Proactive Conversational Assistant for a Procedural Manual Task based on Audio and IMU [7.116403133334644]
We propose a real-time conversational assistant that provides comprehensive guidance for a procedural task using only lightweight privacy-preserving modalities. This assistant proactively communicates step-by-step instructions to a user performing a furniture assembly task, and answers user questions.
arXiv Detail & Related papers (2026-02-17T16:41:51Z) - Plug-and-Play Clarifier: A Zero-Shot Multimodal Framework for Egocentric Intent Disambiguation [60.63465682731118]
The performance of egocentric AI agents is fundamentally limited by multimodal intent ambiguity. We introduce the Plug-and-Play Clarifier, a zero-shot and modular framework that decomposes the problem into discrete, solvable sub-tasks. Our framework improves the intent clarification performance of small language models by approximately 30%, making them competitive with significantly larger counterparts.
arXiv Detail & Related papers (2025-11-12T04:28:14Z) - MultiVox: A Benchmark for Evaluating Voice Assistants for Multimodal Interactions [70.93364531054273]
We introduce MultiVox, the first benchmark to evaluate the ability of voice assistants to integrate spoken and visual cues. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features. Our evaluation on 10 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
arXiv Detail & Related papers (2025-07-14T23:20:42Z) - Spatial Audio Processing with Large Language Model on Wearable Devices [6.345647878712574]
We present a novel system architecture, SING, that incorporates spatial speech understanding into large language models (LLMs). SING supports spatially aware automatic speech recognition (ASR), achieving a mean error of $25.72^\circ$, a substantial improvement over the $88.52^\circ$ median error in existing work, with a word error rate (WER) of 5.3. SING also supports soundscaping, for example inferring how many people are talking and their directions, for up to 5 people with a median DoA error of $16^\circ$.
arXiv Detail & Related papers (2025-04-11T18:19:59Z) - Gesture-Aware Zero-Shot Speech Recognition for Patients with Language Disorders [10.664605070306417]
We propose a gesture-aware Automatic Speech Recognition (ASR) system with zero-shot learning for individuals with speech impairments. Experimental results and analyses show that including gesture information significantly enhances semantic understanding.
arXiv Detail & Related papers (2025-02-18T14:15:55Z) - OpenOmni: Advancing Open-Source Omnimodal Large Language Models with Progressive Multimodal Alignment and Real-Time Self-Aware Emotional Speech Synthesis [95.27191872116306]
OpenOmni is a two-stage training framework that integrates omnimodal alignment and speech generation. It surpasses state-of-the-art models across omnimodal, vision-language, and speech-language benchmarks. OpenOmni achieves real-time speech generation with 1s latency in non-autoregressive mode.
arXiv Detail & Related papers (2025-01-08T15:18:09Z) - Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
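As a rough illustration of the cross-attention fusion described above, the module below is a generic sketch, not the authors' architecture; the dimensions, layer choices, and regression head are assumptions.

```python
# A minimal PyTorch sketch (not the paper's code): linguistic token states query
# acoustic frame states via cross-attention, and a regression head estimates the
# remaining time to end-of-utterance. Dimensions are illustrative.
import torch
import torch.nn as nn

class EOUPredictor(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.time_to_eou = nn.Linear(d_model, 1)   # seconds until end of utterance

    def forward(self, linguistic: torch.Tensor, acoustic: torch.Tensor) -> torch.Tensor:
        # linguistic: (batch, n_tokens, d_model) states of tokens decoded so far
        # acoustic:   (batch, n_frames, d_model) encoder frame states
        fused, _ = self.cross_attn(query=linguistic, key=acoustic, value=acoustic)
        # Use the latest token's fused state to estimate remaining duration.
        return self.time_to_eou(fused[:, -1])      # (batch, 1)

model = EOUPredictor()
pred = model(torch.randn(2, 8, 256), torch.randn(2, 50, 256))
print(pred.shape)  # torch.Size([2, 1])
```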
arXiv Detail & Related papers (2024-09-30T06:29:58Z) - Augmented Conversation with Embedded Speech-Driven On-the-Fly Referencing in AR [16.50212867051533]
This paper introduces the concept of augmented conversation.
It aims to support co-located in-person conversations via embedded speech-driven on-the-fly referencing in augmented reality (AR).
arXiv Detail & Related papers (2024-05-28T19:10:47Z) - Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach for Automatic Speech Recognition. We use a memory-enhanced ASR model from the literature to decode new words from the slides. We show that with this approach, we obtain increasing performance on the new words when they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z) - Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study [68.88536866933038]
Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies.
Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations.
Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length.
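The two compression steps named here are easy to illustrate; the snippet below is a generic sketch of run-length de-duplication plus one BPE-style merge over discrete unit IDs, not the paper's exact pipeline.

```python
# Illustrative sketch of the two compression steps (not the paper's code): collapse
# consecutive duplicate discrete units, then merge the most frequent adjacent pair
# of units into a new symbol, in the style of byte-pair encoding.
from itertools import groupby
from collections import Counter

def deduplicate(units: list[int]) -> list[int]:
    """Remove runs of repeated units, e.g. [7, 7, 7, 3, 3, 9] -> [7, 3, 9]."""
    return [u for u, _ in groupby(units)]

def merge_most_frequent_pair(units: list[int], new_id: int) -> list[int]:
    """One BPE-style merge: replace the most frequent adjacent pair with a new symbol."""
    pairs = Counter(zip(units, units[1:]))
    if not pairs:
        return units
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(units):
        if i + 1 < len(units) and units[i] == a and units[i + 1] == b:
            out.append(new_id)
            i += 2
        else:
            out.append(units[i])
            i += 1
    return out

seq = [7, 7, 7, 3, 3, 9, 7, 3, 9]
print(deduplicate(seq))                                         # [7, 3, 9, 7, 3, 9]
print(merge_most_frequent_pair(deduplicate(seq), new_id=100))   # e.g. [100, 9, 100, 9]
```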
arXiv Detail & Related papers (2023-09-27T17:21:13Z) - LipLearner: Customizable Silent Speech Interactions on Mobile Devices [15.445920726854595]
We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort.
Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset.
A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme.
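A hedged sketch of how few-shot command customization with online refinement might look over fixed lip-reading embeddings follows; the encoder, class names, and update rule are assumptions, not LipLearner's implementation.

```python
# Illustrative sketch only: each user-defined command is represented by the mean of a
# few embeddings, new clips are classified by cosine similarity to those centroids,
# and confirmed examples incrementally refine the centroids online.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class FewShotCommandClassifier:
    def __init__(self):
        self.centroids: dict[str, np.ndarray] = {}

    def register(self, command: str, example_embeddings: list[np.ndarray]) -> None:
        """Enroll a new silent command from a handful of user-recorded examples."""
        self.centroids[command] = np.mean(example_embeddings, axis=0)

    def update(self, command: str, embedding: np.ndarray, weight: float = 0.1) -> None:
        """Online incremental refinement as more confirmed examples arrive."""
        self.centroids[command] = (1 - weight) * self.centroids[command] + weight * embedding

    def classify(self, embedding: np.ndarray) -> str:
        return max(self.centroids, key=lambda c: cosine(self.centroids[c], embedding))

# Usage with random stand-ins for lip-reading encoder outputs:
rng = np.random.default_rng(0)
clf = FewShotCommandClassifier()
clf.register("open maps", [rng.normal(size=128) for _ in range(5)])
clf.register("play music", [rng.normal(size=128) for _ in range(5)])
print(clf.classify(rng.normal(size=128)))
```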
arXiv Detail & Related papers (2023-02-12T13:10:57Z) - Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
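For intuition, a minimal sketch of the general adversarial-autoencoder recipe is given below; the layer sizes and the gradient-reversal trick are assumptions, and the paper's exact setup may differ.

```python
# Illustrative sketch: an autoencoder reconstructs speech features while an adversarial
# speaker classifier on the latent code is trained through a gradient-reversal layer,
# pushing the code (and hence the reconstruction) to hide speaker identity.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output          # flip gradients flowing back into the encoder

class DeidAutoencoder(nn.Module):
    def __init__(self, feat_dim: int = 80, latent_dim: int = 32, n_speakers: int = 100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        self.speaker_head = nn.Linear(latent_dim, n_speakers)

    def forward(self, x):
        z = self.encoder(x)
        recon = self.decoder(z)
        spk_logits = self.speaker_head(GradReverse.apply(z))
        return recon, spk_logits

model = DeidAutoencoder()
x = torch.randn(4, 80)                          # stand-in for speech feature vectors
spk = torch.randint(0, 100, (4,))               # stand-in speaker labels
recon, logits = model(x)
loss = nn.functional.mse_loss(recon, x) + nn.functional.cross_entropy(logits, spk)
loss.backward()
```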
arXiv Detail & Related papers (2020-11-09T19:22:05Z)