WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables
- URL: http://arxiv.org/abs/2601.02391v1
- Date: Thu, 25 Dec 2025 06:39:21 GMT
- Title: WearVox: An Egocentric Multichannel Voice Assistant Benchmark for Wearables
- Authors: Zhaojiang Lin, Yong Xu, Kai Sun, Jing Zheng, Yin Huang, Surya Teja Appini, Krish Narang, Renjie Tao, Ishan Kapil Jain, Siddhant Arora, Ruizhi Li, Yiteng Huang, Kaushik Patnaik, Wenfang Xu, Suwon Shon, Yue Liu, Ahmed A Aly, Anuj Kumar, Florian Metze, Xin Luna Dong
- Abstract summary: WearVox is the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. It comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks. We benchmark leading proprietary and open-source speech Large Language Models (SLLMs) and find that most real-time SLLMs achieve accuracies ranging from 29% to 59%.
- Score: 46.73480840435705
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Wearable devices such as AI glasses are transforming voice assistants into always-available, hands-free collaborators that integrate seamlessly with daily life, but they also introduce challenges like egocentric audio affected by motion and noise, rapid micro-interactions, and the need to distinguish device-directed speech from background conversations. Existing benchmarks largely overlook these complexities, focusing instead on clean or generic conversational audio. To bridge this gap, we present WearVox, the first benchmark designed to rigorously evaluate voice assistants in realistic wearable scenarios. WearVox comprises 3,842 multi-channel, egocentric audio recordings collected via AI glasses across five diverse tasks including Search-Grounded QA, Closed-Book QA, Side-Talk Rejection, Tool Calling, and Speech Translation, spanning a wide range of indoor and outdoor environments and acoustic conditions. Each recording is accompanied by rich metadata, enabling nuanced analysis of model performance under real-world constraints. We benchmark leading proprietary and open-source speech Large Language Models (SLLMs) and find that most real-time SLLMs achieve accuracies on WearVox ranging from 29% to 59%, with substantial performance degradation on noisy outdoor audio, underscoring the difficulty and realism of the benchmark. Additionally, we conduct a case study with two new SLLMs that perform inference with single-channel and multi-channel audio, demonstrating that multi-channel audio inputs significantly enhance model robustness to environmental noise and improve discrimination between device-directed and background speech. Our results highlight the critical importance of spatial audio cues for context-aware voice assistants and establish WearVox as a comprehensive testbed for advancing wearable voice AI research.
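The case study contrasts models fed a single microphone channel with models fed all channels. The paper's SLLMs consume raw multi-channel audio directly; as an illustration of why spatial cues help, the sketch below compares a mono baseline with simple delay-and-sum beamforming. The file name, channel count, and per-microphone delays are hypothetical, since the abstract does not specify WearVox's data format.

```python
# Illustrative sketch only, not the authors' pipeline: contrast a
# single-channel baseline with a crude multi-channel combination of an
# egocentric recording. File name, channel layout, and delays are
# hypothetical; the abstract does not describe WearVox's on-disk format.
import numpy as np
import soundfile as sf

audio, sr = sf.read("wearvox_example.wav")   # assumed shape (T, C), C mics

# Single-channel baseline: keep only the first microphone.
mono = audio[:, 0]

# Delay-and-sum beamforming toward the wearer's mouth. A real system would
# estimate per-channel delays from the microphone geometry or via
# cross-correlation; here the sample delays are assumed known.
delays = [0, 2, 4, 1][: audio.shape[1]]      # hypothetical per-mic delays
beamformed = np.zeros(audio.shape[0])
for ch, d in enumerate(delays):
    beamformed += np.roll(audio[:, ch], -d)  # time-align, then sum
beamformed /= len(delays)

# Coherent summation reinforces the wearer's (device-directed) speech and
# attenuates off-axis background talkers, which is one mechanism by which
# multi-channel input can aid side-talk rejection.
```

Feeding either the mono signal or the raw multi-channel tensor to an SLLM mirrors the two conditions compared in the case study.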
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B end-to-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z) - Fun-Audio-Chat Technical Report [71.07966678560291]
The gap in temporal resolution between speech tokens (25 Hz) and text tokens (3 Hz) creates semantic information mismatch and incurs high computational costs. We introduce Fun-Audio-Chat, a large audio-language model designed to address this gap. Fun-Audio-Chat 8B and MoE 30B-A3B achieve competitive performance on speech-text and speech-to-speech tasks.
arXiv Detail & Related papers (2025-12-23T08:35:27Z) - Audio MultiChallenge: A Multi-Turn Evaluation of Spoken Dialogue Systems on Natural Human Interaction [12.216811577733125]
We introduce Audio MultiChallenge, an open-source benchmark that evaluates end-to-end (E2E) spoken dialogue systems under natural multi-turn interaction patterns. A new axis, Voice Editing, tests robustness to mid-utterance speech repairs and backtracking. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline.
arXiv Detail & Related papers (2025-12-16T19:26:44Z) - VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing [45.15289852736435]
VoiceAssistant-Eval comprises 10,497 curated examples spanning 13 task categories. To demonstrate its utility, we evaluate 21 open-source models and GPT-4o-Audio. Results reveal three key findings, among them that proprietary models do not universally outperform open-source models.
arXiv Detail & Related papers (2025-09-26T17:59:59Z) - USAD: Universal Speech and Audio Representation via Distillation [56.91647396619358]
Universal Speech and Audio Distillation (USAD) is a unified approach to audio representation learning. USAD integrates diverse audio types - speech, sound, and music - into a single model.
arXiv Detail & Related papers (2025-06-23T17:02:00Z) - Cocktail-Party Audio-Visual Speech Recognition [58.222892601847924]
This study introduces a novel audio-visual cocktail-party dataset designed to benchmark current AVSR systems. We contribute a 1,526-hour AVSR dataset comprising both talking-face and silent-face segments, enabling significant performance gains in cocktail-party environments. Our approach reduces WER from 119% to 39.2% in extreme noise, a 67% relative reduction over the state of the art ((119 - 39.2) / 119 ≈ 0.67), without relying on explicit segmentation cues.
arXiv Detail & Related papers (2025-06-02T19:07:51Z) - Multi-Stage Speaker Diarization for Noisy Classrooms [1.4549461207028445]
This study investigates the effectiveness of multi-stage diarization models built on Nvidia's NeMo diarization pipeline. We assess the impact of denoising on diarization accuracy and compare various voice activity detection (VAD) models. We also explore a hybrid VAD approach that integrates Automatic Speech Recognition (ASR) word-level timestamps with frame-level VAD predictions; a toy sketch of this fusion appears after the list.
arXiv Detail & Related papers (2025-05-16T05:35:06Z) - Robust Active Speaker Detection in Noisy Environments [29.785749048315616]
We formulate a robust active speaker detection (rASD) problem in noisy environments.
Existing ASD approaches leverage both audio and visual modalities, but non-speech sounds in the surrounding environment can negatively impact performance.
We propose a novel framework that utilizes audio-visual speech separation as guidance to learn noise-free audio features.
arXiv Detail & Related papers (2024-03-27T20:52:30Z) - EasyCom: An Augmented Reality Dataset to Support Algorithms for Easy Communication in Noisy Environments [43.05826988957987]
We release a dataset containing over 5 hours of multi-modal data for training and testing algorithms that improve conversations for an AR glasses wearer.
We provide speech intelligibility, quality and signal-to-noise ratio improvement results for a baseline method and show improvements across all tested metrics.
arXiv Detail & Related papers (2021-07-09T02:00:47Z) - Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of self-supervised speech representation models.
We show that Audio ALBERT achieves performance competitive with those much larger models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
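As noted in the Multi-Stage Speaker Diarization entry above, here is a toy sketch of the hybrid VAD fusion it describes. The OR-style merge rule, frame length, and threshold are assumptions for illustration; the abstract does not specify the paper's exact fusion method.

```python
# Toy sketch of hybrid voice activity detection: fuse frame-level VAD
# probabilities with ASR word-level timestamps. The OR-style rule, frame
# length, and threshold are illustrative assumptions, not the paper's
# exact method.
import numpy as np

FRAME_SEC = 0.02  # assumed 20 ms analysis frames

def hybrid_vad(vad_probs, word_spans, threshold=0.5):
    """vad_probs: per-frame speech probabilities, shape (n_frames,).
    word_spans: (start_sec, end_sec) pairs from ASR word timestamps."""
    speech = vad_probs >= threshold            # frame-level decision
    for start, end in word_spans:              # overlay ASR evidence:
        lo = int(round(start / FRAME_SEC))     # a recognized word
        hi = int(round(end / FRAME_SEC))       # implies speech there
        speech[lo:hi] = True
    return speech

# Frames the VAD missed under noise are recovered by a word timestamp.
probs = np.array([0.1, 0.2, 0.9, 0.3, 0.1, 0.8])
print(hybrid_vad(probs, [(0.06, 0.10)]))  # frames 3-4 flip to True
```

The ASR overlay only adds speech regions; a stricter fusion could also veto low-confidence VAD frames that no recognized word supports.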