EgoSound: Benchmarking Sound Understanding in Egocentric Videos
- URL: http://arxiv.org/abs/2602.14122v1
- Date: Sun, 15 Feb 2026 12:46:35 GMT
- Title: EgoSound: Benchmarking Sound Understanding in Egocentric Videos
- Authors: Bingwen Zhu, Yuqian Fu, Qiaole Dong, Guolei Sun, Tianwen Qian, Yuzheng Wu, Danda Pani Paudel, Xiangyang Xue, Yanwei Fu
- Abstract summary: We introduce EgoSound, the first benchmark designed to evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning.
- Score: 68.1897133235638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.
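As a rough illustration of how a QA benchmark of this kind might be consumed, below is a minimal per-task accuracy sketch. The field names (video_id, task, question, options, answer) and the JSON layout are assumptions for illustration only; the paper does not specify EgoSound's release format.

```python
# Minimal sketch of evaluating a model on an EgoSound-style QA file.
# NOTE: the schema (task, question, options, answer) is assumed for
# illustration; it is not taken from the EgoSound paper.
import json
from collections import defaultdict
from typing import Callable

def evaluate(qa_path: str, predict: Callable[[dict], str]) -> dict:
    """Return per-task accuracy; `predict` maps one QA record to an answer string."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(qa_path) as f:
        qa_pairs = json.load(f)          # list of QA dicts, one per question
    for qa in qa_pairs:
        total[qa["task"]] += 1           # e.g. "spatial_localization", "causal_inference"
        if predict(qa) == qa["answer"]:
            correct[qa["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

# Example usage with a trivial baseline that always picks the first option:
if __name__ == "__main__":
    print(evaluate("egosound_qa.json", lambda qa: qa["options"][0]))
```

A per-task breakdown like this matches how the abstract frames results (strong on some tasks, weak on fine-grained spatial and causal ones), rather than a single aggregate score.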
Related papers
- Learning Situated Awareness in the Real World [63.75211123289058]
SAW-Bench is a novel benchmark for evaluating egocentric situated awareness using real-world videos. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash.
arXiv Detail & Related papers (2026-02-18T18:22:52Z) - EgoAVU: Egocentric Audio-Visual Understanding [66.1760617001607]
EgoAVU is a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% performance improvement on EgoAVU-Bench.
arXiv Detail & Related papers (2026-02-05T19:16:55Z) - EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding [80.64794443484698]
EgoIllusion is the first benchmark to evaluate hallucinations of MLLMs in egocentric videos. EgoIllusion comprises 1,400 videos paired with 8,000 human-annotated open- and closed-ended questions. Evaluations across ten MLLMs reveal significant challenges: even powerful models such as GPT-4o and Gemini achieve only 59% accuracy.
arXiv Detail & Related papers (2025-08-18T07:39:55Z) - Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z) - Egocentric Audio-Visual Object Localization [51.434212424829525]
We propose a geometry-aware temporal aggregation module to handle egomotion explicitly.
The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations.
It improves cross-modal localization robustness by disentangling the visually indicated audio representation.
arXiv Detail & Related papers (2023-03-23T17:43:11Z)