Overview of the L3DAS23 Challenge on Audio-Visual Extended Reality
- URL: http://arxiv.org/abs/2402.09245v1
- Date: Wed, 14 Feb 2024 15:34:28 GMT
- Title: Overview of the L3DAS23 Challenge on Audio-Visual Extended Reality
- Authors: Christian Marinoni, Riccardo Fosco Gramaccioni, Changan Chen, Aurelio
Uncini, Danilo Comminiello
- Abstract summary: The primary goal of the L3DAS23 Signal Processing Grand Challenge at ICASSP 2023 is to promote and support collaborative research on machine learning for 3D audio signal processing.
We provide a brand-new dataset, which maintains the same general characteristics as the L3DAS21 and L3DAS22 datasets.
We propose updated baseline models for both tasks, which now accept audio-image pairs as input, together with a supporting API to replicate our results.
- Score: 15.034352805342937
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The primary goal of the L3DAS23 Signal Processing Grand Challenge at ICASSP
2023 is to promote and support collaborative research on machine learning for
3D audio signal processing, with a specific emphasis on 3D speech enhancement
and 3D Sound Event Localization and Detection in Extended Reality applications.
As part of our latest competition, we provide a brand-new dataset, which
maintains the same general characteristics as the L3DAS21 and L3DAS22 datasets,
but with first-order Ambisonics recordings from multiple reverberant simulated
environments. Moreover, we start exploring an audio-visual scenario by
providing images of these environments, as perceived from the different
microphone positions and orientations. We also propose updated baseline models
for both tasks, which now accept audio-image pairs as input, along with a
supporting API to replicate our results. Finally, we present the results of the
participants. Further details about the challenge are available at
https://www.l3das.com/icassp2023.
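The baselines for both tasks take an audio-image pair as input: a four-channel first-order Ambisonics recording plus the image rendered from the corresponding microphone position and orientation. As a rough, hedged sketch of what such an input pipeline and a simple late-fusion model could look like (file names, feature choices, fusion strategy, and class count are illustrative assumptions, not the official baseline or API):
```python
# Illustrative sketch only: load one first-order Ambisonics (FOA) recording and
# the viewpoint image, then fuse them in a toy late-fusion classifier.
# Everything here (paths, fusion design, class count) is a placeholder.
import soundfile as sf
import torch
import torchvision
from PIL import Image

def load_foa_and_image(wav_path, img_path, n_fft=512):
    audio, sr = sf.read(wav_path)                    # (time, 4) FOA channels
    audio = torch.from_numpy(audio.T).float()        # (4, time)
    # Magnitude STFT per Ambisonics channel -> (4, freq, frames)
    spec = torch.stft(audio, n_fft=n_fft, hop_length=n_fft // 2,
                      return_complex=True).abs()
    tfm = torchvision.transforms.Compose([
        torchvision.transforms.Resize((224, 224)),
        torchvision.transforms.ToTensor(),
    ])
    image = tfm(Image.open(img_path).convert("RGB")) # (3, 224, 224)
    return spec, image

class AudioVisualBaseline(torch.nn.Module):
    """Toy late-fusion model: ResNet image encoder + pooled spectrogram encoder."""
    def __init__(self, n_classes=14):                # placeholder class count
        super().__init__()
        self.image_enc = torchvision.models.resnet18(weights=None)
        self.image_enc.fc = torch.nn.Identity()      # 512-d image embedding
        self.audio_enc = torch.nn.Sequential(
            torch.nn.AdaptiveAvgPool2d((16, 16)),
            torch.nn.Flatten(),
            torch.nn.Linear(4 * 16 * 16, 512),
        )
        self.head = torch.nn.Linear(512 + 512, n_classes)

    def forward(self, spec, image):   # spec: (B, 4, F, T), image: (B, 3, 224, 224)
        fused = torch.cat([self.audio_enc(spec), self.image_enc(image)], dim=-1)
        return self.head(fused)
```
In the actual challenge, data handling and evaluation are provided by the official API; the sketch above only illustrates how the audio and image branches of an input pair can be coupled.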
Related papers
- 3D Audio-Visual Segmentation [44.61476023587931]
Recognizing the sounding objects in scenes is a longstanding objective in embodied AI, with diverse applications in robotics and AR/VR/MR.
We propose a new approach, EchoSegnet, characterized by integrating the ready-to-use knowledge from pretrained 2D audio-visual foundation models.
Experiments demonstrate that EchoSegnet can effectively segment sounding objects in 3D space on our new benchmark, representing a significant advancement in the field of embodied AI.
arXiv Detail & Related papers (2024-11-04T16:30:14Z)
- Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time [73.7845280328535]
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio.
Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking.
We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
arXiv Detail & Related papers (2024-07-01T23:32:25Z)
- Novel-View Acoustic Synthesis from 3D Reconstructed Rooms [17.72902700567848]
We investigate the benefit of combining blind audio recordings with 3D scene information for novel-view acoustic synthesis.
We identify the main challenges of novel-view acoustic synthesis as sound source localization, separation, and dereverberation.
We show that incorporating room impulse responses (RIRs) derived from 3D reconstructed rooms enables the same network to jointly tackle these tasks.
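The role of the room impulse responses can be made concrete with the basic relationship they encode: the reverberant signal reaching a receiver is, to a first approximation, the dry source signal convolved with the RIR for that source/receiver pair. A minimal sketch with placeholder file names (not the paper's code):
```python
# Minimal sketch: simulate the reverberant signal at a receiver by convolving a
# dry (anechoic) mono source with a room impulse response. Paths are placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

dry, sr = sf.read("dry_speech.wav")   # anechoic source signal (mono assumed)
rir, _ = sf.read("room_ir.wav")       # RIR, e.g. derived from a reconstructed room

reverberant = fftconvolve(dry, rir)[: len(dry)]
reverberant /= np.max(np.abs(reverberant)) + 1e-9   # normalize to avoid clipping
sf.write("reverberant.wav", reverberant, sr)
```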
arXiv Detail & Related papers (2023-10-23T17:34:31Z)
- Team AcieLee: Technical Report for EPIC-SOUNDS Audio-Based Interaction Recognition Challenge 2023 [8.699868810184752]
The task is to classify the audio caused by interactions between objects, or from events of the camera wearer.
We conducted extensive experiments and found that learning-rate step decay, freezing the backbone, label smoothing, and focal loss contribute most to the performance improvement.
This proposed method allowed us to achieve 3rd place in the CVPR 2023 workshop of EPIC-SOUNDS Audio-Based Interaction Recognition Challenge.
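To make the listed ingredients concrete, here is a hedged PyTorch sketch of the same training recipe (frozen backbone, learning-rate step decay, label smoothing, focal loss); the model, class count, and hyperparameters are illustrative assumptions, not the team's exact configuration:
```python
# Sketch of the training recipe named above; all hyperparameters are placeholders.
import torch
import torch.nn.functional as F
import torchvision

num_classes = 44                                     # placeholder class count
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
for p in model.parameters():                         # freeze the pretrained backbone
    p.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # trainable head

optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01, momentum=0.9)
# Learning-rate step decay: multiply the lr by 0.1 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

def focal_loss(logits, targets, gamma=2.0, smoothing=0.1):
    """Label-smoothed cross-entropy, down-weighted on easy (confident) examples."""
    ce = F.cross_entropy(logits, targets, label_smoothing=smoothing, reduction="none")
    pt = torch.exp(-ce)                              # rough confidence on the true class
    return ((1.0 - pt) ** gamma * ce).mean()
```
After each training epoch, scheduler.step() applies the decay.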
arXiv Detail & Related papers (2023-06-15T09:49:07Z)
- AV-NeRF: Learning Neural Fields for Real-World Audio-Visual Scene Synthesis [61.07542274267568]
We study a new task -- real-world audio-visual scene synthesis -- and a first-of-its-kind NeRF-based approach for multimodal learning.
We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF.
We present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields.
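A hedged illustration of such a source-centric coordinate transformation (the frame conventions and function are assumptions for illustration, not the paper's implementation): re-express the listener's position and view direction relative to the sound source, so the acoustic field is learned over source-relative geometry rather than absolute world coordinates.
```python
# Sketch of a source-centric coordinate transformation: express the listener's
# pose relative to the sound source. Frame conventions are illustrative assumptions.
import numpy as np

def source_relative_pose(listener_pos, view_dir, source_pos):
    """Return (distance to source, yaw of source in the listener frame, offset in source frame)."""
    offset = np.asarray(listener_pos, float) - np.asarray(source_pos, float)
    distance = np.linalg.norm(offset)

    to_source = -offset / (distance + 1e-9)          # unit vector listener -> source
    v = np.asarray(view_dir, float)
    v = v / (np.linalg.norm(v) + 1e-9)
    # Signed angle (about the vertical axis) between view direction and source direction
    yaw = np.arctan2(v[0] * to_source[1] - v[1] * to_source[0],
                     v[0] * to_source[0] + v[1] * to_source[1])
    return distance, yaw, offset

# Listener 2 m east of the source, looking north: source lies 90 degrees to the left
print(source_relative_pose([2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]))
```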
arXiv Detail & Related papers (2023-02-04T04:17:19Z)
- SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning [127.1119359047849]
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments.
It generates highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations.
SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
arXiv Detail & Related papers (2022-06-16T17:17:44Z)
- DSGN++: Exploiting Visual-Spatial Relation for Stereo-based 3D Detectors [60.88824519770208]
Camera-based 3D object detectors are attractive due to their broader deployability and lower cost compared with LiDAR sensors.
We revisit the stereo volume construction of the prior stereo model DSGN, which represents both 3D geometry and semantics.
We propose our approach, DSGN++, aiming to improve information flow throughout the 2D-to-3D pipeline.
arXiv Detail & Related papers (2022-04-06T18:43:54Z)
- L3DAS22 Challenge: Learning 3D Audio Sources in a Real Office Environment [12.480610577162478]
The L3DAS22 Challenge is aimed at encouraging the development of machine learning strategies for 3D speech enhancement and 3D sound localization and detection.
This challenge improves and extends the tasks of the L3DAS21 edition.
arXiv Detail & Related papers (2022-02-21T17:05:39Z)
- L3DAS21 Challenge: Machine Learning for 3D Audio Signal Processing [6.521891605165917]
The L3DAS21 Challenge is aimed at encouraging and fostering collaborative research on machine learning for 3D audio signal processing.
We release the L3DAS21 dataset, a 65-hour 3D audio corpus, accompanied by a Python API that facilitates data usage and the results-submission stage.
arXiv Detail & Related papers (2021-04-12T14:29:54Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sight and sound to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.