PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs
- URL: http://arxiv.org/abs/2601.21124v1
- Date: Wed, 28 Jan 2026 23:39:31 GMT
- Title: PhaseCoder: Microphone Geometry-Agnostic Spatial Audio Understanding for Multimodal LLMs
- Authors: Artem Dementyev, Wazeer Zulfikar, Sinan Hersek, Pascal Getreuer, Anurag Kumar, Vivek Kumar,
- Abstract summary: We present PhaseCoder, a transformer-only spatial audio encoder. PhaseCoder takes raw audio and microphone coordinates as inputs to perform localization and produces robust spatial embeddings. We show our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription tasks from an arbitrary microphone array.
- Score: 9.985118023353897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder, a transformer-only spatial audio encoder that is agnostic to microphone geometry. PhaseCoder takes raw multichannel audio and microphone coordinates as inputs to perform localization and produces robust spatial embeddings. We demonstrate that Gemma 3n LLM can be fine-tuned to reason over "Spatial Audio Tokens" produced by PhaseCoder. We show our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription tasks from an arbitrary microphone array.
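The abstract's core idea is conditioning on raw multichannel audio plus microphone coordinates so the encoder is not tied to one array layout. The toy sketch below is not PhaseCoder's transformer architecture; it only illustrates, under illustrative assumptions (function name, FFT size), how per-pair phase differences and the pair's relative geometry can be packed into geometry-agnostic input features.

```python
import numpy as np

def phase_features(audio, mic_xyz, n_fft=256):
    """Toy geometry-agnostic spatial features (illustrative, not PhaseCoder).

    audio:   (n_mics, n_samples) multichannel waveform
    mic_xyz: (n_mics, 3) microphone coordinates in metres
    Returns one feature vector per microphone pair: inter-channel phase
    differences plus the pair's relative geometry as conditioning input.
    """
    spec = np.fft.rfft(audio, n=n_fft, axis=-1)         # (n_mics, n_bins)
    n_mics = audio.shape[0]
    feats = []
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            ipd = np.angle(spec[i] * np.conj(spec[j]))  # inter-channel phase diff
            rel = mic_xyz[i] - mic_xyz[j]               # pair geometry (metres)
            feats.append(np.concatenate([np.cos(ipd), np.sin(ipd), rel]))
    return np.stack(feats)  # (n_pairs, 2 * n_bins + 3)

rng = np.random.default_rng(0)
audio = rng.standard_normal((4, 1024))           # 4-mic array, arbitrary layout
mic_xyz = rng.uniform(-0.1, 0.1, size=(4, 3))
tokens = phase_features(audio, mic_xyz)
print(tokens.shape)                              # (6, 261): 6 pairs, 129 bins
```

Because the microphone positions enter as explicit inputs rather than being baked into the weights, the same extractor works for any array, which is the property the paper's "arbitrary microphone array" claim rests on.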
Related papers
- LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence [35.123477091633866]
LAMB is an audio captioning framework that bridges the modality gap between audio embeddings and the text embedding space. A Cross-Modal Aligner minimizes Cauchy-Schwarz divergence while maximizing mutual information. A Two-Stream Adapter extracts semantically enriched audio embeddings, delivering richer information to the Cross-Modal Aligner.
arXiv Detail & Related papers (2026-01-08T07:05:35Z) - ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation [55.76423101183408]
ViSAudio is an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture. It generates high-quality audio with spatial immersion that adapts to viewpoint changes, sound-source motion, and diverse acoustic environments.
arXiv Detail & Related papers (2025-12-02T18:56:12Z) - Towards Audio Token Compression in Large Audio Language Models [26.379508239446935]
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. This paper explores techniques to reduce the number of audio tokens after they are generated by the LALM's audio encoder but before they are consumed by the LLM decoder.
arXiv Detail & Related papers (2025-11-26T02:00:38Z) - SPUR: A Plug-and-Play Framework for Integrating Spatial Audio Understanding and Reasoning into Large Audio-Language Models [62.14165748145729]
We introduce SPUR, a lightweight, plug-in approach that equips large audio-language models (LALMs) with spatial perception. SPUR consists of: (i) a First-Order Ambisonics (FOA) encoder that maps FOA channels to rotation-aware, listener-centric spatial features, integrated into target LALMs via a multimodal adapter; and (ii) SPUR-Set, a spatial QA dataset combining open-source FOA recordings with controlled simulations, emphasizing relative direction, elevation, distance, and overlap for supervised spatial reasoning.
arXiv Detail & Related papers (2025-11-10T01:29:26Z) - Complementary and Contrastive Learning for Audio-Visual Segmentation [74.11434759171199]
We present the Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets new state-of-the-art benchmarks across the S4, MS3, and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z) - Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and achieve new state-of-the-art results for the tasks of ASR and AVSR with WERs of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z) - wav2pos: Sound Source Localization using Masked Autoencoders [12.306126455995603]
We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem.
We show that such a formulation allows for accurate localization of the sound source, by reconstructing coordinates masked in the input.
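The set-to-set formulation described above can be sketched as data layout: each microphone contributes one set element of coordinates plus audio features, while the source element has its coordinates replaced by a mask value that a trained model would reconstruct. This is a hypothetical illustration of the input/target construction only, not the paper's implementation; `build_masked_set` and all shapes are assumptions.

```python
import numpy as np

def build_masked_set(mic_xyz, src_xyz, mic_feats, mask_token=0.0):
    """wav2pos-style masked set construction (illustrative sketch).

    Each row is [x, y, z, is_masked, feature...]. The source row's
    coordinates are masked in the input; they become the regression target.
    """
    n_mics = mic_xyz.shape[0]
    d_feat = mic_feats.shape[1]
    # Microphone rows: known coordinates, not masked, with audio features.
    mic_rows = np.concatenate(
        [mic_xyz, np.zeros((n_mics, 1)), mic_feats], axis=1)
    # Source row: coordinates replaced by the mask token, mask flag set.
    src_row = np.concatenate(
        [np.full(3, mask_token), [1.0], np.zeros(d_feat)])
    inputs = np.vstack([mic_rows, src_row])  # (n_mics + 1, 4 + d_feat)
    target = src_xyz                         # coordinates to reconstruct
    return inputs, target

rng = np.random.default_rng(0)
inputs, target = build_masked_set(
    mic_xyz=rng.uniform(-1, 1, (5, 3)),      # 5-mic ad-hoc array
    src_xyz=np.array([0.3, -0.2, 1.0]),      # unknown source position
    mic_feats=rng.standard_normal((5, 8)))   # per-mic audio embeddings
```

Because the input is an unordered set of elements rather than a fixed-size vector, the same model can in principle handle ad-hoc arrays with varying numbers of microphones, which is what makes the masked-autoencoder framing attractive here.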
arXiv Detail & Related papers (2024-08-28T13:09:20Z) - Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models [43.155061160275196]
We explore the possibility of making interactions with virtual assistants more natural by eliminating the need for a trigger phrase.
Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone.
We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder.
arXiv Detail & Related papers (2023-12-06T17:29:03Z) - Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
arXiv Detail & Related papers (2021-10-10T03:24:03Z) - Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks [14.942060304734497]
Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
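Mask-based MVDR beamforming, which the cleaned masks feed into, is itself a standard recipe: mask-weighted speech and noise spatial covariances, a steering vector from the speech covariance, and the distortionless-response solution w = Φₙ⁻¹d / (dᴴΦₙ⁻¹d). The sketch below shows that textbook computation for a single frequency bin; it is a generic illustration (the mask in the paper comes from MESSL cleaned by an LSTM), and the function name and regularization constants are assumptions.

```python
import numpy as np

def mvdr_weights(spec, speech_mask):
    """Mask-based MVDR beamformer for one frequency bin (textbook sketch).

    spec:        (n_mics, n_frames) complex STFT values at this bin
    speech_mask: (n_frames,) speech-presence mask in [0, 1]
    Returns beamforming weights w of shape (n_mics,).
    """
    noise_mask = 1.0 - speech_mask
    # Mask-weighted spatial covariance estimates.
    phi_s = (spec * speech_mask) @ spec.conj().T / max(speech_mask.sum(), 1e-8)
    phi_n = (spec * noise_mask) @ spec.conj().T / max(noise_mask.sum(), 1e-8)
    phi_n += 1e-6 * np.eye(spec.shape[0])        # diagonal loading for stability
    # Steering vector: principal eigenvector of the speech covariance.
    _, vecs = np.linalg.eigh(phi_s)
    d = vecs[:, -1]
    w = np.linalg.solve(phi_n, d)                # Phi_n^{-1} d
    return w / (d.conj() @ w)                    # normalize: w^H d = 1

rng = np.random.default_rng(1)
spec = rng.standard_normal((4, 100)) + 1j * rng.standard_normal((4, 100))
mask = rng.uniform(0, 1, 100)
w = mvdr_weights(spec, mask)
enhanced = w.conj() @ spec                       # beamformed single-channel bin
```

The quality of the masks directly determines the covariance estimates, which is why cleaning them with a learned speech model improves the final beamformer output.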
arXiv Detail & Related papers (2020-12-02T22:35:00Z) - Multi-microphone Complex Spectral Mapping for Utterance-wise and Continuous Speech Separation [79.63545132515188]
We propose multi-microphone complex spectral mapping for speaker separation in reverberant conditions.
Our system is trained on simulated room impulse responses based on a fixed number of microphones arranged in a given geometry.
State-of-the-art separation performance is obtained on the simulated two-talker SMS-WSJ corpus and the real-recorded LibriCSS dataset.
arXiv Detail & Related papers (2020-10-04T22:13:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.