Related papers: Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization

URL: http://arxiv.org/abs/2511.13487v2
Date: Tue, 18 Nov 2025 13:25:04 GMT
Title: Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization
Authors: Davoud Shariat Panah, Alessandro Ragano, Dan Barry, Jan Skoglund, Andrew Hines
Abstract summary: This study focuses on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features.
Score: 47.16858222861157
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.
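
The four feature types named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the STFT settings (`n_fft`, `hop`), the dB-ratio form of ILD, the cross-spectrum form of IPD, the `eps` stabilizer, and the helper names `stft`, `binaural_features`, and `stack_features` are all assumptions chosen for illustration, using common textbook definitions.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Naive STFT: Hann-windowed frames, one-sided FFT. Returns (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def binaural_features(left, right, n_fft=512, hop=256, eps=1e-8):
    """Compute per-channel magnitude/phase plus ILD and IPD (assumed definitions)."""
    spec_l = stft(left, n_fft, hop)
    spec_r = stft(right, n_fft, hop)
    mag_l, mag_r = np.abs(spec_l), np.abs(spec_r)
    return {
        "mag_left": mag_l,
        "mag_right": mag_r,
        "phase_left": np.angle(spec_l),
        "phase_right": np.angle(spec_r),
        # ILD: left/right level ratio in dB; eps avoids log(0) in silent bins.
        "ild": 20.0 * np.log10((mag_l + eps) / (mag_r + eps)),
        # IPD: phase of the cross-spectrum, wrapped to (-pi, pi].
        "ipd": np.angle(spec_l * np.conj(spec_r)),
    }


def stack_features(feats, names):
    """Stack the chosen feature maps as channels: (channels, frames, bins)."""
    return np.stack([feats[name] for name in names], axis=0)


# Toy binaural signal: the right ear is half as loud and lags by pi/4 at 500 Hz.
sr = 16000
t = np.arange(sr) / sr
left = np.sin(2 * np.pi * 500 * t)
right = 0.5 * np.sin(2 * np.pi * 500 * t - np.pi / 4)

feats = binaural_features(left, right)
x_two = stack_features(feats, ["ild", "ipd"])
x_rich = stack_features(feats, ["mag_left", "mag_right", "ild", "ipd"])
print(x_two.shape, x_rich.shape)
```

Here `x_two` corresponds to the two-feature ILD + IPD input and `x_rich` to an input that adds the channel magnitude spectrograms, mirroring the kinds of feature combinations the abstract compares; how the paper actually arranges these maps for its CNN is not stated here.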

Related papers

Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture [0.0]
This paper presents a novel approach for predicting the tongue and lip articulatory features underlying given speech acoustics. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data.
arXiv Detail & Related papers (2025-04-25T05:57:22Z)
FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models. We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components. FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z)
Spatial-Spectral Diffusion Contrastive Representation Network for Hyperspectral Image Classification [8.600534616819333]
This paper presents a Spatial-Spectral Diffusion Contrastive Representation Network (DiffCRN). DiffCRN is based on a denoising diffusion probabilistic model (DDPM) combined with contrastive learning (CL) for hyperspectral image classification. Experiments conducted on four widely used HSI datasets demonstrate the improved performance of the proposed DiffCRN.
arXiv Detail & Related papers (2025-02-27T02:34:23Z)
Frequency Domain Enhanced U-Net for Low-Frequency Information-Rich Image Segmentation in Surgical and Deep-Sea Exploration Robots [34.28684917337352]
We address the differences in frequency band sensitivity between CNNs and the human visual system. We propose a wavelet adaptive spectrum fusion (WASF) method inspired by biological vision mechanisms to balance cross-frequency image features. We develop the FE-UNet model, which employs a SAM2 backbone network and incorporates fine-tuned Hiera-Large modules to ensure segmentation accuracy.
arXiv Detail & Related papers (2025-02-06T07:24:34Z)
Optimizing Speech Multi-View Feature Fusion through Conditional Computation [51.23624575321469]
Self-supervised learning (SSL) features provide lightweight and versatile multi-view speech representations. SSL features conflict with traditional spectral features like FBanks in terms of update directions. We propose a novel generalized feature fusion framework grounded in conditional computation.
arXiv Detail & Related papers (2025-01-14T12:12:06Z)
Hybrid Convolutional and Attention Network for Hyperspectral Image Denoising [54.110544509099526]
Hyperspectral image (HSI) denoising is critical for the effective analysis and interpretation of hyperspectral data. We propose a hybrid convolution and attention network (HCANet) to enhance HSI denoising. Experimental results on mainstream HSI datasets demonstrate the rationality and effectiveness of the proposed HCANet.
arXiv Detail & Related papers (2024-03-15T07:18:43Z)
Embedded feature selection in LSTM networks with multi-objective evolutionary ensemble learning for time series forecasting [49.1574468325115]
We present a novel feature selection method embedded in Long Short-Term Memory networks. Our approach optimizes the weights and biases of the LSTM in a partitioned manner. Experimental evaluations on air quality time series data from Italy and southeast Spain demonstrate that our method substantially improves the generalization ability of conventional LSTMs.
arXiv Detail & Related papers (2023-12-29T08:42:10Z)
Feature Aggregation in Joint Sound Classification and Localization Neural Networks [0.0]
Current state-of-the-art sound source localization deep learning networks lack feature aggregation within their architecture. We adapt feature aggregation techniques from computer vision neural networks to signal detection neural networks.
arXiv Detail & Related papers (2023-10-29T16:37:14Z)
Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments. We implement this algorithm in a real-time robotic system with a microphone array. The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.