HRTFformer: A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering
- URL: http://arxiv.org/abs/2510.01891v1
- Date: Thu, 02 Oct 2025 10:59:21 GMT
- Title: HRTFformer: A Spatially-Aware Transformer for Personalized HRTF Upsampling in Immersive Audio Rendering
- Authors: Xuyi Hu, Jian Li, Shaojie Zhang, Stefan Goetz, Lorenzo Picinali, Ozgur B. Akan, Aidan O. T. Hogg,
- Abstract summary: We propose a novel transformer-based architecture for HRTF upsampling.<n>Our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy.<n> Experiments show that our model surpasses leading methods by a substantial margin in generating realistic, high-fidelity HRTFs.
- Score: 13.189906008527613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personalized Head-Related Transfer Functions (HRTFs) are starting to be introduced in many commercial immersive audio applications and are crucial for realistic spatial audio rendering. However, one of the main hesitations regarding their introduction is that creating personalized HRTFs is impractical at scale due to the complexities of the HRTF measurement process. To mitigate this drawback, HRTF spatial upsampling has been proposed with the aim of reducing measurements required. While prior work has seen success with different machine learning (ML) approaches, these models often struggle with long-range spatial consistency and generalization at high upsampling factors. In this paper, we propose a novel transformer-based architecture for HRTF upsampling, leveraging the attention mechanism to better capture spatial correlations across the HRTF sphere. Working in the spherical harmonic (SH) domain, our model learns to reconstruct high-resolution HRTFs from sparse input measurements with significantly improved accuracy. To enhance spatial coherence, we introduce a neighbor dissimilarity loss that promotes magnitude smoothness, yielding more realistic upsampling. We evaluate our method using both perceptual localization models and objective spectral distortion metrics. Experiments show that our model surpasses leading methods by a substantial margin in generating realistic, high-fidelity HRTFs.
Related papers
- Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction [16.426476430697587]
We present a novel approach to predict the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture.<n>Our model has shown higher correlation and lower mean squared error for both seen and unseen scenarios.
arXiv Detail & Related papers (2026-02-17T10:46:54Z) - UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass [83.7071371474926]
UniSH is a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction.<n>Our framework bridges strong, disparate priors from scene reconstruction and HMR.<n>Our model achieves state-of-the-art performance on human-centric scene reconstruction.
arXiv Detail & Related papers (2026-01-03T16:06:27Z) - QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR)<n>We use a calibration dataset to measure both spatial and temporal complexity for each layer.<n>We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z) - A Machine Learning Approach for Denoising and Upsampling HRTFs [5.954160581274925]
Head-Related Transfer Functions (HRTFs) capture how sound reaches our ears, reflecting unique anatomical features and enhancing spatial perception.<n>It has been shown that personalized HRTFs improve localization accuracy, but their measurement remains time-consuming and requires a noise-free environment.<n>This paper proposes a method to address this constraint by presenting a novel technique that can upsample sparse, noisy HRTF measurements.<n>The proposed method achieves a log-spectral distortion (LSD) error of 5.41 dB and a cosine similarity loss of 0.0070, demonstrating the method's effectiveness in HRTF upsampling.
arXiv Detail & Related papers (2025-04-24T14:17:57Z) - FreSca: Scaling in Frequency Space Enhances Diffusion Models [55.75504192166779]
This paper explores frequency-based control within latent diffusion models.<n>We introduce FreSca, a novel framework that decomposes noise difference into low- and high-frequency components.<n>FreSca operates without any model retraining or architectural change, offering model- and task-agnostic control.
arXiv Detail & Related papers (2025-04-02T22:03:11Z) - Enhanced Super-Resolution Training via Mimicked Alignment for Real-World Scenes [51.92255321684027]
We propose a novel plug-and-play module designed to mitigate misalignment issues by aligning LR inputs with HR images during training.
Specifically, our approach involves mimicking a novel LR sample that aligns with HR while preserving the characteristics of the original LR samples.
We comprehensively evaluate our method on synthetic and real-world datasets, demonstrating its effectiveness across a spectrum of SR models.
arXiv Detail & Related papers (2024-10-07T18:18:54Z) - HRTF Estimation using a Score-based Prior [20.62078965099636]
We present a head-related transfer function estimation method based on a score-based diffusion model.
The HRTF is estimated in reverberant environments using natural excitation signals, e.g. human speech.
We show that the diffusion prior can account for the large variability of high-frequency content in HRTFs.
arXiv Detail & Related papers (2024-10-02T14:00:41Z) - Multi-Fidelity Residual Neural Processes for Scalable Surrogate Modeling [19.60087366873302]
Multi-fidelity surrogate modeling aims to learn an accurate surrogate at the highest fidelity level.
Deep learning approaches utilize neural network based encoders and decoders to improve scalability.
We propose Multi-fidelity Residual Neural Processes (MFRNP), a novel multi-fidelity surrogate modeling framework.
arXiv Detail & Related papers (2024-02-29T04:40:25Z) - HRTF upsampling with a generative adversarial network using a gnomonic
equiangular projection [3.921666645870036]
This paper demonstrates how generative adversarial networks (GANs) can be applied to HRTF upsampling.
We propose a novel approach that transforms the HRTF data for direct use with a convolutional super-resolution generative adversarial network (SRGAN)
Experimental results show that the proposed method outperforms all three baselines in terms of log-spectral distortion (LSD) and localisation performance.
arXiv Detail & Related papers (2023-06-09T11:05:09Z) - Hierarchical Integration Diffusion Model for Realistic Image Deblurring [71.76410266003917]
Diffusion models (DMs) have been introduced in image deblurring and exhibited promising performance.
We propose the Hierarchical Integration Diffusion Model (HI-Diff), for realistic image deblurring.
Experiments on synthetic and real-world blur datasets demonstrate that our HI-Diff outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T12:18:20Z) - Spectral Enhanced Rectangle Transformer for Hyperspectral Image
Denoising [64.11157141177208]
We propose a spectral enhanced rectangle Transformer to model the spatial and spectral correlation in hyperspectral images.
For the former, we exploit the rectangle self-attention horizontally and vertically to capture the non-local similarity in the spatial domain.
For the latter, we design a spectral enhancement module that is capable of extracting global underlying low-rank property of spatial-spectral cubes to suppress noise.
arXiv Detail & Related papers (2023-04-03T09:42:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.