Back to Ear: Perceptually Driven High Fidelity Music Reconstruction
- URL: http://arxiv.org/abs/2509.14912v2
- Date: Thu, 06 Nov 2025 07:21:38 GMT
- Title: Back to Ear: Perceptually Driven High Fidelity Music Reconstruction
- Authors: Kangdi Wang, Zhiyue Wu, Dinghao Zhou, Rui Lin, Junyu Dai, Tao Jiang
- Abstract summary: εar-VAE is an open-source music signal reconstruction model that rethinks and optimizes the Variational Autoencoder (VAE) training paradigm. Experiments show εar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics.
- Score: 4.380428073231143
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representation. To address these challenges, we propose εar-VAE, an open-source music signal reconstruction model that rethinks and optimizes the VAE training paradigm. Our contributions are threefold: (i) A K-weighting perceptual filter applied prior to loss calculation to align the objective with auditory perception. (ii) Two novel phase losses: a Correlation Loss for stereo coherence, and a Phase Loss using the phase derivatives (Instantaneous Frequency and Group Delay) for precision. (iii) A new spectral supervision paradigm where magnitude is supervised by all four Mid/Side/Left/Right components, while phase is supervised only by the LR components. Experiments show εar-VAE at 44.1kHz substantially outperforms leading open-source models across diverse metrics, showing particular strength in reconstructing high-frequency harmonics and spatial characteristics.
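The supervision split described in the abstract (magnitude supervised on all four Mid/Side/Left/Right components, phase only on Left/Right) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual loss: it uses a naive STFT, an L1 log-magnitude term, and an instantaneous-frequency phase term, and omits the K-weighting filter, correlation loss, and group-delay term. All function names are hypothetical.

```python
import numpy as np

def stft(x, n_fft=1024, hop=256):
    """Naive STFT with a Hann window; returns a complex (frames, bins) array."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def mag_loss(ref, rec):
    """L1 distance between log-magnitude spectrograms."""
    return np.mean(np.abs(np.log1p(np.abs(stft(ref))) - np.log1p(np.abs(stft(rec)))))

def if_loss(ref, rec):
    """Phase loss on the instantaneous frequency (frame-to-frame phase derivative)."""
    dphi_ref = np.diff(np.angle(stft(ref)), axis=0)
    dphi_rec = np.diff(np.angle(stft(rec)), axis=0)
    # Wrap the difference to (-pi, pi] before comparing.
    return np.mean(np.abs(np.angle(np.exp(1j * (dphi_ref - dphi_rec)))))

def ear_vae_style_loss(ref_lr, rec_lr):
    """Sketch of the supervision split: magnitude on M/S/L/R, phase on L/R only."""
    L_ref, R_ref = ref_lr
    L_rec, R_rec = rec_lr
    # Mid/Side decomposition of the stereo pair.
    M_ref, S_ref = (L_ref + R_ref) / 2, (L_ref - R_ref) / 2
    M_rec, S_rec = (L_rec + R_rec) / 2, (L_rec - R_rec) / 2
    mag = sum(mag_loss(a, b) for a, b in
              [(M_ref, M_rec), (S_ref, S_rec), (L_ref, L_rec), (R_ref, R_rec)])
    phase = if_loss(L_ref, L_rec) + if_loss(R_ref, R_rec)
    return mag + phase
```

Supervising magnitude on Mid/Side as well as Left/Right penalizes errors in the stereo image, while restricting phase supervision to the physical L/R channels avoids the numerically fragile phase of the (possibly near-silent) Side signal.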
Related papers
- Flexible Gravitational-Wave Parameter Estimation with Transformers [73.44614054040267]
We introduce a flexible transformer-based architecture paired with a training strategy that enables adaptation to diverse analysis settings at inference time. We demonstrate that a single flexible model -- called Dingo-T1 -- can analyze 48 gravitational-wave events from the third LIGO-Virgo-KAGRA Observing Run.
arXiv Detail & Related papers (2025-12-02T17:49:08Z)
- Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective [73.86108756585857]
We analyze encoder/decoder behaviors and find that decoders depend strongly on high-frequency latent components to recover details. We introduce FreqWarm, a plug-and-play frequency warm-up curriculum that increases early-stage exposure to high-frequency latent signals.
arXiv Detail & Related papers (2025-11-27T09:20:36Z)
- SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection [6.042897432654865]
SONAR (Spectral-cONtrastive Audio Residuals) is a frequency-guided framework for generalizable deepfake audio detection. SONAR disentangles an audio signal into complementary representations. The method is evaluated on the ASVspoof 2021 and in-the-wild benchmarks.
arXiv Detail & Related papers (2025-11-26T12:16:38Z)
- SWAN: Self-supervised Wavelet Neural Network for Hyperspectral Image Unmixing [0.2624902795082451]
We present SWAN: a three-stage, self-supervised wavelet neural network for estimating endmembers and abundances from hyperspectral imagery. The idea is to exploit latent symmetries in the invariant and covariant features thus obtained, using a self-supervised learning paradigm. Experiments are conducted on two benchmark synthetic data sets with different signal-to-noise ratios, as well as on three real benchmark hyperspectral data sets.
arXiv Detail & Related papers (2025-10-26T10:05:48Z)
- prNet: Data-Driven Phase Retrieval via Stochastic Refinement [0.0]
We propose a novel framework for phase retrieval that leverages Langevin dynamics to enable efficient posterior sampling. Our method navigates the perception-distortion tradeoff through a combination of sampling, learned denoising, and model-based updates.
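As a minimal illustration of the sampling technique named above (not the prNet pipeline itself, which combines it with learned denoising and model-based updates), unadjusted Langevin dynamics draws samples from a distribution using only the gradient of its log-density:

```python
import numpy as np

def langevin_sample(grad_log_p, x0, step=1e-2, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics.

    Iterates x <- x + (step/2) * grad log p(x) + sqrt(step) * noise,
    whose stationary distribution approximates p(x) for small steps.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        x = x + 0.5 * step * grad_log_p(x) + np.sqrt(step) * rng.standard_normal(x.shape)
    return x
```

For example, with `grad_log_p = lambda x: -(x - 3.0)` (the score of a unit Gaussian centered at 3), long chains concentrate around 3 with roughly unit variance; in a posterior-sampling setting the score would instead come from the measurement model plus a prior.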
arXiv Detail & Related papers (2025-07-13T12:25:06Z)
- SpINRv2: Implicit Neural Representation for Passband FMCW Radars [0.15193212081459279]
We present SpINRv2, a neural framework for high-fidelity volumetric reconstruction using Frequency-Modulated Continuous-Wave (FMCW) radar. Our core contribution is a fully differentiable frequency-domain forward model that captures the complex radar response using closed-form synthesis. We introduce sparsity and regularization to disambiguate sub-bin ambiguities that arise at fine range resolutions.
arXiv Detail & Related papers (2025-06-09T19:21:27Z)
- Blind Estimation of Sub-band Acoustic Parameters from Ambisonics Recordings using Spectro-Spatial Covariance Features [10.480691005356967]
We propose a unified framework that blindly estimates reverberation time (T60), direct-to-reverberant ratio (DRR), and clarity (C50) across 10 frequency bands. The proposed framework utilizes a novel feature named the Spectro-Spatial Covariance Vector (SSCV), which efficiently represents the temporal, spectral, and spatial information of the FOA signal.
arXiv Detail & Related papers (2024-11-05T15:20:23Z)
- Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z)
- DiffusionAD: Norm-guided One-step Denoising Diffusion for Anomaly Detection [80.20339155618612]
DiffusionAD is a novel anomaly detection pipeline comprising a reconstruction sub-network and a segmentation sub-network. A rapid one-step denoising paradigm achieves hundreds of times acceleration while preserving comparable reconstruction quality. Considering the diversity in the manifestation of anomalies, we propose a norm-guided paradigm to integrate the benefits of multiple noise scales.
arXiv Detail & Related papers (2023-03-15T16:14:06Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- FAMLP: A Frequency-Aware MLP-Like Architecture For Domain Generalization [73.41395947275473]
We propose a novel frequency-aware architecture, in which the domain-specific features are filtered out in the transformed frequency domain.
Experiments on three benchmarks demonstrate significant performance gains, outperforming the state-of-the-art methods by margins of 3%, 4%, and 9%, respectively.
arXiv Detail & Related papers (2022-03-24T07:26:29Z)
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure from normality function of real samples computed in the spectral domain of Schur decomposition.
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.