PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
- URL: http://arxiv.org/abs/2510.22439v2
- Date: Wed, 29 Oct 2025 15:42:33 GMT
- Title: PromptReverb: Multimodal Room Impulse Response Generation Through Latent Rectified Flow Matching
- Authors: Ali Vosoughi, Yongyi Zang, Qihui Yang, Nathan Paek, Randal Leistikow, Chenliang Xu
- Abstract summary: Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our method enables practical applications in virtual reality, architectural acoustics, and audio production.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.
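The second stage described in the abstract trains a conditional transformer with rectified flow matching. As a rough illustration of that objective (not the authors' code; all names, shapes, and the oracle velocity field below are assumptions for the sketch), a data sample is joined to Gaussian noise by a straight line, and the model regresses the constant velocity along that line; sampling then integrates the learned velocity field from noise to data:

```python
import numpy as np

# Hedged sketch of a rectified-flow-matching objective, as used in models
# like PromptReverb's second stage. A target latent x1 is connected to a
# noise sample x0 by a straight path; the training target is the path's
# constant velocity (x1 - x0).

def interpolate(x0, x1, t):
    """Point on the straight path at time t, plus the velocity target."""
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0

def fm_loss(pred_velocity, x0, x1):
    """Mean-squared flow-matching loss against the straight-line velocity."""
    return float(np.mean((pred_velocity - (x1 - x0)) ** 2))

def euler_sample(x0, velocity_fn, steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    for i in range(steps):
        t = i / steps
        x = x + velocity_fn(x, t) / steps
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal(16)   # noise sample
    x1 = rng.standard_normal(16)   # target latent (stand-in for an RIR latent)
    # With the true (constant) velocity field, Euler integration follows the
    # straight path exactly, so x0 is transported onto x1.
    x_hat = euler_sample(x0, lambda x, t: x1 - x0)
    print(np.allclose(x_hat, x1))
```

The appeal of the rectified (straight-path) formulation is visible in the demo: because the true trajectories are straight lines, a coarse ODE solver introduces no discretization error, which is what enables few-step inference.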
Related papers
- Reciprocal Latent Fields for Precomputed Sound Propagation [0.6474760227870046]
We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting acoustic parameters. We show that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude.
arXiv Detail & Related papers (2026-02-06T18:31:11Z)
- RIR-Former: Coordinate-Guided Transformer for Continuous Reconstruction of Room Impulse Responses [21.84404827658177]
RIR-Former is a grid-free, one-step feed-forward model for RIR reconstruction. By introducing a sinusoidal encoding module into a transformer backbone, our method effectively incorporates microphone position information. Experiments on diverse simulated acoustic environments demonstrate that RIR-Former consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-02T09:33:54Z)
- Explainable Transformer-CNN Fusion for Noise-Robust Speech Emotion Recognition [2.0391237204597363]
Speech Emotion Recognition systems often degrade in performance when exposed to unpredictable acoustic interference. We propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks.
arXiv Detail & Related papers (2025-12-20T10:05:58Z)
- One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution [54.485573602613535]
We present ODTSR, a one-step diffusion transformer that performs Real-ISR considering fidelity and controllability simultaneously. ODTSR employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Experiments demonstrate that ODTSR achieves state-of-the-art (SOTA) performance on generic Real-ISR.
arXiv Detail & Related papers (2025-11-21T11:00:59Z)
- Two Heads Are Better Than One: Audio-Visual Speech Error Correction with Dual Hypotheses [71.34350093068473]
This paper introduces a new paradigm for the generative error correction (GER) framework in audio-visual speech recognition (AVSR). Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. Our framework attains up to a 57.7% error rate gain on the LRS2 benchmark over a standard ASR baseline, whereas single-stream GER approaches achieve only a 10% gain.
arXiv Detail & Related papers (2025-10-15T08:27:16Z)
- DiffusionRIR: Room Impulse Response Interpolation using Diffusion Models [16.92449230293275]
High-quality RIR estimates drive applications such as virtual microphones, sound source localization, augmented reality, and data augmentation. This research addresses the challenge of estimating RIRs at unmeasured locations within a room using Denoising Diffusion Probabilistic Models (DDPM).
arXiv Detail & Related papers (2025-04-29T10:52:07Z)
- AV-RIR: Audio-Visual Room Impulse Response Estimation [49.469389715876915]
Accurate estimation of Room Impulse Response (RIR) is important for speech processing and AR/VR applications.
We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and visual cues of its corresponding environment.
arXiv Detail & Related papers (2023-11-30T22:58:30Z)
- RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios [36.50731790624643]
We introduce RIR-SF, a novel spatial feature based on the room impulse response (RIR).
RIR-SF significantly outperforms traditional 3D spatial features, showing superior theoretical and empirical performance.
We also propose an optimized all-neural multi-channel ASR framework for RIR-SF, achieving a relative 21.3% reduction in CER for target speaker ASR in multi-channel settings.
arXiv Detail & Related papers (2023-10-31T20:42:08Z)
- Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields [61.07542274267568]
This letter proposes a novel Neural Acoustic Context Field approach, called NACF, to parameterize an audio scene.
Driven by the unique properties of RIR, we design a temporal correlation module and multi-scale energy decay criterion.
Experimental results show that NACF outperforms existing field-based methods by a notable margin.
arXiv Detail & Related papers (2023-09-27T19:50:50Z)
- Synthetic Wave-Geometric Impulse Responses for Improved Speech Dereverberation [69.1351513309953]
We show that accurately simulating the low-frequency components of Room Impulse Responses (RIRs) is important to achieving good dereverberation.
We demonstrate that speech dereverberation models trained on hybrid synthetic RIRs outperform models trained on RIRs generated by prior geometric ray tracing methods.
arXiv Detail & Related papers (2022-12-10T20:15:23Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)