AERO: Audio Super Resolution in the Spectral Domain
- URL: http://arxiv.org/abs/2211.12232v1
- Date: Tue, 22 Nov 2022 12:37:01 GMT
- Title: AERO: Audio Super Resolution in the Spectral Domain
- Authors: Moshe Mandel, Or Tal, Yossi Adi
- Abstract summary: We present AERO, an audio super-resolution model that processes speech and music signals in the spectral domain.
We optimize the model using both time and frequency domain loss functions.
We demonstrate high performance across a wide range of sample rates considering both speech and music.
- Score: 15.965382891955771
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present AERO, an audio super-resolution model that processes speech and
music signals in the spectral domain. AERO is based on an encoder-decoder
architecture with U-Net like skip connections. We optimize the model using both
time and frequency domain loss functions. Specifically, we consider a set of
reconstruction losses together with perceptual ones in the form of adversarial
and feature discriminator loss functions. To better handle phase information
the proposed method operates over the complex-valued spectrogram using two
separate channels. Unlike prior work which mainly considers low and high
frequency concatenation for audio super-resolution, the proposed method
directly predicts the full frequency range. We demonstrate high performance
across a wide range of sample rates considering both speech and music. AERO
outperforms the evaluated baselines considering Log-Spectral Distance, ViSQOL,
and the subjective MUSHRA test. Audio samples and code are available at
https://pages.cs.huji.ac.il/adiyoss-lab/aero
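The two-channel treatment of the complex spectrogram and the Log-Spectral Distance metric mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' code: the helper names (`stft`, `to_two_channels`, `from_two_channels`, `log_spectral_distance`), the FFT size, the hop length, and the test signal are all assumptions made for illustration, and the LSD formula follows a commonly used definition (per-frame root mean square difference of log power spectra, averaged over frames) that may differ in detail from the paper's evaluation setup.

```python
import numpy as np

def stft(signal, n_fft=512, hop=128):
    """Hann-windowed STFT (illustrative settings).

    Returns a complex array of shape (frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def to_two_channels(spec):
    """Stack real and imaginary parts as two real channels: (2, frames, bins)."""
    return np.stack([spec.real, spec.imag], axis=0)

def from_two_channels(channels):
    """Inverse of to_two_channels: rebuild the complex spectrogram."""
    return channels[0] + 1j * channels[1]

def log_spectral_distance(ref_spec, est_spec, eps=1e-10):
    """LSD under an assumed, commonly used definition.

    RMS over frequency of the log10 power difference, then mean over frames.
    """
    ref_log = np.log10(np.abs(ref_spec) ** 2 + eps)
    est_log = np.log10(np.abs(est_spec) ** 2 + eps)
    per_frame = np.sqrt(np.mean((ref_log - est_log) ** 2, axis=-1))
    return float(np.mean(per_frame))

# Hypothetical test signal: a low tone plus a high tone, sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
reference = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)
low_band = np.sin(2 * np.pi * 440 * t)  # same signal without the high tone

ref_spec = stft(reference)
channels = to_two_channels(ref_spec)    # real-valued input a network could consume
restored = from_two_channels(channels)  # lossless round trip back to complex

lsd_same = log_spectral_distance(ref_spec, ref_spec)
lsd_low = log_spectral_distance(ref_spec, stft(low_band))
```

Keeping real and imaginary parts as separate channels preserves phase exactly, which a magnitude-only representation would discard. In this sketch the LSD of a signal against itself is zero, and it grows when high-frequency content is missing.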
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation [0.0]
This study tackles the distinct separation of vocal components from musical spectrograms.
We employ the Short Time Fourier Transform (STFT) to extract audio waves into detailed frequency-time spectrograms.
We implement a UNet neural network to segment the spectrogram image, aiming to delineate and extract singing voice components accurately.
arXiv Detail & Related papers (2024-05-30T13:47:53Z) - TIM: A Time Interval Machine for Audio-Visual Action Recognition [64.24297230981168]
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder.
We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE.
arXiv Detail & Related papers (2024-04-08T14:30:42Z) - Towards Video Anomaly Retrieval from Video Anomaly Detection: New
Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Global Spectral Filter Memory Network for Video Object Segmentation [33.42697528492191]
This paper studies semi-supervised video object segmentation through boosting intra-frame interaction.
We propose Global Spectral Filter Memory network (GSFM), which improves intra-frame interaction through learning long-term spatial dependencies in the spectral domain.
arXiv Detail & Related papers (2022-10-11T16:02:02Z) - Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection).
In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features.
We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z) - Audio Spectral Enhancement: Leveraging Autoencoders for Low Latency
Reconstruction of Long, Lossy Audio Sequences [0.0]
We propose a novel approach for reconstructing higher frequencies from considerably longer sequences of low-quality MP3 audio waves.
Our architecture presents several bottlenecks while preserving the spectral structure of the audio wave via skip-connections.
We show how to leverage differential quantization techniques to reduce the initial model size by more than half while simultaneously reducing inference time.
arXiv Detail & Related papers (2021-08-08T18:06:21Z) - Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.