Music Enhancement via Image Translation and Vocoding
- URL: http://arxiv.org/abs/2204.13289v1
- Date: Thu, 28 Apr 2022 05:00:07 GMT
- Title: Music Enhancement via Image Translation and Vocoding
- Authors: Nikhil Kandpal, Oriol Nieto, Zeyu Jin
- Abstract summary: This paper presents a deep learning approach to enhance low-quality music recordings.
We combine an image-to-image translation model for manipulating audio in its mel-spectrogram representation and a music vocoding model for mapping synthetically generated mel-spectrograms to perceptually realistic waveforms.
We find that this approach to music enhancement outperforms baselines which use classical methods for mel-spectrogram inversion and an end-to-end approach directly mapping noisy waveforms to clean waveforms.
- Score: 14.356705444361832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Consumer-grade music recordings such as those captured by mobile devices
typically contain distortions in the form of background noise, reverb, and
microphone-induced EQ. This paper presents a deep learning approach to enhance
low-quality music recordings by combining (i) an image-to-image translation
model for manipulating audio in its mel-spectrogram representation and (ii) a
music vocoding model for mapping synthetically generated mel-spectrograms to
perceptually realistic waveforms. We find that this approach to music
enhancement outperforms baselines which use classical methods for
mel-spectrogram inversion and an end-to-end approach directly mapping noisy
waveforms to clean waveforms. Additionally, in evaluating the proposed method
with a listening test, we analyze the reliability of common audio enhancement
evaluation metrics when used in the music domain.
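To make the two-stage design concrete, below is a minimal sketch of the pipeline under stated assumptions: the noisy recording is converted to a mel-spectrogram, a translation network manipulates it in the spectrogram domain, and a vocoding stage maps the result back to a waveform. The `translation_model` stub is a hypothetical placeholder, and Griffin-Lim stands in for the vocoder only so the sketch runs; the paper's point is precisely that a learned music vocoder outperforms such classical mel-spectrogram inversion.

```python
# Minimal sketch of the two-stage enhancement pipeline (not the paper's
# actual networks): mel-spectrogram translation followed by vocoding.
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 22050
N_FFT, HOP, N_MELS = 1024, 256, 80

# Stage (i) stand-in: an image-to-image translation network mapping a
# noisy mel-spectrogram to a clean one (identity here, for illustration).
translation_model = nn.Identity()

# Stage (ii) stand-in: Griffin-Lim inversion plays the role of the vocoder
# so this runs end to end; the paper replaces it with a trained model.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=N_FFT // 2 + 1, n_mels=N_MELS, sample_rate=SAMPLE_RATE)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=N_FFT, hop_length=HOP)

def enhance(noisy_waveform: torch.Tensor) -> torch.Tensor:
    """Enhance a (channels, samples) waveform via the two-stage pipeline."""
    mel = mel_transform(noisy_waveform)   # (ch, n_mels, frames)
    clean_mel = translation_model(mel)    # stage (i): spectrogram domain
    spec = inverse_mel(clean_mel)         # approximate linear spectrogram
    return griffin_lim(spec)              # stage (ii): back to a waveform

if __name__ == "__main__":
    noisy = torch.randn(1, SAMPLE_RATE)   # one second of dummy input
    print(enhance(noisy).shape)
```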
Related papers
- Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription [13.960714900433269]
Sheet Music Transformer is the first end-to-end OMR model designed to transcribe complex musical scores without relying solely on monophonic strategies.
Our model has been tested on two polyphonic music datasets and has proven capable of handling these intricate music structures effectively.
arXiv Detail & Related papers (2024-02-12T11:52:21Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
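A rough sketch of the conditioning interface this describes, assuming a frozen CLIP image encoder from Hugging Face `transformers` and a toy denoiser: the frame embedding serves as the conditioning vector of the diffusion model. `ToyDenoiser` and all its sizes are hypothetical stand-ins, not the paper's architecture.

```python
# Sketch: a frozen CLIP image encoder supplies the conditioning vector
# for a diffusion denoiser (toy stand-in, not CLIPSonic's network).
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class ToyDenoiser(nn.Module):
    """Predicts the noise in an audio latent, conditioned on a CLIP embedding."""
    def __init__(self, latent_dim=512, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, latent_dim))

    def forward(self, noisy_latent, t, cond):
        # Concatenate latent, diffusion timestep, and the image embedding.
        t = t.float().unsqueeze(-1)
        return self.net(torch.cat([noisy_latent, t, cond], dim=-1))

frame = Image.new("RGB", (224, 224))          # stand-in video frame
inputs = processor(images=frame, return_tensors="pt")
with torch.no_grad():
    cond = clip.get_image_features(**inputs)  # (1, 512) conditioning vector

eps_hat = ToyDenoiser()(torch.randn(1, 512), torch.tensor([10]), cond)
print(eps_hat.shape)
```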
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms [2.3204178451683264]
In audio processing applications, there is high demand for generating expressive sounds from high-level representations.
Modern methods such as neural networks have inspired the development of expressive synthesizers based on compressed representations of musical instruments.
This study investigates stacked convolutional autoencoders for compressing time-frequency audio representations of a variety of instruments at a single pitch.
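A minimal stacked convolutional autoencoder over log-mel-spectrogram patches, in the spirit of this study; the layer sizes and the (80 mel bins x 64 frames) patch shape are illustrative assumptions, not the paper's configuration.

```python
# Stacked convolutional autoencoder for log-mel-spectrogram patches.
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: stacked strided convolutions compress the input patch.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1))
        # Decoder: transposed convolutions mirror the encoder.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SpectrogramAutoencoder()
logmel = torch.randn(8, 1, 80, 64)            # batch of log-mel patches
recon = model(logmel)
loss = nn.functional.mse_loss(recon, logmel)  # reconstruction objective
print(recon.shape, loss.item())
```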
arXiv Detail & Related papers (2023-01-18T17:19:04Z)
- Multi-modal Conditional Bounding Box Regression for Music Score Following [7.360807642941713]
This paper addresses the problem of sheet-image-based online audio-to-score alignment, also known as score following.
A conditional neural network architecture is proposed that directly predicts the x, y coordinates of the matching position in a complete score sheet image at each point in time of a given musical performance.
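A toy sketch of that conditional-regression idea: one branch encodes the full score-sheet image, another summarizes the current audio excerpt, and a joint head regresses normalized (x, y) coordinates. All shapes and layer sizes are assumptions for illustration, not the paper's architecture.

```python
# Toy score follower: regress the played (x, y) position on a score sheet,
# conditioned on a spectrogram excerpt of the live performance.
import torch
import torch.nn as nn

class ScoreFollower(nn.Module):
    def __init__(self):
        super().__init__()
        # Image branch: downsample the grayscale score-sheet image.
        self.image_net = nn.Sequential(
            nn.Conv2d(1, 8, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(8, 16, 5, stride=4, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())      # -> (B, 16 * 4 * 4)
        # Audio branch: summarize an (80 mel bins x frames) excerpt.
        self.audio_net = nn.Sequential(
            nn.Conv1d(80, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())      # -> (B, 32)
        # Joint head: regress normalized (x, y) in [0, 1].
        self.head = nn.Sequential(
            nn.Linear(16 * 4 * 4 + 32, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Sigmoid())

    def forward(self, score_image, audio_excerpt):
        feats = torch.cat([self.image_net(score_image),
                           self.audio_net(audio_excerpt)], dim=-1)
        return self.head(feats)                         # (B, 2) = (x, y)

xy = ScoreFollower()(torch.randn(1, 1, 512, 384), torch.randn(1, 80, 40))
print(xy)  # predicted normalized position on the sheet
```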
arXiv Detail & Related papers (2021-05-10T12:43:35Z)
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
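The control signals such a model consumes, per-frame f0 and loudness, can be extracted as sketched below. librosa's pYIN and RMS-in-dB are common choices used here for illustration; they are not necessarily the extractors used in the paper.

```python
# Extract frame-wise f0 and loudness curves from a recording.
import librosa
import numpy as np

def extract_controls(path, sr=16000, hop=256):
    y, sr = librosa.load(path, sr=sr)
    # Frame-wise fundamental frequency via probabilistic YIN (NaN = unvoiced).
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
        sr=sr, hop_length=hop)
    # Loudness approximated by RMS energy converted to decibels.
    loudness_db = librosa.amplitude_to_db(
        librosa.feature.rms(y=y, hop_length=hop)[0])
    return np.nan_to_num(f0), loudness_db

f0, loudness = extract_controls(librosa.example("trumpet"))
print(f0.shape, loudness.shape)
```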
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
- Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder [29.63675159839434]
Flow-based neural vocoders have shown significant improvement in real-time speech generation tasks.
We propose audio dequantization methods for flow-based neural vocoders to enable high-fidelity audio generation.
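In its simplest, uniform form, dequantization adds noise within each quantization bin so the flow is trained on continuous values rather than point masses; the paper proposes refinements of this baseline idea. A minimal sketch:

```python
# Uniform dequantization: spread each discrete 16-bit sample across its
# quantization bin before normalizing to [-1, 1).
import torch

def uniform_dequantize(int16_audio: torch.Tensor) -> torch.Tensor:
    x = int16_audio.to(torch.float32)
    x = x + torch.rand_like(x)                  # noise within each bin
    return (x + 32768.0) / 65536.0 * 2.0 - 1.0  # map to [-1, 1)

quantized = torch.randint(-32768, 32767, (16000,), dtype=torch.int16)
dequantized = uniform_dequantize(quantized)
print(dequantized.min().item(), dequantized.max().item())
```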
arXiv Detail & Related papers (2020-08-16T09:37:18Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model trained for automatic speech recognition with extracted melody features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model that operates on the raw waveform and runs in real time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
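A toy waveform-domain encoder-decoder with skip connections, echoing that description; the real model is causal and far larger, so treat this as an illustrative sketch only.

```python
# Toy raw-waveform enhancer: strided 1-D conv encoder, transposed-conv
# decoder, and additive skip connections between matching scales.
import torch
import torch.nn as nn

class WaveEnhancer(nn.Module):
    def __init__(self, depth=3, base=16):
        super().__init__()
        self.encoders, self.decoders = nn.ModuleList(), nn.ModuleList()
        ch_in = 1
        for i in range(depth):
            ch_out = base * 2 ** i
            self.encoders.append(nn.Sequential(
                nn.Conv1d(ch_in, ch_out, kernel_size=8, stride=4, padding=2),
                nn.ReLU()))
            ch_in = ch_out
        for i in reversed(range(depth)):
            ch_out = 1 if i == 0 else base * 2 ** (i - 1)
            self.decoders.append(nn.ConvTranspose1d(
                ch_in, ch_out, kernel_size=8, stride=4, padding=2))
            ch_in = ch_out

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:
                skips.append(x)       # keep activations for skip connections
        for dec in self.decoders:
            x = dec(x)
            if skips:
                x = x + skips.pop()   # skip connection at the matching scale
        return x

noisy = torch.randn(1, 1, 16384)      # about one second at 16 kHz
print(WaveEnhancer()(noisy).shape)    # same length as the input
```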
arXiv Detail & Related papers (2020-06-23T09:19:13Z)