Exploiting Time-Frequency Conformers for Music Audio Enhancement
- URL: http://arxiv.org/abs/2308.12599v1
- Date: Thu, 24 Aug 2023 06:56:54 GMT
- Title: Exploiting Time-Frequency Conformers for Music Audio Enhancement
- Authors: Yunkee Chae, Junghyun Koo, Sungho Lee and Kyogu Lee
- Abstract summary: We propose a music enhancement system based on the Conformer architecture.
Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task.
- Score: 21.243039524049614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the proliferation of video platforms on the internet, recording musical
performances by mobile devices has become commonplace. However, these
recordings often suffer from degradation such as noise and reverberation, which
negatively impact the listening experience. Consequently, the need for music
audio enhancement (referred to simply as music enhancement hereafter), i.e.,
transforming degraded audio recordings into pristine, high-quality music, has
grown as a means of improving the auditory experience. To
address this issue, we propose a music enhancement system based on the
Conformer architecture that has demonstrated outstanding performance in speech
enhancement tasks. Our approach explores the attention mechanisms of the
Conformer and examines their performance to discover the best approach for the
music enhancement task. Our experimental results show that our proposed model
achieves state-of-the-art performance on single-stem music enhancement.
Furthermore, our system can perform general music enhancement with multi-track
mixtures, which has not been examined in previous work.
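To make the architecture concrete, below is a minimal PyTorch sketch of a single Conformer block applied along the time axis of a spectrogram; the module names, dimensions, and wiring are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of one Conformer block over spectrogram frames
# (batch, time, dim). Hyperparameters and wiring are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    def __init__(self, dim, mult=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * mult),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(dim * mult, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class ConvModule(nn.Module):
    def __init__(self, dim, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)  # doubled for GLU
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, 1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                      # x: (B, T, D)
        y = self.norm(x).transpose(1, 2)       # (B, D, T) for conv layers
        y = nn.functional.glu(self.pointwise1(y), dim=1)
        y = self.pointwise2(torch.relu(self.bn(self.depthwise(y))))
        return x + self.dropout(y.transpose(1, 2))


class ConformerBlock(nn.Module):
    """Half-step FFN -> self-attention -> conv module -> half-step FFN."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.ff1 = FeedForward(dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.ff2 = FeedForward(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, T, D)
        x = x + 0.5 * self.ff1(x)              # half-step FFN, Conformer style
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)                       # conv module has its own residual
        x = x + 0.5 * self.ff2(x)
        return self.out_norm(x)


# Toy usage: 2 sequences of 100 spectrogram frames, already projected
# to a 256-dim model space by an (assumed) input layer.
frames = torch.randn(2, 100, 256)
enhanced = ConformerBlock()(frames)
print(enhanced.shape)  # torch.Size([2, 100, 256])
```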
Related papers
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z)
- AudioScenic: Audio-Driven Video Scene Editing [55.098754835213995]
We introduce AudioScenic, an audio-driven framework designed for video scene editing.
AudioScenic integrates audio semantics into the visual scene through a temporal-aware audio semantic injection process.
First, we present an audio Magnitude Modulator module that adjusts the temporal dynamics of the scene in response to changes in audio magnitude.
Second, an audio Frequency Fuser module ensures temporal consistency by aligning the frequency of the audio with the dynamics of the video scenes.
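As a rough, purely illustrative sketch of magnitude-driven modulation (not AudioScenic's actual module), one can scale per-frame video features by a normalized RMS envelope of the audio:

```python
# Hypothetical sketch: modulate video frame features with a normalized
# per-frame RMS envelope of the soundtrack, so louder audio produces a
# stronger effect. An illustration of the idea only.
import torch

def rms_envelope(audio: torch.Tensor, hop: int) -> torch.Tensor:
    """audio: (samples,) -> one RMS value per hop-sized frame."""
    frames = audio[: len(audio) // hop * hop].reshape(-1, hop)
    return frames.pow(2).mean(dim=1).sqrt()

audio = torch.randn(16000)                  # 1 s of fake audio at 16 kHz
video_feats = torch.randn(100, 512)         # 100 frames of video features
env = rms_envelope(audio, hop=160)          # 100 values, matching the video
env = env / (env.max() + 1e-8)              # normalize to [0, 1]
modulated = video_feats * env.unsqueeze(1)  # magnitude-driven modulation
```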
arXiv Detail & Related papers (2024-04-25T12:55:58Z)
- Spectrogram-Based Detection of Auto-Tuned Vocals in Music Recordings [9.646498710102174]
This study introduces a data-driven approach leveraging triplet networks for the detection of Auto-Tuned songs.
The experimental results demonstrate the superiority of the proposed method in both accuracy and robustness compared to Rawnet2, an end-to-end model proposed for anti-spoofing.
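A minimal sketch of the triplet-network idea, with a placeholder encoder and toy tensors standing in for real spectrogram excerpts:

```python
# Toy triplet training step: pull embeddings of same-class excerpts
# (e.g., both Auto-Tuned) together and push different-class excerpts
# apart. The encoder and data shapes are placeholders.
import torch
import torch.nn as nn

encoder = nn.Sequential(                    # stand-in embedding network
    nn.Flatten(), nn.Linear(128 * 64, 256), nn.ReLU(), nn.Linear(256, 64)
)
criterion = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# Fake mel-spectrogram excerpts: (batch, mel_bins, frames)
anchor = torch.randn(8, 128, 64)            # Auto-Tuned
positive = torch.randn(8, 128, 64)          # also Auto-Tuned
negative = torch.randn(8, 128, 64)          # untouched vocals

loss = criterion(encoder(anchor), encoder(positive), encoder(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```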
arXiv Detail & Related papers (2024-03-08T15:19:26Z)
- Music Auto-Tagging with Robust Music Representation Learned via Domain Adversarial Training [18.71152526968065]
Existing models in Music Information Retrieval (MIR) struggle with real-world noise such as environmental and speech sounds in multimedia content.
This study proposes a method inspired by speech-related tasks to enhance music auto-tagging performance in noisy settings.
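Domain adversarial training is commonly implemented with a gradient reversal layer, so the tagger's features become uninformative about whether the input was clean or noisy; a minimal sketch with toy shapes assumed, not the paper's network:

```python
# Gradient-reversal sketch: the domain head learns to tell clean from
# noisy inputs, while reversed gradients push the encoder to erase that
# distinction. All shapes and heads are toy stand-ins.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None    # flip the gradient sign

encoder = nn.Linear(128, 64)       # stand-in music encoder
tag_head = nn.Linear(64, 50)       # 50 music tags
dom_head = nn.Linear(64, 2)        # clean vs. noisy domain

x = torch.randn(16, 128)           # fake audio features
tags = torch.randint(0, 2, (16, 50)).float()
domain = torch.randint(0, 2, (16,))

z = encoder(x)
tag_loss = nn.functional.binary_cross_entropy_with_logits(tag_head(z), tags)
dom_loss = nn.functional.cross_entropy(dom_head(GradReverse.apply(z, 1.0)), domain)
(tag_loss + dom_loss).backward()   # encoder receives reversed domain gradients
```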
arXiv Detail & Related papers (2024-01-27T06:56:51Z)
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z)
- Audiovisual Singing Voice Separation [25.862550744570324]
The video model takes mouth movement as input and fuses it into the feature embeddings of an audio-based separation framework.
We create two audiovisual singing performance datasets for training and evaluation.
The proposed method outperforms audio-based methods in terms of separation quality on most test recordings.
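One simple way such fusion can be realized (an assumption for illustration, not the paper's exact design) is to concatenate per-frame mouth-movement embeddings with the audio features inside the separator:

```python
# Illustrative mid-network fusion: concatenate per-frame mouth-movement
# embeddings with audio features before the separator's later layers.
# Dimensions and module names are assumptions, not the paper's code.
import torch
import torch.nn as nn

audio_feats = torch.randn(1, 100, 256)   # (batch, frames, audio feature dim)
mouth_feats = torch.randn(1, 100, 64)    # lip-motion embeddings, same frame rate

fuse = nn.Linear(256 + 64, 256)          # project fused features back down
separator_tail = nn.GRU(256, 256, batch_first=True)

fused = fuse(torch.cat([audio_feats, mouth_feats], dim=-1))
mask_feats, _ = separator_tail(fused)    # goes on to predict a vocal mask
```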
arXiv Detail & Related papers (2021-07-01T06:04:53Z)
- Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation [96.18178553315472]
We propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio.
We integrate both stereo generation and source separation into a unified framework, Sep-Stereo.
arXiv Detail & Related papers (2020-07-20T06:20:26Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
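A toy rendition of the keypoint-graph idea, with an illustrative skeleton and fusion rule rather than the authors' network:

```python
# Toy keypoint-graph step: one normalized graph convolution over body
# keypoints, then fuse pooled motion features with audio features.
# Skeleton edges, dimensions, and fusion are illustrative assumptions.
import torch
import torch.nn as nn

K = 17                                   # COCO-style body keypoints
A = torch.eye(K)                         # adjacency with self-loops...
A[5, 7] = A[7, 5] = 1.0                  # ...plus e.g. shoulder-elbow
A[7, 9] = A[9, 7] = 1.0                  # and elbow-wrist edges
A_hat = A / A.sum(dim=1, keepdim=True)   # row-normalized adjacency

W = nn.Linear(2, 64)                     # lift (x, y) coords to features
keypoints = torch.randn(1, K, 2)         # one frame of 2D keypoints
body = torch.relu(A_hat @ W(keypoints))  # graph conv: mix neighboring joints
body = body.mean(dim=1)                  # pool over joints -> (1, 64)

audio = torch.randn(1, 128)              # audio feature for the same frame
fused = torch.cat([body, audio], dim=-1) # input to an audio-visual fusion head
```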
arXiv Detail & Related papers (2020-04-20T17:53:46Z) - Audio Impairment Recognition Using a Correlation-Based Feature
Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
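The core transform can be sketched directly: treat the upper triangle of the pairwise feature correlation matrix as the compact representation (the feature set and pooling below are assumed for illustration, not the paper's exact recipe):

```python
# Sketch of a correlation-based representation: describe a recording by
# the pairwise correlations of its hand-crafted features over time.
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((12, 500))   # 12 features x 500 time frames

corr = np.corrcoef(features)                # (12, 12) Pearson correlations
iu = np.triu_indices_from(corr, k=1)        # keep each feature pair once
representation = corr[iu]                   # compact vector of 66 values
print(representation.shape)                 # (66,)
```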
arXiv Detail & Related papers (2020-03-22T13:34:37Z) - Audiovisual SlowFast Networks for Video Recognition [140.08143162600354]
We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception.
We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts.
We report results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features.
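Multi-layer fusion of this kind can be sketched as lateral connections injecting audio into the visual pathway at several depths (a schematic assumption, not the released AVSlowFast code):

```python
# Schematic multi-layer audio-visual fusion: audio features are injected
# into the visual pathway at several depths via lateral projections.
# Layer sizes and the fusion rule are assumptions, not AVSlowFast itself.
import torch
import torch.nn as nn

visual_layers = nn.ModuleList([nn.Linear(256, 256) for _ in range(3)])
audio_layers = nn.ModuleList([nn.Linear(128, 128) for _ in range(3)])
laterals = nn.ModuleList([nn.Linear(128, 256) for _ in range(3)])

v = torch.randn(4, 256)                  # visual pathway state
a = torch.randn(4, 128)                  # audio pathway state
for vis, aud, lat in zip(visual_layers, audio_layers, laterals):
    a = torch.relu(aud(a))               # advance the audio pathway
    v = torch.relu(vis(v)) + lat(a)      # fuse audio into vision at this depth
# v now carries hierarchical audiovisual features for classification
```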
arXiv Detail & Related papers (2020-01-23T18:59:46Z)