Source Separation and Depthwise Separable Convolutions for Computer
Audition
- URL: http://arxiv.org/abs/2012.03359v1
- Date: Sun, 6 Dec 2020 19:30:26 GMT
- Title: Source Separation and Depthwise Separable Convolutions for Computer
Audition
- Authors: Gabriel Mersy and Jin Hong Kuan
- Abstract summary: We train a depthwise separable convolutional neural network on a challenging electronic dance music data set.
It is shown that source separation improves classification performance in a limited-data setting compared to the standard single spectrogram approach.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given recent advances in deep music source separation, we propose a feature
representation method that combines source separation with a state-of-the-art
representation learning technique that is suitably repurposed for computer
audition (i.e. machine listening). We train a depthwise separable convolutional
neural network on a challenging electronic dance music (EDM) data set and
compare its performance to convolutional neural networks operating on both
source separated and standard spectrograms. It is shown that source separation
improves classification performance in a limited-data setting compared to the
standard single spectrogram approach.
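As a rough illustration of the approach described in the abstract, the sketch below (not the authors' released code) stacks magnitude spectrograms of source-separated stems as input channels and classifies them with a small depthwise separable CNN; the stem count, STFT settings, layer sizes, and class count are placeholder assumptions, and PyTorch is used only for convenience.

```python
# Minimal sketch, assuming stems already come from a separator such as
# Demucs or Spleeter. All hyperparameters below are illustrative, not the
# values used in the paper.
import torch
import torch.nn as nn

def stem_spectrograms(stems: torch.Tensor, n_fft: int = 1024, hop: int = 512) -> torch.Tensor:
    """stems: (num_stems, samples) waveforms. Returns (num_stems, freq, time) magnitudes."""
    window = torch.hann_window(n_fft)
    specs = [torch.stft(s, n_fft, hop_length=hop, window=window,
                        return_complex=True).abs() for s in stems]
    return torch.stack(specs)  # one channel per separated source

class DepthwiseSeparableConv(nn.Module):
    """Depthwise conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch: int, out_ch: int, kernel: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel, padding=kernel // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class GenreClassifier(nn.Module):
    """Toy classifier over stacked stem spectrograms; num_classes is a placeholder."""
    def __init__(self, num_stems: int = 4, num_classes: int = 10):
        super().__init__()
        self.net = nn.Sequential(
            DepthwiseSeparableConv(num_stems, 32), nn.ReLU(),
            DepthwiseSeparableConv(32, 64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_classes),
        )

    def forward(self, spec_batch):  # (batch, num_stems, freq, time)
        return self.net(spec_batch)

# Example: four separated stems of 3 seconds at 22.05 kHz -> class logits.
stems = torch.randn(4, 3 * 22050)
x = stem_spectrograms(stems).unsqueeze(0)  # (1, 4, freq, time)
logits = GenreClassifier()(x)
```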
Related papers
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that devotes the main training parameters to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - High-Quality Visually-Guided Sound Separation from Diverse Categories [56.92841782969847]
DAVIS is a Diffusion-based Audio-VIsual Separation framework.
It synthesizes separated sounds directly from Gaussian noise, conditioned on both the audio mixture and the visual information.
We compare DAVIS to existing state-of-the-art discriminative audio-visual separation methods on the AVE and MUSIC datasets.
arXiv Detail & Related papers (2023-07-31T19:41:49Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive
Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - AudioSlots: A slot-centric generative model for audio separation [26.51135156983783]
We present AudioSlots, a slot-centric generative model for blind source separation in the audio domain.
We train the model in an end-to-end manner using a permutation-equivariant loss function.
Our results on Libri2Mix speech separation constitute a proof of concept that this approach shows promise.
arXiv Detail & Related papers (2023-05-09T16:28:07Z) - Hybrid Y-Net Architecture for Singing Voice Separation [0.0]
The proposed architecture performs end-to-end hybrid source separation by extracting features from both spectrogram and waveform domains.
Inspired by the U-Net architecture, Y-Net predicts a spectrogram mask to separate vocal sources from a mixture signal.
arXiv Detail & Related papers (2023-03-05T07:54:49Z) - Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z) - Training a Deep Neural Network via Policy Gradients for Blind Source
Separation in Polyphonic Music Recordings [1.933681537640272]
We propose a method for the blind separation of sounds of musical instruments in audio signals.
We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics.
Our algorithm yields high-quality results with particularly low interference on a variety of different audio samples.
arXiv Detail & Related papers (2021-07-09T06:17:04Z) - Deep Convolutional and Recurrent Networks for Polyphonic Instrument
Classification from Monophonic Raw Audio Waveforms [30.3491261167433]
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms.
The use of deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification purposes.
We attempt to recognize musical instruments in polyphonic audio by only feeding their raw waveforms into deep learning models.
arXiv Detail & Related papers (2021-02-13T13:44:46Z) - Fast accuracy estimation of deep learning based multi-class musical
source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - Spatial and spectral deep attention fusion for multi-channel speech
separation using deep embedding features [60.20150317299749]
Multi-channel deep clustering (MDC) has achieved good performance for speech separation.
We propose a deep attention fusion method that dynamically weights the spectral and spatial features and combines them deeply (a generic sketch of this kind of attention fusion follows the list below).
Experimental results show that the proposed method outperforms the MDC baseline and even surpasses the ideal binary mask (IBM).
arXiv Detail & Related papers (2020-02-05T03:49:39Z)
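For readers unfamiliar with attention-based fusion of two feature streams, here is a minimal generic sketch referenced in the entry above; the learned softmax gating and layer sizes are illustrative assumptions and do not reproduce the architecture of the cited paper.

```python
# Generic attention fusion of two embeddings (e.g. spectral and spatial);
# an assumed illustration, not the cited paper's model.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # scores each stream's embedding

    def forward(self, spectral: torch.Tensor, spatial: torch.Tensor) -> torch.Tensor:
        stacked = torch.stack([spectral, spatial], dim=1)    # (batch, 2, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # (batch, 2, 1) attention weights
        return (weights * stacked).sum(dim=1)                # weighted combination, (batch, dim)

# Example: fuse 128-dimensional spectral and spatial embeddings for a batch of 8.
fused = AttentionFusion(128)(torch.randn(8, 128), torch.randn(8, 128))
```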