Artificially Synthesising Data for Audio Classification and Segmentation
to Improve Speech and Music Detection in Radio Broadcast
- URL: http://arxiv.org/abs/2102.09959v1
- Date: Fri, 19 Feb 2021 14:47:05 GMT
- Authors: Satvik Venkatesh, David Moffat, Alexis Kirke, Gözel Shakeri, Stephen
Brewster, Jörg Fachner, Helen Odell-Miller, Alex Street, Nicolas Farina,
Sube Banerjee, and Eduardo Reck Miranda
- Abstract summary: We present a novel procedure that artificially synthesises data that resembles radio signals.
We trained a Convolutional Recurrent Neural Network (CRNN) on this synthesised data and outperformed state-of-the-art algorithms for music-speech detection.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Segmenting audio into homogeneous sections such as music and speech helps us
understand the content of audio. It is useful as a pre-processing step to
index, store, and modify audio recordings, radio broadcasts and TV programmes.
Deep learning models for segmentation are generally trained on copyrighted
material, which cannot be shared. Annotating these datasets is time-consuming
and expensive, which significantly slows down research progress. In
this study, we present a novel procedure that artificially synthesises data
that resembles radio signals. We replicate the workflow of a radio DJ in mixing
audio and investigate parameters like fade curves and audio ducking. We trained
a Convolutional Recurrent Neural Network (CRNN) on this synthesised data and
outperformed state-of-the-art algorithms for music-speech detection. This paper
demonstrates the data synthesis procedure as a highly effective technique to
generate large datasets to train deep neural networks for audio segmentation.
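
The abstract describes the synthesis procedure only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation: a speech clip is overlaid on a music bed, the music is "ducked" (attenuated) under the speech with a fade curve, and frame-level music/speech labels are produced alongside the mix. The sample rate, the 10 ms label resolution, the ducking depth, and the fade length below are all assumptions, not the paper's settings.

```python
# Sketch of radio-style data synthesis: speech over ducked music,
# with frame-level music/speech labels for segmentation training.
# All parameters here are illustrative assumptions.
import numpy as np

SR = 22050           # sample rate (assumption)
FRAME = SR // 100    # label resolution of ~10 ms per frame (assumption)

def fade(n, curve="linear"):
    """Fade-in envelope of n samples; 'curve' mimics a DJ fade curve."""
    t = np.linspace(0.0, 1.0, n)
    return t if curve == "linear" else np.sin(0.5 * np.pi * t)  # sine fade

def synthesise_example(music, speech, speech_start, duck_db=-12.0):
    """Overlay speech on a music bed with ducking; return mix and labels.

    Assumes the speech sits far enough inside the music bed to leave
    room for the ducking ramps on both sides.
    """
    mix = music.astype(float).copy()
    s0, s1 = speech_start, speech_start + len(speech)
    gain = 10.0 ** (duck_db / 20.0)            # linear ducking gain
    ramp = fade(SR // 2, curve="sine")         # 0.5 s duck transition
    duck = np.ones(len(mix))
    duck[s0:s1] = gain
    duck[s0 - len(ramp):s0] = 1.0 + (gain - 1.0) * ramp   # duck down
    duck[s1:s1 + len(ramp)] = gain + (1.0 - gain) * ramp  # duck back up
    mix *= duck
    mix[s0:s1] += speech
    # Frame-level targets: [music, speech] activity per frame.
    n_frames = len(mix) // FRAME
    labels = np.zeros((n_frames, 2))
    labels[:, 0] = 1.0                          # music active throughout
    labels[s0 // FRAME:s1 // FRAME, 1] = 1.0    # speech active where overlaid
    return mix, labels

# Usage with stand-in signals (real use would draw from audio libraries):
music = 0.3 * np.random.randn(30 * SR)
speech = 0.3 * np.random.randn(5 * SR)
mix, labels = synthesise_example(music, speech, speech_start=10 * SR)
```

Repeated over large pools of speech and music clips, with randomised fade curves, ducking depths, and overlap positions, this kind of generator can produce arbitrarily large labelled datasets of the sort the paper uses to train its CRNN.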
Related papers
- Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue
State Tracking [19.754211231250544]
We develop cascading and end-to-end models, train them with our synthetic audio dataset, and test them on actual human speech data.
Experimental results showed that models trained solely on synthetic datasets can generalize their performance to human voice data.
arXiv Detail & Related papers (2023-12-04T12:25:46Z)
- Self-Supervised Contrastive Learning for Robust Audio-Sheet Music
Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
In this work, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
arXiv Detail & Related papers (2023-09-21T14:54:48Z)
- Large-scale unsupervised audio pre-training for video-to-speech
synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Unsupervised Learning of Deep Features for Music Segmentation [8.528384027684192]
Music segmentation is a problem of identifying boundaries between, and labeling, distinct music segments.
The performance of a range of music segmentation algorithms has been dependent on the audio features chosen to represent the audio.
In this work, unsupervised training of deep feature embeddings using convolutional neural networks (CNNs) is explored for music segmentation.
arXiv Detail & Related papers (2021-08-30T01:55:44Z)
- End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z)
- Audio-based Near-Duplicate Video Retrieval with Audio Similarity
Learning [19.730467023817123]
We propose the Audio Similarity Learning (AuSiL) approach that effectively captures temporal patterns of audio similarity between video pairs.
We train our network following a triplet generation process and optimize the triplet loss function; a minimal sketch of this triplet-loss setup appears after this list.
The proposed approach achieves very competitive results compared to three state-of-the-art methods.
arXiv Detail & Related papers (2020-10-17T08:12:18Z)
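
As referenced in the AuSiL entry above, several of these papers learn audio embeddings with a triplet loss. The sketch below is a generic illustration of that idea, not the AuSiL implementation: the encoder architecture, embedding size, margin, and learning rate are all assumptions.

```python
# Generic triplet-loss training step for audio similarity embeddings
# (illustrates the approach named in the AuSiL entry; architecture and
# hyperparameters are assumptions, not that paper's actual model).
import torch
import torch.nn as nn

class AudioEmbedder(nn.Module):
    """Tiny CNN mapping a log-mel spectrogram to an L2-normalised embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling over mel/time
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x):                     # x: (batch, 1, n_mels, time)
        z = self.fc(self.conv(x).flatten(1))
        return nn.functional.normalize(z, dim=1)

model = AudioEmbedder()
loss_fn = nn.TripletMarginLoss(margin=0.2)    # margin is an assumption
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

# Anchor and positive share content (e.g. audio from the same video);
# the negative is drawn from an unrelated recording.
anchor = torch.randn(8, 1, 64, 200)           # stand-in spectrogram batch
positive = torch.randn(8, 1, 64, 200)
negative = torch.randn(8, 1, 64, 200)

loss = loss_fn(model(anchor), model(positive), model(negative))
opt.zero_grad()
loss.backward()
opt.step()
```

The triplet objective pulls embeddings of matching audio together and pushes non-matching audio apart, so at retrieval time similarity reduces to a simple distance between embeddings.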
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.