Automatic Lyrics Transcription using Dilated Convolutional Neural
Networks with Self-Attention
- URL: http://arxiv.org/abs/2007.06486v2
- Date: Fri, 24 Jul 2020 15:26:48 GMT
- Title: Automatic Lyrics Transcription using Dilated Convolutional Neural
Networks with Self-Attention
- Authors: Emir Demirel, Sven Ahlback, Simon Dixon
- Abstract summary: We have trained convolutional time-delay neural networks with self-attention on monophonic karaoke recordings.
Our system achieves a notable improvement over the state of the art in automatic lyrics transcription.
- Score: 11.232541198648159
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech recognition is a well-developed research field whose current
state-of-the-art systems are deployed in many software applications, yet no
comparably robust system exists today for recognizing words and sentences from
singing voice. This paper proposes a complete pipeline for this task, commonly
referred to as automatic lyrics transcription (ALT). For the acoustic model, we
train convolutional time-delay neural networks with self-attention on monophonic
karaoke recordings using a sequence classification objective. The dataset used
in this study, DAMP - Sing! 300x30x2 [1], is filtered to retain only songs with
English lyrics. Different language models are tested, including MaxEnt- and
Recurrent Neural Network-based methods trained on the lyrics of English-language
pop songs. An in-depth analysis of the self-attention mechanism is conducted
while tuning its context width and the number of attention heads. With the best
settings, our system achieves a notable improvement over the state of the art in
ALT and provides a new baseline for the task.
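To make the acoustic-model description concrete, here is a minimal PyTorch sketch of the kind of building block the title names: a TDNN-style dilated 1-D convolution followed by multi-head self-attention. This is an illustrative approximation, not the authors' exact architecture; the feature dimension, hidden size, kernel size, dilation, and head count below are placeholder assumptions (the paper itself tunes the attention context width and the number of heads).

```python
# Illustrative sketch only: one dilated-convolution (TDNN-style) layer followed
# by multi-head self-attention, the two components named in the paper's title.
# All layer sizes and hyperparameters are assumed for the example.
import torch
import torch.nn as nn

class DilatedConvSelfAttentionBlock(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=256, kernel_size=3,
                 dilation=3, num_heads=4):
        super().__init__()
        # A TDNN layer is equivalent to a 1-D convolution with dilation,
        # which widens the temporal receptive field without extra parameters.
        self.tdnn = nn.Conv1d(feat_dim, hidden_dim, kernel_size,
                              dilation=dilation,
                              padding=dilation * (kernel_size - 1) // 2)
        self.norm = nn.LayerNorm(hidden_dim)
        # Self-attention over the convolutional features; num_heads is one of
        # the hyperparameters the paper reports analyzing.
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          batch_first=True)

    def forward(self, x):  # x: (batch, time, feat_dim)
        h = self.tdnn(x.transpose(1, 2)).transpose(1, 2)  # (batch, time, hidden)
        h = torch.relu(self.norm(h))
        out, _ = self.attn(h, h, h)  # attend over all frames in the context
        return out

# Example: a batch of 4 utterances, 200 frames of 80-dim acoustic features.
feats = torch.randn(4, 200, 80)
print(DilatedConvSelfAttentionBlock()(feats).shape)  # torch.Size([4, 200, 256])
```

In the paper's setting, a stack of such blocks would feed a sequence-classification objective for the acoustic model; the sketch above only shows the per-layer data flow.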
Related papers
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - DeepFry: Identifying Vocal Fry Using Deep Neural Networks [16.489251286870704]
Vocal fry or creaky voice refers to a voice quality characterized by irregular glottal opening and low pitch.
Due to its irregular periodicity, creaky voice challenges automatic speech processing and recognition systems.
This paper proposes a deep learning model to detect creaky voice in fluent speech.
arXiv Detail & Related papers (2022-03-31T13:23:24Z) - Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task, singing voice beautifying (SVB).
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z) - Youling: an AI-Assisted Lyrics Creation System [72.00418962906083]
This paper demonstrates Youling, an AI-assisted lyrics creation system designed to collaborate with music creators.
In the lyrics generation process, Youling supports the traditional one-pass full-text generation mode as well as an interactive generation mode.
The system also provides a revision module which enables users to revise undesired sentences or words of lyrics repeatedly.
arXiv Detail & Related papers (2022-01-18T03:57:04Z) - MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics
Transcription [8.669338893753885]
This paper makes several contributions to automatic lyrics transcription (ALT) research.
Our main contribution is a novel variant of the Multistreaming Time-Delay Neural Network (MTDNN) architecture, called MSTRE-Net.
We present a new test set with a considerably larger size and a higher musical variability compared to the existing datasets used in ALT.
arXiv Detail & Related papers (2021-08-05T13:59:11Z) - Acoustics Based Intent Recognition Using Discovered Phonetic Units for
Low Resource Languages [51.0542215642794]
We propose a novel acoustics-based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic and Romance, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z) - Unsupervised Pattern Discovery from Thematic Speech Archives Based on
Multilingual Bottleneck Features [41.951988293049205]
We propose a two-stage approach, which comprises unsupervised acoustic modeling and decoding, followed by pattern mining in acoustic unit sequences.
The proposed system is able to effectively extract topic-related words and phrases from the lecture recordings on MIT OpenCourseWare.
arXiv Detail & Related papers (2020-11-03T20:06:48Z) - Melody-Conditioned Lyrics Generation with SeqGANs [81.2302502902865]
We propose an end-to-end melody-conditioned lyrics generation system based on Sequence Generative Adversarial Networks (SeqGAN).
We show that the input conditions have no negative impact on the evaluation metrics while enabling the network to produce more meaningful results.
arXiv Detail & Related papers (2020-10-28T02:35:40Z) - DeepSinger: Singing Voice Synthesis with Data Mined From the Web [194.10598657846145]
DeepSinger is a multi-lingual singing voice synthesis system built from scratch using singing training data mined from music websites.
We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages.
arXiv Detail & Related papers (2020-07-09T07:00:48Z)