MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics
Transcription
- URL: http://arxiv.org/abs/2108.02625v1
- Date: Thu, 5 Aug 2021 13:59:11 GMT
- Title: MSTRE-Net: Multistreaming Acoustic Modeling for Automatic Lyrics
Transcription
- Authors: Emir Demirel, Sven Ahlbäck, Simon Dixon
- Abstract summary: This paper makes several contributions to automatic lyrics transcription (ALT) research.
Our main contribution is a novel variant of the Multistreaming Time-Delay Neural Network (MTDNN) architecture, called MSTRE-Net.
We present a new test set with a considerably larger size and a higher musical variability compared to the existing datasets used in ALT.
- Score: 8.669338893753885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper makes several contributions to automatic lyrics transcription
(ALT) research. Our main contribution is a novel variant of the Multistreaming
Time-Delay Neural Network (MTDNN) architecture, called MSTRE-Net, which
processes temporal information using multiple parallel streams with varying
resolutions. This keeps the network more compact, and thus yields faster
inference and a higher recognition rate than using identical TDNN streams.
In addition, we propose two novel preprocessing steps prior to training the
acoustic model. First, we suggest using recordings from both monophonic and
polyphonic domains when training the acoustic model. Second, we tag
monophonic and polyphonic recordings with distinct labels to discriminate
non-vocal silence and music instances during alignment. Moreover, we present a
new test set with a considerably larger size and a higher musical variability
than the existing datasets used in the ALT literature, while maintaining the
gender balance of the singers. Our best-performing model sets the
state of the art in lyrics transcription by a large margin. For
reproducibility, we publicly share the identifiers to retrieve the data used in
this paper.
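The multistream idea above can be illustrated with a small, self-contained sketch. The following PyTorch code is a hypothetical toy rendition (not the paper's actual recipe) of parallel TDNN-style 1-D convolution streams operating at different temporal resolutions via their dilation factors, concatenated and projected back to a compact representation; the dimensions, stream count, and dilation values are illustrative assumptions, not the configuration reported in the paper.

```python
# Hypothetical multistream TDNN block: parallel 1-D convolution streams with
# different dilations (temporal resolutions) are merged and projected.
import torch
import torch.nn as nn


class MultistreamTDNNBlock(nn.Module):
    def __init__(self, in_dim: int = 80, stream_dim: int = 256,
                 dilations=(1, 2, 3)):
        super().__init__()
        # One TDNN-style stream per dilation; each stream spans a different
        # temporal context but preserves the input frame rate.
        self.streams = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_dim, stream_dim, kernel_size=3,
                          dilation=d, padding=d),
                nn.ReLU(),
                nn.BatchNorm1d(stream_dim),
            )
            for d in dilations
        ])
        # Project the concatenated streams back to a compact representation.
        self.proj = nn.Conv1d(stream_dim * len(dilations), stream_dim,
                              kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim, frames); every stream keeps the frame count.
        merged = torch.cat([stream(x) for stream in self.streams], dim=1)
        return self.proj(merged)


if __name__ == "__main__":
    feats = torch.randn(4, 80, 200)       # e.g. 80-dim features, 200 frames
    out = MultistreamTDNNBlock()(feats)   # -> torch.Size([4, 256, 200])
    print(out.shape)
```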
Related papers
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
We introduce C3LLM, a novel framework combining the three tasks of video-to-audio, audio-to-text, and text-to-audio.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation into one unified model.
arXiv Detail & Related papers (2024-05-25T09:10:12Z)
- Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription [19.228155694144995]
Timbre-Trap is a novel framework which unifies music transcription and audio reconstruction.
We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients.
We demonstrate that the framework leads to performance comparable to state-of-the-art instrument-agnostic transcription methods.
arXiv Detail & Related papers (2023-09-27T15:19:05Z)
- Multitrack Music Transcription with a Time-Frequency Perceiver [6.617487928813374]
Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously.
We propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription.
arXiv Detail & Related papers (2023-06-19T08:58:26Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
However, they often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- Efficient Neural Music Generation [42.39082326446739]
We present MeLoDy (M for music; L for LM; D for diffusion), an LM-guided diffusion model that generates music audio of state-of-the-art quality.
MeLoDy inherits the highest-level LM from MusicLM for semantic modeling, and applies a novel dual-path diffusion (DPD) model and an audio VAE-GAN to efficiently decode the conditioning semantic tokens into waveform.
arXiv Detail & Related papers (2023-05-25T05:02:35Z)
- Musical Voice Separation as Link Prediction: Modeling a Musical Perception Task as a Multi-Trajectory Tracking Problem [6.617487928813374]
This paper targets the perceptual task of separating the different interacting voices, i.e., monophonic melodic streams, in a polyphonic musical piece.
We model this task as a Multi-Trajectory Tracking (MTT) problem from discrete observations, i.e., notes in a pitch-time space.
Our approach builds a graph from a musical piece by creating one node for every note, and separates the melodic trajectories by predicting a link between two notes if they are consecutive in the same voice/stream (a minimal sketch of this link-prediction step is given after this list).
arXiv Detail & Related papers (2023-04-28T13:48:00Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR [77.82653227783447]
We propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network.
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
arXiv Detail & Related papers (2022-03-01T05:02:02Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- Automatic Lyrics Transcription using Dilated Convolutional Neural Networks with Self-Attention [11.232541198648159]
We have trained convolutional time-delay neural networks with self-attention on monophonic karaoke recordings.
Our system achieves a notable improvement over the state of the art in automatic lyrics transcription.
arXiv Detail & Related papers (2020-07-13T16:36:30Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
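As referenced in the Musical Voice Separation entry above, the link-prediction view of voice separation can be sketched in a few lines. The pairwise MLP scorer and greedy decoding below are illustrative assumptions only, not the model described in that paper (which learns links over a richer note graph).

```python
# Toy sketch of voice separation as link prediction: one node per note, and a
# link predicted between two notes when they are likely consecutive in the
# same voice. Scorer and decoding are assumptions made for illustration.
import torch
import torch.nn as nn


class LinkScorer(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, a, b) -> torch.Tensor:
        # Notes are (onset_time, midi_pitch); features are the time gap and
        # pitch interval between the candidate pair of notes.
        feats = torch.tensor([float(b[0] - a[0]), float(b[1] - a[1])])
        return torch.sigmoid(self.mlp(feats))


def predict_links(notes, scorer, threshold: float = 0.5):
    """Greedily link each note to its best-scoring later note (same voice)."""
    notes = sorted(notes)                              # sort by onset time
    links = []
    for i, a in enumerate(notes):
        candidates = [(float(scorer(a, b)), j)
                      for j, b in enumerate(notes) if b[0] > a[0]]
        if candidates:
            score, j = max(candidates)
            if score >= threshold:
                links.append((i, j))                   # note i -> note j
    return links


if __name__ == "__main__":
    piece = [(0.0, 60), (0.0, 67), (1.0, 62), (1.0, 69)]   # two toy voices
    print(predict_links(piece, LinkScorer()))
```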
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.