LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
- URL: http://arxiv.org/abs/2306.17103v4
- Date: Thu, 25 Jul 2024 06:15:20 GMT
- Title: LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT
- Authors: Le Zhuo, Ruibin Yuan, Jiahao Pan, Yinghao Ma, Yizhi LI, Ge Zhang, Si Liu, Roger Dannenberg, Jie Fu, Chenghua Lin, Emmanouil Benetos, Wei Xue, Yike Guo
- Abstract summary: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method.
We use Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model.
Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English.
- Score: 48.28624219567131
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method achieving state-of-the-art performance on various lyrics transcription datasets, even in challenging genres such as rock and metal. Our novel, training-free approach utilizes Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model. In the proposed method, Whisper functions as the "ear" by transcribing the audio, while GPT-4 serves as the "brain," acting as an annotator with a strong performance for contextualized output selection and correction. Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English and can effectively transcribe lyrics across multiple languages. Furthermore, we use LyricWhiz to create the first publicly available, large-scale, multilingual lyrics transcription dataset with a CC-BY-NC-SA copyright license, based on MTG-Jamendo, and offer a human-annotated subset for noise level estimation and evaluation. We anticipate that our proposed method and dataset will advance the development of multilingual lyrics transcription, a challenging and emerging task.
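As a concrete illustration of the pipeline the abstract describes, below is a minimal sketch of the Whisper-plus-LLM idea: Whisper produces several candidate transcriptions (the "ear"), and a chat LLM selects among them and corrects obvious errors (the "brain"). The prompt wording, sampling temperature, model names, and the file name song.mp3 are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a LyricWhiz-style pipeline, based only on the abstract:
# Whisper yields multiple hypotheses; GPT-4 performs contextualized
# output selection and correction. All settings below are assumptions.
import whisper             # pip install openai-whisper
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe_candidates(audio_path: str, n: int = 3) -> list[str]:
    """Run Whisper several times with sampling to get diverse hypotheses."""
    model = whisper.load_model("large")
    candidates = []
    for _ in range(n):
        # temperature > 0 makes repeated runs produce different hypotheses
        result = model.transcribe(audio_path, task="transcribe", temperature=0.4)
        candidates.append(result["text"].strip())
    return candidates

def select_and_correct(candidates: list[str]) -> str:
    """Ask a chat LLM to pick the most plausible lyrics and repair errors."""
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    prompt = (
        "The following are candidate transcriptions of the same song's lyrics.\n"
        f"{numbered}\n"
        "Choose the most plausible transcription, correct obvious recognition "
        "errors, and return only the corrected lyrics."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

lyrics = select_and_correct(transcribe_candidates("song.mp3"))
print(lyrics)
```

Running Whisper several times exposes its variance on hard genres such as rock and metal; the LLM's role, per the abstract, is to exploit lyrical context when selecting and repairing the best hypothesis.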
Related papers
- Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text [22.19230427358921]
This work investigates how to improve the performance of Whisper on under-represented languages.
We utilized easily accessible unpaired speech and text data and combined the language model GPT with Whisper on Kazakh.
We achieved more than 10% absolute WER reduction in multiple experiments.
arXiv Detail & Related papers (2024-08-10T13:39:13Z) - Jam-ALT: A Formatting-Aware Lyrics Transcription Benchmark [2.6297569393407416]
We introduce Jam-ALT, a new lyrics transcription benchmark based on the JamendoLyrics dataset.
First, it provides a complete revision of the transcripts, geared specifically towards ALT evaluation.
Second, it introduces a suite of evaluation metrics designed, unlike the traditional word error rate, to capture formatting-related phenomena.
arXiv Detail & Related papers (2023-11-23T13:13:48Z) - Controllable Emphasis with zero data for text-to-speech [57.12383531339368]
A simple but effective method to achieve emphasized speech is to increase the predicted duration of the emphasized word.
We show that this is significantly better than spectrogram modification techniques, improving naturalness by 7.3% and testers' correct identification of the emphasized word in a sentence by 40% on a reference female en-US voice.
arXiv Detail & Related papers (2023-07-13T21:06:23Z) - A Phoneme-Informed Neural Network Model for Note-Level Singing Transcription [11.951441023641975]
We propose a method of finding note onsets of singing voice more accurately by leveraging the linguistic characteristics of singing.
Our approach substantially improves the performance of singing transcription and emphasizes the importance of linguistic features in singing analysis.
arXiv Detail & Related papers (2023-04-12T15:36:01Z) - Translate the Beauty in Songs: Jointly Learning to Align Melody and Translate Lyrics [38.35809268026605]
We propose Lyrics-Melody Translation with Adaptive Grouping (LTAG) as a holistic solution to automatic song translation.
It is a novel encoder-decoder framework that can simultaneously translate the source lyrics and determine the number of aligned notes at each decoding step.
Experiments conducted on an English-Chinese song translation data set show the effectiveness of our model in both automatic and human evaluation.
arXiv Detail & Related papers (2023-03-28T03:17:59Z) - Melody transcription via generative pre-training [86.08508957229348]
A key challenge in melody transcription is building methods that can handle broad audio containing any number of instrument ensembles and musical styles.
To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio.
We derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music.
arXiv Detail & Related papers (2022-12-04T18:09:23Z) - Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z) - Genre-conditioned Acoustic Models for Automatic Lyrics Transcription of Polyphonic Music [73.73045854068384]
We propose to transcribe the lyrics of polyphonic music using a novel genre-conditioned network.
The proposed network adopts pre-trained model parameters, and incorporates the genre adapters between layers to capture different genre peculiarities for lyrics-genre pairs.
Our experiments show that the proposed genre-conditioned network outperforms the existing lyrics transcription systems.
arXiv Detail & Related papers (2022-04-07T09:15:46Z) - Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate the source transcript and the target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)