Music Genre Classification using Large Language Models
- URL: http://arxiv.org/abs/2410.08321v1
- Date: Thu, 10 Oct 2024 19:17:56 GMT
- Title: Music Genre Classification using Large Language Models
- Authors: Mohamed El Amine Meguenani, Alceu de Souza Britto Jr., Alessandro Lameiras Koerich
- Abstract summary: This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
- Score: 50.750620612351284
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification. The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders, a transformer encoder, and additional layers for coding audio units and generating feature vectors. The extracted feature vectors are used to train a classification head. During inference, predictions on individual chunks are aggregated for a final genre classification. We conducted a comprehensive comparison of LLMs, including WavLM, HuBERT, and wav2vec 2.0, with traditional deep learning architectures like 1D and 2D convolutional neural networks (CNNs) and the audio spectrogram transformer (AST). Our findings demonstrate the superior performance of the AST model, achieving an overall accuracy of 85.5%, surpassing all other models evaluated. These results highlight the potential of LLMs and transformer-based architectures for advancing music information retrieval tasks, even in zero-shot scenarios.
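To make the described pipeline concrete, here is a minimal sketch of chunk-level classification with a frozen pre-trained speech encoder and aggregated per-chunk predictions, assuming the HuggingFace `transformers` and `torch` packages. The checkpoint name, 3-second chunk length, mean pooling, 10-genre head, and logit averaging are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 3.0  # assumed aggregation segment length (not the paper's)

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
head = torch.nn.Linear(768, 10)  # classification head; e.g., 10 genres (illustrative)

@torch.no_grad()
def classify(waveform: torch.Tensor) -> int:
    """Split a mono 16 kHz waveform into fixed-length chunks, encode each
    chunk, and aggregate per-chunk logits into one genre prediction.
    Assumes the waveform contains at least one full chunk."""
    size = int(SAMPLE_RATE * CHUNK_SECONDS)
    chunks = [waveform[i:i + size]
              for i in range(0, len(waveform) - size + 1, size)]
    logits = []
    for chunk in chunks:
        inputs = extractor(chunk.numpy(), sampling_rate=SAMPLE_RATE,
                           return_tensors="pt")
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
        logits.append(head(hidden.mean(dim=1)))       # temporal mean pooling
    # Aggregate chunk-level predictions by averaging logits
    # (one plausible scheme; majority voting over chunks also works).
    return torch.stack(logits).mean(dim=0).argmax(dim=-1).item()
```

In this setup only the head is trained on the extracted feature vectors while the encoder stays frozen, mirroring the abstract's description of training a classification head on top of pre-trained features.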
Related papers
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
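To illustrate the discrete-tokenization idea behind acoustic codec tokenizers like the entry above, here is a generic vector-quantization sketch: continuous encoder frames are snapped to their nearest entries in a learned codebook. This is a conceptual illustration only, not WavTokenizer's actual codebook, sizes, or API.

```python
import torch

def quantize(frames: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """frames: (T, D) continuous encoder features; codebook: (K, D) learned
    codes. Returns (T,) integer token ids, the nearest code per frame."""
    dists = torch.cdist(frames, codebook)  # (T, K) pairwise L2 distances
    return dists.argmin(dim=1)

frames = torch.randn(100, 512)       # hypothetical 100 encoder frames
codebook = torch.randn(4096, 512)    # hypothetical 4096-entry codebook
tokens = quantize(frames, codebook)  # discrete tokens usable by an audio LM
```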
- Music Genre Classification: Training an AI model [0.0]
Music genre classification is an area that applies machine learning models and techniques to the processing of audio signals.
In this research, I explore various machine learning algorithms for music genre classification, using features extracted from audio signals.
I aim to assess the robustness of machine learning models for genre classification and to compare their results.
arXiv Detail & Related papers (2024-05-23T23:07:01Z)
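As a sketch of the classical recipe the entry above describes (hand-crafted features from audio signals feeding standard ML classifiers), the following computes MFCC statistics with librosa and fits a scikit-learn model; the feature set, classifier, and file path are assumptions for illustration, not the paper's exact choices.

```python
import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mfcc_features(path: str) -> np.ndarray:
    """Summarize a track as the per-coefficient mean of 20 MFCCs."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # (20, frames)
    return mfcc.mean(axis=1)

# With X (stacked feature vectors) and y (genre labels) prepared:
# clf = RandomForestClassifier(n_estimators=300).fit(X, y)
# pred = clf.predict(mfcc_features("track.wav").reshape(1, -1))  # hypothetical file
```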
- Leveraging Pre-Trained Autoencoders for Interpretable Prototype Learning of Music Audio [10.946347283718923]
We present PECMAE, an interpretable model for music audio classification based on prototype learning.
Our model is based on a previous method, APNet, which jointly learns an autoencoder and a prototypical network.
We find that the prototype-based models preserve most of the performance achieved with the autoencoder embeddings.
arXiv Detail & Related papers (2024-02-14T17:13:36Z)
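The prototype-learning idea above can be sketched generically: an embedding is assigned the class of its nearest learned prototype. This illustrates the prototypical-network mechanism only, not PECMAE's architecture.

```python
import torch

def prototype_predict(embedding: torch.Tensor, prototypes: torch.Tensor,
                      labels: torch.Tensor) -> int:
    """embedding: (D,); prototypes: (P, D); labels: (P,) class id per prototype.
    Returns the class of the nearest prototype; the prototypes themselves are
    what makes the decision inspectable."""
    dists = torch.norm(prototypes - embedding, dim=1)  # distance to each prototype
    return int(labels[dists.argmin()])
```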
- Cascaded Cross-Modal Transformer for Audio-Textual Classification [30.643750999989233]
We propose to harness the inherent value of multimodal representations by transcribing speech using automatic speech recognition (ASR) models.
We thus obtain an audio-textual (multimodal) representation for each data sample.
We were declared the winning solution in the Requests Sub-Challenge of the ACM Multimedia 2023 Computational Paralinguistics Challenge.
arXiv Detail & Related papers (2024-01-15T10:18:08Z)
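A minimal sketch of the transcription step that creates the textual modality, assuming the HuggingFace `pipeline` API with an illustrative ASR checkpoint and a placeholder audio path; the fusion noted in the comments is a generic strategy, not the paper's exact cascaded architecture.

```python
from transformers import pipeline

# Obtain a text transcript directly from the audio, yielding a second
# (textual) modality for the same sample.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
text = asr("sample.wav")["text"]  # "sample.wav" is a placeholder path

# Downstream, encode the audio and the transcript separately and fuse the
# two representations (e.g., concatenation or cross-attention) before the
# classification head.
```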
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
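Neural codecs of this kind typically quantize the latent space with residual vector quantization, where each stage quantizes the residual left by the previous one; below is a generic sketch of that mechanism with illustrative sizes, not the paper's implementation.

```python
import torch

def residual_vq(latent: torch.Tensor, codebooks: list) -> list:
    """latent: (T, D) encoder output; codebooks: list of (K, D) tensors.
    Each stage encodes the residual of the previous stage, so several small
    codebooks jointly reach high fidelity at a low bitrate."""
    residual, ids = latent, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest code per frame
        residual = residual - cb[idx]                  # subtract quantized part
        ids.append(idx)                                # one token stream per stage
    return ids

codes = residual_vq(torch.randn(75, 128),
                    [torch.randn(1024, 128) for _ in range(4)])
```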
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a pre-trained wav2vec model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
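The differentiable-search idea behind DARTS variants such as light-DARTS can be sketched as a mixed operation whose candidate choices are blended by learnable softmax weights; the candidate operations below are illustrative, not the paper's search space.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """DARTS-style mixed operation: candidate ops are combined with softmax
    weights (alphas) learned jointly with the network weights; after search,
    the op with the largest alpha is kept."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
            nn.Sequential(nn.Linear(dim, dim), nn.Tanh()),
        ])
        self.alphas = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.alphas, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))
```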
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
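The "semantic target" idea above can be sketched with a CLIP-style text encoder: class names are embedded once and used as classification targets instead of one-hot labels. The checkpoint and class names are illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["playing guitar", "dancing", "cooking"]  # hypothetical video labels
inputs = proc(text=classes, return_tensors="pt", padding=True)
with torch.no_grad():
    targets = model.get_text_features(**inputs)  # one semantic target per class

# A video encoder is then trained so its clip embedding lands close to the
# text target of the correct class (e.g., by cosine similarity).
```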
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
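Pseudo codes (discrete acoustic units) are commonly derived by clustering frame-level speech features with k-means; the sketch below shows that recipe with illustrative feature dimensions and cluster count, which may differ from Speech2C's setup.

```python
import numpy as np
from sklearn.cluster import KMeans

feats = np.random.randn(5000, 39)   # stand-in for real frame-level features
km = KMeans(n_clusters=100, n_init=10).fit(feats)
units = km.predict(feats)           # one discrete unit id per frame
# These unit sequences serve as targets for decoder pre-training.
```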
- Audio Transformers: Transformer Architectures For Large Scale Audio Understanding. Adieu Convolutions [6.370905925442655]
We propose applying Transformer-based architectures without convolutional layers to raw audio signals.
Our model outperforms convolutional models, producing state-of-the-art results.
We further improve the performance of Transformer architectures by using techniques such as pooling, inspired by convolutional networks.
arXiv Detail & Related papers (2021-05-01T19:38:30Z)
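A convolution-free Transformer over raw audio can be sketched by slicing the waveform into fixed-size patches that act as tokens; the dimensions below are illustrative, the pooling is a simple mean rather than the paper's CNN-inspired scheme, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class RawAudioTransformer(nn.Module):
    """Tokenize raw samples with a linear patch embedding (no convolutions),
    then classify from Transformer-encoded tokens."""
    def __init__(self, patch: int = 400, dim: int = 128, n_classes: int = 10):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(patch, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # wav: (B, samples)
        b, n = wav.shape
        tokens = wav[:, : n - n % self.patch].reshape(b, -1, self.patch)
        x = self.encoder(self.embed(tokens))
        return self.head(x.mean(dim=1))  # mean-pool tokens over time

logits = RawAudioTransformer()(torch.randn(2, 16_000))  # shape (2, 10)
```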
- NAViDAd: A No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder [0.0]
We propose a No-Reference Audio-Visual Quality Metric Based on a Deep Autoencoder (NAViDAd).
The model is formed by a 2-layer framework that includes a deep autoencoder layer and a classification layer.
The model performed well when tested against the UnB-AV and the LiveNetflix-II databases.
arXiv Detail & Related papers (2020-01-30T15:40:08Z)
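The two-stage design above (an autoencoder producing a compact representation, followed by a classification layer predicting quality) can be sketched as follows; all layer sizes and the five-level quality output are illustrative assumptions.

```python
import torch.nn as nn

# Stage 1: bottleneck autoencoder learns a compact audio-visual feature code.
encoder = nn.Sequential(nn.Linear(1024, 128), nn.ReLU())
decoder = nn.Linear(128, 1024)  # trained to reconstruct the input features
# Stage 2: a classification layer maps the bottleneck code to quality scores.
quality_head = nn.Linear(128, 5)  # e.g., five quality levels (hypothetical)
```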