Transformer-based Sequence Labeling for Audio Classification based on MFCCs
- URL: http://arxiv.org/abs/2305.00417v2
- Date: Wed, 5 Jul 2023 05:50:38 GMT
- Title: Transformer-based Sequence Labeling for Audio Classification based on MFCCs
- Authors: C. S. Sonali, Chinmayi B S, Ahana Balasubramanian
- Abstract summary: This paper proposes a Transformer-encoder-based model for audio classification using MFCCs.
The model was benchmarked against the ESC-50, Speech Commands v0.02 and UrbanSound8k datasets and has shown strong performance.
The model consisted of a mere 127,544 total parameters, making it lightweight yet highly efficient at the audio classification task.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio classification is vital in areas such as speech and music recognition.
Extracting features such as Mel-spectrograms and MFCCs from the audio signal
is a critical step in audio classification. These features are transformed into
spectrograms for classification. Researchers have explored various techniques,
including traditional machine learning and deep learning methods, to classify
spectrograms, but these can be computationally expensive. To simplify this
process, a more straightforward approach inspired by sequence classification in
NLP can be used. This paper proposes a Transformer-encoder-based model for
audio classification using MFCCs. The model was benchmarked against the ESC-50,
Speech Commands v0.02 and UrbanSound8k datasets and has shown strong
performance, with the highest accuracy of 95.2% obtained upon training the
model on the UrbanSound8k dataset. The model consisted of a mere 127,544 total
parameters, making it lightweight yet highly efficient at the audio
classification task.
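The abstract gives no implementation details beyond the parameter count, so the following is only a minimal PyTorch sketch of the described idea: each MFCC frame is treated as one token of a sequence fed to a Transformer encoder, followed by mean-pooling and a linear classification head. All hyperparameters are assumptions (positional encoding is also omitted for brevity); this is not the authors' 127,544-parameter configuration.

```python
import torch
import torch.nn as nn

class MFCCTransformerClassifier(nn.Module):
    """Sequence-style audio classifier: each MFCC frame is one token."""
    def __init__(self, n_mfcc=13, d_model=64, n_heads=4, n_layers=2, n_classes=10):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, d_model)       # per-frame embedding
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128,
            dropout=0.1, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)    # classification head

    def forward(self, mfcc):                         # mfcc: (batch, frames, n_mfcc)
        x = self.encoder(self.proj(mfcc))            # (batch, frames, d_model)
        return self.head(x.mean(dim=1))              # mean-pool over frames -> logits

model = MFCCTransformerClassifier()
logits = model(torch.randn(8, 100, 13))              # 8 clips, 100 frames, 13 MFCCs
print(logits.shape, sum(p.numel() for p in model.parameters()))
```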
Related papers
- Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering
We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments.
Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations.
We train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212 across the evaluated models.
arXiv Detail & Related papers (2024-10-31T20:26:26Z)
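A rough sketch of the feature-engineering step this paper describes, using librosa: 30-second windows with MFCC, chroma, and spectral-contrast statistics pooled per window and fed to a regressor. The pooling scheme and the choice of RandomForestRegressor are assumptions; the summary does not specify them.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestRegressor

def extract_features(path, sr=22050, window_s=30):
    """Segment the signal into 30 s windows; pool feature stats per window."""
    y, sr = librosa.load(path, sr=sr)
    win = window_s * sr
    rows = []
    for start in range(0, max(len(y) - win + 1, 1), win):
        seg = y[start:start + win]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=20)
        chroma = librosa.feature.chroma_stft(y=seg, sr=sr)
        contrast = librosa.feature.spectral_contrast(y=seg, sr=sr)
        # Mean and std over time give one fixed-length vector per window.
        stats = [f.mean(axis=1) for f in (mfcc, chroma, contrast)]
        stats += [f.std(axis=1) for f in (mfcc, chroma, contrast)]
        rows.append(np.hstack(stats))
    return np.vstack(rows)

# X = np.vstack([extract_features(p) for p in train_paths])  # hypothetical file list
# reg = RandomForestRegressor().fit(X, y_scores)             # y_scores in [0, 100]
```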
- Music Genre Classification using Large Language Models
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification.
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
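As an illustration of the chunk-and-aggregate scheme just described, here is a minimal sketch; `predict_chunk` is a hypothetical stand-in for the convolutional feature encoder plus LLM used in the paper.

```python
import numpy as np

def classify_clip(y, sr, predict_chunk, chunk_ms=20):
    """Split a waveform into 20 ms chunks, classify each, then majority-vote."""
    hop = int(sr * chunk_ms / 1000)                    # samples per 20 ms chunk
    chunks = [y[i:i + hop] for i in range(0, len(y) - hop + 1, hop)]
    preds = [predict_chunk(c) for c in chunks]         # stand-in for encoder + LLM
    labels, counts = np.unique(preds, return_counts=True)
    return labels[np.argmax(counts)]                   # aggregated clip-level genre
```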
- Studying the Effect of Audio Filters in Pre-Trained Models for Environmental Sound Classification
Over the past years, various CNN models have been used to learn audio representations from features such as log-mel spectrograms, gammatone spectral coefficients, and mel spectral coefficients generated from the audio files.
In this paper, we propose a new methodology, Two-Level Classification: Level 1 classifies the audio signal into a broader class, and Level 2 identifies the actual class to which the audio belongs.
We also show the effects of different audio filters, including a new Audio Crop method introduced in this paper, which gave the highest accuracy.
arXiv Detail & Related papers (2024-08-24T18:13:07Z)
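A minimal sketch of the two-level routing described above; the wrapper name and the assumption that each level exposes a scikit-learn-style `predict` are hypothetical.

```python
class TwoLevelClassifier:
    """Level 1 predicts a broad class; a per-branch Level 2 model picks the fine class."""
    def __init__(self, level1, level2_by_branch):
        self.level1 = level1                 # broad-class model (e.g., a CNN wrapper)
        self.level2 = level2_by_branch       # dict: broad class -> fine-grained model

    def predict(self, features):
        broad = self.level1.predict(features)        # Level 1: broader class
        fine = self.level2[broad].predict(features)  # Level 2: actual class
        return broad, fine
```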
- AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our research outcomes demonstrate that AudioFormer attains significantly improved performance compared to prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers
Self-supervised learning (SSL) has seen rapid adoption in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representations from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
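The iteration-0 "random projection" tokenizer mentioned above can be sketched as follows; the feature dimension and codebook size are assumptions, and the later SSL-training and tokenizer-distillation steps are only indicated in the closing comment.

```python
import numpy as np

class RandomProjectionTokenizer:
    """Iteration-0 acoustic tokenizer: random projection + nearest codebook entry."""
    def __init__(self, feat_dim=128, proj_dim=16, vocab_size=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((feat_dim, proj_dim))       # fixed random projection
        self.codebook = rng.standard_normal((vocab_size, proj_dim)) # random code embeddings

    def __call__(self, frames):                    # frames: (n_frames, feat_dim)
        z = frames @ self.proj
        dists = ((z[:, None, :] - self.codebook[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)                # discrete labels for mask-and-predict SSL

# Later iterations would replace this tokenizer with one distilled from the
# pre-trained or fine-tuned audio SSL model, then repeat the mask-and-label training.
```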
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Low-complexity deep learning frameworks for acoustic scene classification
We present low-complexity deep learning frameworks for acoustic scene classification (ASC).
The proposed frameworks can be separated into four main steps: front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities.
Our experiments on the DCASE 2022 Task 1 Development dataset fulfilled the low-complexity requirement and achieved a best classification accuracy of 60.1%.
arXiv Detail & Related papers (2022-06-13T11:41:39Z)
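The late-fusion step in the pipeline above can be sketched as below; the summary does not state the exact fusion rule, so a weighted average of the per-model class probabilities is assumed.

```python
import numpy as np

def late_fusion(prob_list, weights=None):
    """Fuse per-model probability matrices of shape (n_clips, n_classes)."""
    probs = np.stack(prob_list)                       # (n_models, n_clips, n_classes)
    w = np.ones(len(prob_list)) if weights is None else np.asarray(weights, float)
    fused = np.tensordot(w / w.sum(), probs, axes=1)  # weighted mean over models
    return fused.argmax(axis=1)                       # final class per clip
```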
- SoundCLR: Contrastive Learning of Representations For Improved Environmental Sound Classification
SoundCLR is a supervised contrastive learning method for effective environmental sound classification with state-of-the-art performance.
Due to the comparatively small sizes of the available environmental sound datasets, we propose and exploit a transfer learning and strong data augmentation pipeline.
Our experiments show that our masking based augmentation technique on the log-mel spectrograms can significantly improve the recognition performance.
arXiv Detail & Related papers (2021-03-02T18:42:45Z)
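A minimal sketch of the masking-based augmentation on log-mel spectrograms described above, in the spirit of SpecAugment; the mask counts and maximum widths are assumed values, not the paper's settings.

```python
import numpy as np

def mask_augment(log_mel, n_masks=2, max_f=8, max_t=16, seed=None):
    """Zero random frequency bands and time spans of a (mels, frames) log-mel spectrogram."""
    rng = np.random.default_rng(seed)
    s = log_mel.copy()
    mels, frames = s.shape                           # assumes mels > max_f, frames > max_t
    for _ in range(n_masks):
        f0 = rng.integers(0, mels - max_f)           # frequency mask
        s[f0:f0 + rng.integers(1, max_f + 1), :] = 0.0
        t0 = rng.integers(0, frames - max_t)         # time mask
        s[:, t0:t0 + rng.integers(1, max_t + 1)] = 0.0
    return s
```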
- Deep Convolutional and Recurrent Networks for Polyphonic Instrument Classification from Monophonic Raw Audio Waveforms
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms.
The use of deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification purposes.
We attempt to recognize musical instruments in polyphonic audio by only feeding their raw waveforms into deep learning models.
arXiv Detail & Related papers (2021-02-13T13:44:46Z)
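To illustrate feeding raw waveforms directly into a deep model, here is a minimal PyTorch sketch; the layer sizes and instrument count are hypothetical, and the paper's recurrent variants are omitted.

```python
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    """Minimal 1D CNN mapping raw waveforms to per-instrument logits."""
    def __init__(self, n_instruments=11):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=16, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))                 # pool over time
        self.head = nn.Linear(32, n_instruments)

    def forward(self, wav):                          # wav: (batch, 1, samples)
        # Logits suit a multi-label loss such as BCEWithLogitsLoss (polyphonic audio).
        return self.head(self.features(wav).squeeze(-1))

logits = RawWaveformCNN()(torch.randn(2, 1, 16000))  # two 1 s clips at 16 kHz
```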
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.