Multi-Source Transformer Architectures for Audiovisual Scene
Classification
- URL: http://arxiv.org/abs/2210.10212v1
- Date: Tue, 18 Oct 2022 23:42:42 GMT
- Title: Multi-Source Transformer Architectures for Audiovisual Scene
Classification
- Authors: Wim Boes, Hugo Van hamme
- Abstract summary: The systems we submitted for subtask 1B of the DCASE 2021 challenge, regarding audiovisual scene classification, are described in detail.
They are essentially multi-source transformers employing a combination of auditory and visual features to make predictions.
- Score: 14.160670979300628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this technical report, the systems we submitted for subtask 1B of the
DCASE 2021 challenge, regarding audiovisual scene classification, are described
in detail. They are essentially multi-source transformers employing a
combination of auditory and visual features to make predictions. These models
are evaluated utilizing the macro-averaged multi-class cross-entropy and
accuracy metrics.
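The report itself includes no code, so the following is a minimal PyTorch sketch of the general multi-source idea only: per-modality projections, a shared transformer encoder over the concatenated token sequences, and a classification head. All module names and dimensions here are hypothetical illustrations rather than the submitted systems; only num_classes=10 reflects the ten scene classes of the task.

```python
import torch
import torch.nn as nn

class MultiSourceTransformer(nn.Module):
    """Toy audiovisual fusion: one projection per source, a shared
    transformer over the concatenated token sequence, and a linear
    classification head over the pooled output."""

    def __init__(self, audio_dim=128, visual_dim=512, d_model=256,
                 num_classes=10, num_layers=2, num_heads=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Learned embedding marking which source a token came from.
        self.source_embed = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T_a, audio_dim); visual_feats: (B, T_v, visual_dim)
        a = self.audio_proj(audio_feats) + self.source_embed.weight[0]
        v = self.visual_proj(visual_feats) + self.source_embed.weight[1]
        tokens = torch.cat([a, v], dim=1)           # joint audiovisual sequence
        pooled = self.encoder(tokens).mean(dim=1)   # average-pool over time
        return self.head(pooled)                    # scene logits

logits = MultiSourceTransformer()(torch.randn(2, 50, 128),
                                  torch.randn(2, 12, 512))
print(logits.shape)  # torch.Size([2, 10])
```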
In terms of the macro-averaged multi-class cross-entropy, our best model
achieved a score of 0.620 on the validation data. This is slightly better than
the performance of the baseline system (0.658).
With regard to the accuracy measure, our best model achieved a score of
77.1% on the validation data, which is about the same as the performance
obtained by the baseline system (77.0%).
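For reference, here is a small NumPy sketch of how these two metrics can be computed, assuming the usual definition of macro-averaged multi-class cross-entropy as the per-class mean log-loss averaged over classes (lower is better, consistent with 0.620 beating the 0.658 baseline). The function names are hypothetical.

```python
import numpy as np

def macro_avg_cross_entropy(probs, labels, num_classes, eps=1e-12):
    """probs: (N, num_classes) predicted distributions; labels: (N,)
    integer class indices.  Averages -log p(true class) within each
    class, then averages across classes (assumes every class occurs)."""
    log_loss = -np.log(np.clip(probs[np.arange(len(labels)), labels], eps, 1.0))
    return float(np.mean([log_loss[labels == c].mean()
                          for c in range(num_classes)]))

def accuracy(probs, labels):
    """Fraction of samples whose argmax matches the reference label."""
    return float((probs.argmax(axis=1) == labels).mean())
```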
Related papers
- Self-DenseMobileNet: A Robust Framework for Lung Nodule Classification using Self-ONN and Stacking-based Meta-Classifier [1.2300841481611335]
Self-DenseMobileNet is designed to enhance the classification of nodules and non-nodules in chest radiographs (CXRs).
Our framework integrates advanced image standardization and enhancement techniques to optimize the input quality.
When tested on an external dataset, the framework maintained strong generalizability with an accuracy of 89.40%.
arXiv Detail & Related papers (2024-10-16T14:04:06Z)
- Music Genre Classification using Large Language Models [50.750620612351284]
This paper exploits the zero-shot capabilities of pre-trained large language models (LLMs) for music genre classification.
The proposed approach splits audio signals into 20 ms chunks and processes them through convolutional feature encoders.
During inference, predictions on individual chunks are aggregated for a final genre classification (a toy sketch of this chunk-and-aggregate scheme appears after this list).
arXiv Detail & Related papers (2024-10-10T19:17:56Z)
- Evaluation of Speech Representations for MOS prediction [0.7329200485567826]
In this paper, we evaluate feature extraction models for predicting speech quality.
We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models.
arXiv Detail & Related papers (2023-06-16T17:21:42Z)
- Domain Adaptation of Transformer-Based Models using Unlabeled Data for Relevance and Polarity Classification of German Customer Feedback [1.2999413717930817]
This work explores how efficient transformer-based models are when working with a German customer feedback dataset.
The experimental results show that transformer-based models can reach significant improvements compared to a fastText baseline.
arXiv Detail & Related papers (2022-12-12T08:32:28Z)
- The ReturnZero System for VoxCeleb Speaker Recognition Challenge 2022 [0.0]
We describe team RTZR's top-scoring submissions to the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22).
The top-performing system is a fusion of 7 models spanning 3 different model architectures.
The final submission achieves 0.165 DCF and 2.912% EER on the VoxSRC22 test set.
arXiv Detail & Related papers (2022-09-21T06:54:24Z)
- Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset [71.93633698146002]
The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels.
This study examines how much of the variance in subjective speech-quality ratings can be explained by metadata and by distribution imbalances in the dataset.
arXiv Detail & Related papers (2022-09-14T00:45:49Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose ViTAE, a Vision Transformer Advanced by Exploring the intrinsic inductive bias (IB) from convolutions.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance: 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Improved Multiscale Vision Transformers for Classification and Detection [80.64111139883694]
We study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection.
We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections.
We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition.
arXiv Detail & Related papers (2021-12-02T18:59:57Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
- XD at SemEval-2020 Task 12: Ensemble Approach to Offensive Language Identification in Social Media Using Transformer Encoders [17.14709845342071]
This paper presents six document classification models using the latest transformer encoders and a high-performing ensemble model for a task of offensive language identification in social media.
Our analysis shows that although the ensemble model significantly improves the accuracy on the development set, the improvement is not as evident on the test set.
arXiv Detail & Related papers (2020-07-21T17:03:00Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model, based on RNN-Transducer with an improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
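As an illustration of the chunk-and-aggregate scheme described in the music genre classification entry above, here is a hedged Python sketch: the 20 ms chunk length comes from that abstract, while `predict_chunk` is a hypothetical stand-in for the paper's feature encoder and classifier.

```python
import numpy as np

def classify_by_chunks(waveform, sample_rate, predict_chunk, num_classes):
    """Split a 1-D waveform into 20 ms chunks, classify each chunk, and
    average the per-chunk probabilities into a single prediction."""
    chunk_len = int(0.020 * sample_rate)         # samples per 20 ms chunk
    n_chunks = max(len(waveform) // chunk_len, 1)
    probs = np.zeros(num_classes)
    for i in range(n_chunks):
        probs += predict_chunk(waveform[i * chunk_len:(i + 1) * chunk_len])
    return int(np.argmax(probs / n_chunks))      # final genre label
```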
This list is automatically generated from the titles and abstracts of the papers on this site.