A Study of Few-Shot Audio Classification
- URL: http://arxiv.org/abs/2012.01573v1
- Date: Wed, 2 Dec 2020 22:19:16 GMT
- Title: A Study of Few-Shot Audio Classification
- Authors: Piper Wolters, Chris Careaga, Brian Hutchinson, Lauren Phillips
- Abstract summary: Few-shot learning is a type of machine learning designed to enable the model to generalize to new classes with very few examples.
We evaluate our model for speaker identification on the VoxCeleb dataset and ICSI Meeting Corpus, obtaining 5-shot 5-way accuracies of 93.5% and 54.0%, respectively.
We also evaluate for activity classification from audio using few-shot subsets of the Kinetics 600 dataset and AudioSet, both drawn from YouTube videos, obtaining 51.5% and 35.2% accuracy, respectively.
- Score: 2.1989764549743476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in deep learning have resulted in state-of-the-art performance for
many audio classification tasks but, unlike humans, these systems traditionally
require large amounts of data to make accurate predictions. Not every person or
organization has access to those resources, and the organizations that do, like
our field at large, do not reflect the demographics of our country. Enabling
people to use machine learning without significant resource hurdles is
important, because machine learning is an increasingly useful tool for solving
problems, and can solve a broader set of problems when put in the hands of a
broader set of people. Few-shot learning is a type of machine learning designed
to enable the model to generalize to new classes with very few examples. In
this research, we address two audio classification tasks (speaker
identification and activity classification) with the Prototypical Network
few-shot learning algorithm, and assess performance of various encoder
architectures. Our encoders include recurrent neural networks, as well as one-
and two-dimensional convolutional neural networks. We evaluate our model for
speaker identification on the VoxCeleb dataset and ICSI Meeting Corpus,
obtaining 5-shot 5-way accuracies of 93.5% and 54.0%, respectively. We also
evaluate for activity classification from audio using few-shot subsets of the
Kinetics 600 dataset and AudioSet, both drawn from YouTube videos, obtaining
51.5% and 35.2% accuracy, respectively.
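For context, below is a minimal sketch of one 5-way, 5-shot Prototypical Network episode of the kind the abstract describes, assuming PyTorch. The encoder, feature dimensions, and episode hyperparameters here are hypothetical stand-ins: the paper evaluates recurrent and one- and two-dimensional convolutional encoders whose exact configurations are not given in this abstract.

    # Sketch of one 5-way, 5-shot Prototypical Network episode (assumed PyTorch).
    # The encoder and all sizes are hypothetical stand-ins, not the paper's setup.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    N_WAY, K_SHOT, N_QUERY = 5, 5, 15        # episode shape; N_QUERY is assumed
    N_MELS, N_FRAMES, EMB_DIM = 40, 100, 64  # assumed feature/embedding sizes

    class Conv1dEncoder(nn.Module):
        """Hypothetical 1-D CNN over log-mel frames (channels = mel bins)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(N_MELS, 128, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(128, EMB_DIM, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),     # pool over time -> fixed-length embedding
            )

        def forward(self, x):                # x: (batch, N_MELS, N_FRAMES)
            return self.net(x).squeeze(-1)   # (batch, EMB_DIM)

    def prototypical_episode(encoder, support, query):
        """support: (N_WAY, K_SHOT, N_MELS, N_FRAMES); query: (N_WAY, N_QUERY, N_MELS, N_FRAMES).
        Returns the episode's cross-entropy loss and classification accuracy."""
        z_s = encoder(support.flatten(0, 1)).view(N_WAY, K_SHOT, -1)
        prototypes = z_s.mean(dim=1)                 # class prototype = mean support embedding
        z_q = encoder(query.flatten(0, 1))           # (N_WAY * N_QUERY, EMB_DIM)
        logits = -torch.cdist(z_q, prototypes) ** 2  # nearer prototype -> higher score
        labels = torch.arange(N_WAY).repeat_interleave(N_QUERY)
        loss = F.cross_entropy(logits, labels)
        acc = (logits.argmax(dim=1) == labels).float().mean()
        return loss, acc

    # Usage with random tensors standing in for batched log-mel spectrograms:
    encoder = Conv1dEncoder()
    support = torch.randn(N_WAY, K_SHOT, N_MELS, N_FRAMES)
    query = torch.randn(N_WAY, N_QUERY, N_MELS, N_FRAMES)
    loss, acc = prototypical_episode(encoder, support, query)
    loss.backward()  # an optimizer step over many sampled episodes would follow

In the usual episodic protocol, reported figures such as the 93.5% 5-shot 5-way VoxCeleb accuracy are averages of this per-episode accuracy over many sampled test episodes.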
Related papers
- Some voices are too common: Building fair speech recognition systems using the Common Voice dataset [2.28438857884398]
We use the French Common Voice dataset to quantify the biases of a pre-trained wav2vec 2.0 model toward several demographic groups.
We also run an in-depth analysis of the Common Voice corpus and identify important shortcomings that should be taken into account.
arXiv Detail & Related papers (2023-06-01T11:42:34Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Deep Feature Learning for Medical Acoustics [78.56998585396421]
The purpose of this paper is to compare different learnables in medical acoustics tasks.
A framework has been implemented to classify human respiratory sounds and heartbeats into two categories: healthy or affected by pathologies.
arXiv Detail & Related papers (2022-08-05T10:39:37Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Computing Class Hierarchies from Classifiers [12.631679928202516]
We propose a novel algorithm for automatically acquiring a class hierarchy from a neural network.
Our algorithm produces surprisingly good hierarchies for some well-known deep neural network models.
arXiv Detail & Related papers (2021-12-02T13:01:04Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate: we extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
- Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose the novel fine-grained early frequency attention (FEFA) mechanism for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
- Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation [63.98724740606457]
We present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge.
Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes.
Task 1b concerns classification of data into three higher-level classes using low-complexity solutions.
arXiv Detail & Related papers (2020-07-16T15:07:14Z)
- CURE Dataset: Ladder Networks for Audio Event Classification [15.850545634216484]
There are approximately 3 million people with hearing loss who cannot perceive the events happening around them.
This paper establishes the CURE dataset, which contains a curated set of specific audio events most relevant to people with hearing loss.
arXiv Detail & Related papers (2020-01-12T09:35:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.