AudioInceptionNeXt: TCL AI LAB Submission to EPIC-SOUND Audio-Based-Interaction-Recognition Challenge 2023
- URL: http://arxiv.org/abs/2307.07265v1
- Date: Fri, 14 Jul 2023 10:39:05 GMT
- Title: AudioInceptionNeXt: TCL AI LAB Submission to EPIC-SOUND Audio-Based-Interaction-Recognition Challenge 2023
- Authors: Kin Wai Lau, Yasar Abbas Ur Rehman, Yuyang Xie, Lan Ma
- Abstract summary: This report presents the technical details of our submission to the 2023 Epic-Kitchen EPIC-SOUNDS Audio-Based Interaction Recognition Challenge.
The task is to learn the mapping from audio samples to their corresponding action labels.
Our approach achieved 55.43% top-1 accuracy on the challenge test set, ranking 1st on the public leaderboard.
- Score: 5.0169092839789275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report presents the technical details of our submission to the 2023
Epic-Kitchen EPIC-SOUNDS Audio-Based Interaction Recognition Challenge. The
task is to learn the mapping from audio samples to their corresponding action
labels. To achieve this goal, we propose a simple yet effective single-stream
CNN-based architecture called AudioInceptionNeXt that operates on the
time-frequency log-mel-spectrogram of the audio samples. Motivated by the
design of the InceptionNeXt, we propose parallel multi-scale depthwise
separable convolutional kernels in the AudioInceptionNeXt block, which enable
the model to learn the time and frequency information more effectively. The
large-scale separable kernels capture the long duration of activities and the
global frequency semantic information, while the small-scale separable kernels
capture the short duration of activities and local details of frequency
information. Our approach achieved 55.43% top-1 accuracy on the challenge
test set, ranking 1st on the public leaderboard. Code is available
anonymously at https://github.com/StevenLauHKHK/AudioInceptionNeXt.git.
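
The abstract describes the block design only in prose, so below is a minimal PyTorch sketch of a parallel multi-scale depthwise-separable block in the spirit of that description. The class name, the kernel sizes (3, 11, 23), the channel width, and the normalization/residual layout are illustrative assumptions, not the authors' exact AudioInceptionNeXt configuration; the linked repository is the authoritative reference.

```python
# Hedged sketch: kernel sizes, channel width, norm/activation choices are
# assumptions, not the authors' exact AudioInceptionNeXt block.
import torch
import torch.nn as nn


class MultiScaleSeparableBlock(nn.Module):
    """Parallel multi-scale depthwise-separable convolutions over a
    log-mel spectrogram laid out as (batch, channels, frequency, time)."""

    def __init__(self, channels: int, kernel_sizes=(3, 11, 23)):
        super().__init__()
        # Depthwise 1-D kernels along the time axis: small kernels pick up
        # short events, large kernels cover long-duration activities.
        self.time_branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=(1, k),
                      padding=(0, k // 2), groups=channels, bias=False)
            for k in kernel_sizes
        ])
        # Depthwise 1-D kernels along the frequency axis: small kernels keep
        # local spectral detail, large kernels see global frequency structure.
        self.freq_branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=(k, 1),
                      padding=(k // 2, 0), groups=channels, bias=False)
            for k in kernel_sizes
        ])
        self.norm = nn.BatchNorm2d(channels)
        # Pointwise (1x1) convolution mixes channels: the "separable" part.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sum the multi-scale depthwise responses, mix channels, add residual.
        y = sum(b(x) for b in self.time_branches) \
            + sum(b(x) for b in self.freq_branches)
        y = self.act(self.pointwise(self.norm(y)))
        return x + y


if __name__ == "__main__":
    # Stand-in for a log-mel spectrogram after a convolutional stem:
    # batch of 2, 64 channels, 128 mel bins, 400 time frames (illustrative).
    x = torch.randn(2, 64, 128, 400)
    print(MultiScaleSeparableBlock(64)(x).shape)  # torch.Size([2, 64, 128, 400])
```

In the actual submission such blocks would be stacked into a full single-stream classifier ending in pooling and a linear head over the EPIC-SOUNDS action classes; consult the released code for the exact architecture.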
Related papers
- Neurobench: DCASE 2020 Acoustic Scene Classification benchmark on XyloAudio 2 [0.06752396542927405]
XyloAudio is a line of ultra-low-power audio inference chips.
It is designed for in- and near-microphone analysis of audio in real-time energy-constrained scenarios.
arXiv Detail & Related papers (2024-10-31T09:48:12Z)
- DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection [16.92604848450722]
This paper describes sound event localization and detection (SELD) for spatial audio recordings captured by first-order ambisonics (FOA) microphones.
We propose a novel method of pretraining the feature extraction part of the deep neural network (DNN) in a self-supervised manner.
arXiv Detail & Related papers (2024-10-30T08:31:58Z)
- Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as audio-visual segmentation has emerged, aiming to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet).
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
- TIM: A Time Interval Machine for Audio-Visual Action Recognition [64.24297230981168]
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder.
We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE.
arXiv Detail & Related papers (2024-04-08T14:30:42Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- A Multi-View Approach To Audio-Visual Speaker Verification [38.9710777250597]
In this study, we explore audio-visual approaches to speaker verification.
We report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset.
This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
arXiv Detail & Related papers (2021-02-11T22:29:25Z)
- VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.