AudioInceptionNeXt: TCL AI LAB Submission to EPIC-SOUND
Audio-Based-Interaction-Recognition Challenge 2023
- URL: http://arxiv.org/abs/2307.07265v1
- Date: Fri, 14 Jul 2023 10:39:05 GMT
- Title: AudioInceptionNeXt: TCL AI LAB Submission to EPIC-SOUND
Audio-Based-Interaction-Recognition Challenge 2023
- Authors: Kin Wai Lau, Yasar Abbas Ur Rehman, Yuyang Xie, Lan Ma
- Abstract summary: This report presents the technical details of our submission to the 2023 Epic-Kitchen EPIC-SOUNDS Audio-Based Interaction Recognition Challenge.
The task is to learn the mapping from audio samples to their corresponding action labels.
Our approach achieved 55.43% top-1 accuracy on the challenge test set, ranking 1st on the public leaderboard.
- Score: 5.0169092839789275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report presents the technical details of our submission to the 2023
Epic-Kitchen EPIC-SOUNDS Audio-Based Interaction Recognition Challenge. The
task is to learn the mapping from audio samples to their corresponding action
labels. To achieve this goal, we propose a simple yet effective single-stream
CNN-based architecture called AudioInceptionNeXt that operates on the
time-frequency log-mel-spectrogram of the audio samples. Motivated by the
design of InceptionNeXt, we propose parallel multi-scale depthwise
separable convolutional kernels in the AudioInceptionNeXt block, which enable
the model to learn the time and frequency information more effectively. The
large-scale separable kernels capture the long duration of activities and the
global frequency semantic information, while the small-scale separable kernels
capture the short duration of activities and local details of frequency
information. Our approach achieved 55.43% top-1 accuracy on the challenge
test set, ranking 1st on the public leaderboard. Code is available
anonymously at https://github.com/StevenLauHKHK/AudioInceptionNeXt.git.
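As a rough illustration of this design, below is a minimal PyTorch sketch of a block with parallel multi-scale depthwise 1-D kernels along the time and frequency axes of a log-mel feature map. The kernel sizes, channel count, and normalization/activation choices are assumptions for illustration; the authors' exact block is in the linked repository.

```python
import torch
import torch.nn as nn

class MultiScaleSeparableBlock(nn.Module):
    """Parallel multi-scale depthwise 1-D kernels over time and frequency."""

    def __init__(self, channels: int, kernel_sizes=(3, 11, 21)):
        super().__init__()
        # Depthwise kernels along the time axis (1 x k) and the frequency
        # axis (k x 1); small k captures short events and local frequency
        # detail, large k captures long activities and global structure.
        self.time_branches = nn.ModuleList(
            nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
            for k in kernel_sizes
        )
        self.freq_branches = nn.ModuleList(
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)
            for k in kernel_sizes
        )
        n_branches = 2 * len(kernel_sizes)
        # Pointwise conv fuses the per-scale depthwise responses.
        self.pointwise = nn.Conv2d(channels * n_branches, channels, 1)
        self.norm = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, mel bins, time frames)
        outs = [b(x) for b in self.time_branches] + [b(x) for b in self.freq_branches]
        y = self.pointwise(torch.cat(outs, dim=1))
        return x + self.act(self.norm(y))  # residual connection

block = MultiScaleSeparableBlock(channels=64)
feats = torch.randn(2, 64, 128, 400)  # batch of log-mel feature maps
print(block(feats).shape)             # torch.Size([2, 64, 128, 400])
```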
Related papers
- Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as audio-visual segmentation (AVS) has emerged, aiming to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet), which leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
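A generic cross-attention sketch of that idea, in which audio features query visual frame features so each audio step attends to the frames it correlates with; this is an illustrative stand-in, not PMCANet's actual module.

```python
import torch
import torch.nn as nn

dim = 256
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)

audio = torch.randn(2, 50, dim)    # (batch, audio steps, dim)
visual = torch.randn(2, 196, dim)  # (batch, visual patches, dim)

# Audio is the query; visual patches are keys/values.
fused, weights = attn(query=audio, key=visual, value=visual)
print(fused.shape)    # torch.Size([2, 50, 256])
print(weights.shape)  # torch.Size([2, 50, 196]) attention over patches
```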
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
- TIM: A Time Interval Machine for Audio-Visual Action Recognition [64.24297230981168]
We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events.
We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder.
We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE.
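A hedged sketch of the time-interval-as-query idea: a (start, end) interval plus a modality tag is embedded into a single query vector that attends over a sequence of audio-visual tokens. Layer sizes and the interval encoder are illustrative assumptions, not TIM's actual architecture.

```python
import torch
import torch.nn as nn

dim = 256
interval_mlp = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
modality_emb = nn.Embedding(2, dim)  # 0 = audio, 1 = visual
decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

tokens = torch.randn(2, 100, dim)                   # audio-visual tokens from a long video
interval = torch.tensor([[0.2, 0.7], [0.1, 0.4]])   # normalized (start, end) times
modality = torch.tensor([0, 1])                     # query an audio / a visual event

# The interval-plus-modality query attends over the token sequence.
query = (interval_mlp(interval) + modality_emb(modality)).unsqueeze(1)  # (B, 1, dim)
event_repr = decoder(tgt=query, memory=tokens)      # (B, 1, dim), fed to a classifier
```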
arXiv Detail & Related papers (2024-04-08T14:30:42Z)
- AVicuna: Audio-Visual LLM with Interleaver and Context-Boundary Alignment for Temporal Referential Dialogue [35.603271710124424]
We introduce PU-VALOR, an extensive audio-visual dataset comprising over 114,000 untrimmed videos with accurate temporal demarcations.
We also present AVicuna, featuring an Audio-Visual Tokens Interleaver (AVTI) that ensures the temporal alignment of audio-visual information.
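A toy sketch of one reading of the interleaving idea: audio and video tokens are merged by timestamp so a language model sees a single temporally ordered stream. This is only an illustration, not AVicuna's actual AVTI.

```python
from heapq import merge

# (timestamp_seconds, token) pairs per modality, each already sorted in time.
audio_tokens = [(0.0, "a0"), (0.5, "a1"), (1.0, "a2")]
video_tokens = [(0.0, "v0"), (0.8, "v1")]

# Stable merge by timestamp produces one temporally aligned token stream.
interleaved = list(merge(audio_tokens, video_tokens, key=lambda p: p[0]))
print([tok for _, tok in interleaved])  # ['a0', 'v0', 'a1', 'v1', 'a2']
```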
arXiv Detail & Related papers (2024-03-24T19:50:49Z)
- Large-scale unsupervised audio pre-training for video-to-speech synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper, we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24 kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
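A minimal PyTorch sketch of that transfer step: pre-train an audio decoder on audio-only data, then copy its weights into the video-to-speech model's decoder before fine-tuning. The module here is an illustrative placeholder, not the paper's actual network.

```python
import torch
import torch.nn as nn

def make_decoder() -> nn.Module:
    # Stand-in decoder; the real one maps latents back to 24 kHz audio.
    return nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 1024))

audio_decoder = make_decoder()
# ... pre-train audio_decoder inside an audio autoencoder, then save it:
torch.save(audio_decoder.state_dict(), "audio_decoder.pt")

# Initialize the video-to-speech decoder from the pre-trained weights.
v2s_decoder = make_decoder()
v2s_decoder.load_state_dict(torch.load("audio_decoder.pt"))
```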
arXiv Detail & Related papers (2023-06-27T13:31:33Z)
- LEAN: Light and Efficient Audio Classification Network [1.5070398746522742]
We propose LEAN, a lightweight on-device deep learning model for audio classification.
LEAN consists of a raw-waveform-based temporal feature extractor, called Wave realignment, and a log-mel-based pretrained YAMNet.
We show that combining a trainable wave encoder and a pretrained YAMNet with cross-attention-based temporal realignment yields competitive performance on downstream audio classification tasks with a smaller memory footprint.
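A hedged sketch of cross-attention-based temporal realignment between two feature streams (e.g., trainable wave-encoder frames and frozen YAMNet frames at a coarser rate). Dimensions and frame rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

dim = 128
align = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

wave_feats = torch.randn(2, 200, dim)   # wave-encoder frames (finer time grid)
yamnet_feats = torch.randn(2, 60, dim)  # pretrained-embedding frames (coarser)

# Re-sample the coarser stream onto the wave encoder's time grid.
aligned, _ = align(query=wave_feats, key=yamnet_feats, value=yamnet_feats)
fused = wave_feats + aligned            # residual fusion before the classifier
```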
arXiv Detail & Related papers (2023-05-22T04:45:04Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
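A minimal DDPM-style training step with text conditioning, as a generic illustration of prompt-conditioned audio diffusion; the denoiser and the text embedding below are placeholders, not Make-An-Audio's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    # Placeholder: real models use a U-Net over mel-spectrogram latents.
    def __init__(self, dim=80, cond_dim=128):
        super().__init__()
        self.net = nn.Linear(dim + cond_dim + 1, dim)
    def forward(self, x, t, cond):
        t_feat = (t.float() / T).view(-1, 1)          # crude timestep feature
        return self.net(torch.cat([x, cond, t_feat], dim=-1))

model = Denoiser()
x0 = torch.randn(4, 80)     # clean audio features (toy)
cond = torch.randn(4, 128)  # text-prompt embedding (placeholder)
t = torch.randint(0, T, (4,))

noise = torch.randn_like(x0)
a = alphas_bar[t].unsqueeze(-1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward diffusion
loss = F.mse_loss(model(x_t, t, cond), noise)  # predict the added noise
loss.backward()
```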
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
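A schematic sketch of the two-stage design: a video-to-spectrogram predictor followed by a pre-trained neural vocoder. Both modules here are placeholders standing in for the paper's actual networks.

```python
import torch
import torch.nn as nn

class SpectrogramPredictor(nn.Module):
    # Placeholder: maps per-frame lip-video features to mel frames.
    def __init__(self, vid_dim=512, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(vid_dim, n_mels)
    def forward(self, video_feats):          # (B, T, vid_dim)
        return self.proj(video_feats)        # (B, T, n_mels)

predictor = SpectrogramPredictor()
video_feats = torch.randn(1, 120, 512)       # silent talking-face features
mel = predictor(video_feats)                 # predicted mel spectrogram

# A pre-trained vocoder (e.g., HiFi-GAN) would then map `mel` to a
# waveform; that call is left as a stub here:
# waveform = vocoder(mel.transpose(1, 2))
```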
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
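A minimal sketch contrasting temporal average pooling with learned attention pooling over per-frame CNN features; exact dimensions and the scorer are illustrative.

```python
import torch
import torch.nn as nn

dim = 256
frames = torch.randn(2, 100, dim)        # (batch, time frames, features)

# Conventional temporal average pooling: every frame weighted equally.
avg_pooled = frames.mean(dim=1)          # (2, 256)

# Attention pooling: a learned scorer weights informative frames more.
scorer = nn.Linear(dim, 1)
weights = torch.softmax(scorer(frames), dim=1)   # (2, 100, 1) over time
attn_pooled = (weights * frames).sum(dim=1)      # (2, 256)

# Either pooled vector then feeds a classifier over speaker counts.
head = nn.Linear(dim, 5)                 # e.g., 0-4 concurrent speakers
logits = head(attn_pooled)
```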
arXiv Detail & Related papers (2021-10-30T19:24:57Z)
- A Multi-View Approach To Audio-Visual Speaker Verification [38.9710777250597]
In this study, we explore audio-visual approaches to speaker verification.
We report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset.
This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
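For reference, the equal error rate (EER) is the operating point where the false-acceptance rate equals the false-rejection rate. A small NumPy sketch on toy verification scores:

```python
import numpy as np

def eer(genuine: np.ndarray, impostor: np.ndarray) -> float:
    # Sweep thresholds over all observed scores and find where the
    # false-acceptance and false-rejection rates cross.
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2)

rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)    # same-speaker trial scores
impostor = rng.normal(0.0, 1.0, 1000)   # different-speaker trial scores
print(f"EER: {eer(genuine, impostor):.1%}")
```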
arXiv Detail & Related papers (2021-02-11T22:29:25Z)
- VGGSound: A Large-scale Audio-Visual Dataset [160.1604237188594]
We propose a scalable pipeline to create an audio dataset from open-source media.
We use this pipeline to curate the VGGSound dataset consisting of more than 210k videos for 310 audio classes.
The resulting dataset can be used for training and evaluating audio recognition models.
arXiv Detail & Related papers (2020-04-29T17:46:54Z)