Team AcieLee: Technical Report for EPIC-SOUNDS Audio-Based Interaction Recognition Challenge 2023
- URL: http://arxiv.org/abs/2306.08998v1
- Date: Thu, 15 Jun 2023 09:49:07 GMT
- Title: Team AcieLee: Technical Report for EPIC-SOUNDS Audio-Based Interaction Recognition Challenge 2023
- Authors: Yuqi Li, Yizhi Luo, Xiaoshuai Hao, Chuanguang Yang, Zhulin An, Dantong Song, Wei Yi
- Abstract summary: The task is to classify audio caused by interactions between objects or by events of the camera wearer.
We conducted exhaustive experiments and found that learning rate step decay, backbone freezing, label smoothing, and focal loss contribute most to the performance improvement.
The proposed method allowed us to achieve 3rd place in the EPIC-SOUNDS Audio-Based Interaction Recognition Challenge at the CVPR 2023 workshop.
- Score: 8.699868810184752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we describe the technical details of our submission to the
EPIC-SOUNDS Audio-Based Interaction Recognition Challenge 2023 by Team
"AcieLee" (username: Yuqi_Li). The task is to classify audio caused by
interactions between objects or by events of the camera wearer. We conducted
exhaustive experiments and found that learning rate step decay, backbone
freezing, label smoothing, and focal loss contributed most to the performance
improvement. After training, we combined multiple models from different stages
and integrated them into a single model by assigning fusion weights. The
proposed method allowed us to achieve 3rd place in the EPIC-SOUNDS Audio-Based
Interaction Recognition Challenge at the CVPR 2023 workshop.
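The abstract credits four training-time choices (learning rate step decay, a frozen backbone, label smoothing, and focal loss) without giving details. Below is a minimal PyTorch sketch of how these pieces typically fit together; it is not the authors' code, and the backbone architecture, hyperparameter values, and the particular way of combining label smoothing with focal loss are illustrative assumptions (44 is the EPIC-SOUNDS class count).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLossWithSmoothing(nn.Module):
    """Focal loss computed against label-smoothed targets (one common combination)."""
    def __init__(self, num_classes: int, gamma: float = 2.0, smoothing: float = 0.1):
        super().__init__()
        self.num_classes = num_classes
        self.gamma = gamma
        self.smoothing = smoothing

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        log_probs = F.log_softmax(logits, dim=-1)
        probs = log_probs.exp()
        # Label smoothing: 1 - eps on the true class, eps / (K - 1) elsewhere.
        with torch.no_grad():
            true_dist = torch.full_like(log_probs, self.smoothing / (self.num_classes - 1))
            true_dist.scatter_(1, target.unsqueeze(1), 1.0 - self.smoothing)
        # Focal modulation: down-weight classes the model already predicts well.
        loss = -(true_dist * (1.0 - probs) ** self.gamma * log_probs).sum(dim=-1)
        return loss.mean()

# Hypothetical model: a pretrained audio backbone feeding a linear head.
backbone = nn.Sequential(nn.Conv1d(1, 64, kernel_size=9), nn.ReLU(),
                         nn.AdaptiveAvgPool1d(1), nn.Flatten())
head = nn.Linear(64, 44)  # 44 classes in EPIC-SOUNDS
for p in backbone.parameters():
    p.requires_grad = False  # "backbone frozen": only the head is trained

optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
# Learning rate step decay: multiply the lr by 0.1 every 10 epochs (illustrative values).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
criterion = FocalLossWithSmoothing(num_classes=44)
```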
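The abstract also describes combining models from different training stages into a single model via fusion weights. One plausible reading, sketched below under assumed interfaces, is a weighted average of per-checkpoint class probabilities at inference time; whether the authors fused probabilities or logits, and how the weights were chosen, is not stated here.

```python
import torch

@torch.no_grad()
def fused_predict(models, weights, audio_batch):
    """Weighted fusion of class probabilities from several checkpoints.

    models:  list of trained nn.Module classifiers mapping audio -> logits
    weights: list of floats, one fusion weight per model
    """
    total = sum(weights)
    fused = None
    for model, w in zip(models, weights):
        model.eval()  # inference mode: no dropout / batch-norm updates
        probs = torch.softmax(model(audio_batch), dim=-1)
        contribution = probs * (w / total)  # normalize weights to sum to 1
        fused = contribution if fused is None else fused + contribution
    return fused.argmax(dim=-1)  # fused class prediction per clip
```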
Related papers
- First Place Solution to the ECCV 2024 ROAD++ Challenge @ ROAD++ Atomic Activity Recognition 2024 [5.674251666234644]
This report presents our team's technical solution for participating in Track 3 of the 2024 ECCV ROAD++ Challenge.
The task of Track 3 is atomic activity recognition, which aims to identify 64 types of atomic activities in road scenes based on video content.
Our approach primarily addresses the challenges of small objects, discriminating between a single object and a group of objects, as well as model overfitting.
arXiv Detail & Related papers (2024-10-30T15:06:58Z)
- Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better [9.378013909890374]
We present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024).
To mitigate the modal competition issue between audio and text, we adopt an early fusion strategy.
Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating our method's effectiveness.
arXiv Detail & Related papers (2024-09-12T05:05:34Z)
- Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z)
- MIPI 2024 Challenge on Few-shot RAW Image Denoising: Methods and Results [105.4843037899554]
We summarize and review the Few-shot RAW Image Denoising track on MIPI 2024.
165 participants were successfully registered, and 7 teams submitted results in the final testing phase.
The developed solutions in this challenge achieved state-of-the-art performance on Few-shot RAW Image Denoising.
arXiv Detail & Related papers (2024-06-11T06:59:55Z)
- 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation [63.199793919573295]
Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames.
Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance.
arXiv Detail & Related papers (2024-06-06T00:56:25Z)
- Overview of the L3DAS23 Challenge on Audio-Visual Extended Reality [15.034352805342937]
The primary goal of the L3DAS23 Signal Processing Grand Challenge at ICASSP 2023 is to promote and support collaborative research on machine learning for 3D audio signal processing.
We provide a brand-new dataset, which maintains the same general characteristics of the L3DAS21 and L3DAS22 datasets.
We propose updated baseline models for both tasks that can now support audio-image couples as input and a supporting API to replicate our results.
arXiv Detail & Related papers (2024-02-14T15:34:28Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Perception Test 2023: A Summary of the First Challenge And Outcome [67.0525378209708]
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
arXiv Detail & Related papers (2023-12-20T15:12:27Z)
- AudioInceptionNeXt: TCL AI LAB Submission to EPIC-SOUND Audio-Based-Interaction-Recognition Challenge 2023 [5.0169092839789275]
This report presents the technical details of our submission to the 2023 Epic-Kitchen EPIC-SOUNDS Audio-Based Interaction Recognition Challenge.
The task is to learn the mapping from audio samples to their corresponding action labels.
Our approach achieved 55.43% of top-1 accuracy on the challenge test set, ranked as 1st on the public leaderboard.
arXiv Detail & Related papers (2023-07-14T10:39:05Z)
- Self-supervised Contrastive Learning for Audio-Visual Action Recognition [7.188231323934023]
The underlying correlation between audio and visual modalities can be utilized to learn supervised information for unlabeled videos.
We propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL) to learn discriminative audio-visual representations for action recognition.
arXiv Detail & Related papers (2022-04-28T10:01:36Z)
- Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation [52.550684208734324]
We propose a cyclic co-learning paradigm that can jointly learn sounding object visual grounding and audio-visual sound separation.
In this paper, we show that the proposed framework outperforms recent competing approaches on both tasks.
arXiv Detail & Related papers (2021-04-05T17:30:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.