Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024
- URL: http://arxiv.org/abs/2409.19595v1
- Date: Sun, 29 Sep 2024 07:28:21 GMT
- Title: Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024
- Authors: Haowei Gu, Weihao Zhu, Yang Yang
- Abstract summary: This report proposes an improved method for the Temporal Sound Localisation task.
The task is to localize and classify the sound events occurring in a video according to a predefined set of sound classes.
Our approach ranks first in the final test with a score of 0.4925.
- Score: 3.4947857354806633
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report proposes an improved method for the Temporal Sound Localisation (TSL) task, which localizes and classifies the sound events occurring in a video according to a predefined set of sound classes. The champion solution from last year's first competition explored TSL by fusing the audio and video modalities with equal weights. Since the TSL task aims to localize sound events, we conduct experiments that demonstrate the superiority of sound features (Section 3). Based on these findings, we employ several models, including InternVideo, CAV-MAE, and VideoMAE, to extract features and enhance the audio modality. Our approach ranks first in the final test with a score of 0.4925.
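As a rough illustration of the fusion idea the abstract describes, the sketch below mixes per-frame audio and visual features with a scalar weight that can favor the audio modality. The module name, feature dimensions, and weight value are assumptions for illustration, not details taken from the report.

```python
import torch
import torch.nn as nn

class WeightedAVFusion(nn.Module):
    """Fuse per-frame audio and visual features with a modality weight.

    A weight of 0.5 reproduces the equal-weight fusion of last year's
    champion solution; values above 0.5 favor the audio modality, in the
    spirit of the report's findings. All dimensions are illustrative.
    """

    def __init__(self, audio_dim: int, visual_dim: int, fused_dim: int,
                 audio_weight: float = 0.7):
        super().__init__()
        # Project both modalities into a shared space before mixing.
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.audio_weight = audio_weight

    def forward(self, audio_feats: torch.Tensor,
                visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim)
        # visual_feats: (batch, time, visual_dim)
        a = self.audio_proj(audio_feats)
        v = self.visual_proj(visual_feats)
        w = self.audio_weight
        return w * a + (1.0 - w) * v  # (batch, time, fused_dim)


# Example: hypothetical 1024-d audio and 768-d visual features, 100 frames.
fusion = WeightedAVFusion(audio_dim=1024, visual_dim=768, fused_dim=512)
fused = fusion(torch.randn(2, 100, 1024), torch.randn(2, 100, 768))
print(fused.shape)  # torch.Size([2, 100, 512])
```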
Related papers
- Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation [3.2472293599354596]
This report describes our systems submitted for the DCASE2024 Task 3 challenge: Audio and Audiovisual Sound Event Localization and Detection with Source Distance Estimation (Track B).
Our main model is based on the audio-visual (AV) Conformer, which processes video embeddings extracted with ResNet50 and audio embeddings from an audio encoder pre-trained on SELD.
This model outperformed the audio-visual baseline on the development set of the STARSS23 dataset by a wide margin, halving its DOAE and improving its F1 score by more than 3x.
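A minimal sketch of that pipeline, under stated assumptions: per-frame ResNet50 video embeddings and SELD-pretrained audio embeddings are concatenated and fed to a Conformer encoder. Here torchaudio's generic Conformer stands in for the authors' AV Conformer, and all dimensions are hypothetical.

```python
import torch
import torchaudio

# Hypothetical per-frame embeddings: ResNet50 video features (2048-d)
# and features from a SELD-pretrained audio encoder (512-d).
video_emb = torch.randn(4, 50, 2048)
audio_emb = torch.randn(4, 50, 512)

# Concatenate along the feature axis and encode jointly.
av_input = torch.cat([video_emb, audio_emb], dim=-1)  # (4, 50, 2560)
conformer = torchaudio.models.Conformer(
    input_dim=2560, num_heads=8, ffn_dim=1024,
    num_layers=4, depthwise_conv_kernel_size=31,
)
lengths = torch.full((4,), 50)           # all clips have 50 frames
out, _ = conformer(av_input, lengths)    # (4, 50, 2560)
```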
arXiv Detail & Related papers (2024-10-29T17:28:43Z)
- Can Large Audio-Language Models Truly Hear? Tackling Hallucinations with Multi-Task Assessment and Stepwise Audio Reasoning [55.2480439325792]
Large audio-language models (LALMs) have shown impressive capabilities in understanding and reasoning about audio and speech information.
These models still face challenges, including hallucinating non-existent sound events, misidentifying the order of sound events, and incorrectly attributing sound sources.
arXiv Detail & Related papers (2024-10-21T15:55:27Z)
- The Solution for Temporal Sound Localisation Task of ICCV 1st Perception Test Challenge 2023 [11.64675515432159]
We employ a multimodal fusion approach to combine visual and audio features.
High-quality visual features are extracted using a state-of-the-art self-supervised pre-training network.
At the same time, audio features serve as complementary information to help the model better localize the start and end of sounds.
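To make this concrete, here is a toy sketch of concatenation-based fusion feeding per-frame classification and boundary-regression heads. The dimensions, class count, and head design are illustrative guesses, not the challenge entry's actual architecture.

```python
import torch
import torch.nn as nn

class FusionTSLHead(nn.Module):
    """Toy temporal sound localisation head: concatenates visual and
    audio features, then predicts per-frame class scores plus offsets
    to a sound event's start and end. All sizes are hypothetical."""

    def __init__(self, visual_dim=768, audio_dim=512, num_classes=17):
        super().__init__()
        fused = visual_dim + audio_dim
        self.classifier = nn.Linear(fused, num_classes)  # per-frame class logits
        self.boundaries = nn.Linear(fused, 2)            # distances to start/end

    def forward(self, visual, audio):
        # visual: (B, T, visual_dim), audio: (B, T, audio_dim)
        x = torch.cat([visual, audio], dim=-1)
        return self.classifier(x), self.boundaries(x)

head = FusionTSLHead()
cls, bnd = head(torch.randn(2, 100, 768), torch.randn(2, 100, 512))
print(cls.shape, bnd.shape)  # torch.Size([2, 100, 17]) torch.Size([2, 100, 2])
```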
arXiv Detail & Related papers (2024-07-01T12:52:05Z)
- EAT: Self-Supervised Pre-Training with Efficient Audio Transformer [2.443213094810588]
Efficient Audio Transformer (EAT) is inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality.
A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events.
Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks.
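The Utterance-Frame Objective can be read as combining a clip-level (utterance) term with a masked frame-level term. The sketch below is a simplified paraphrase under that assumption; the use of MSE losses and the weighting hyperparameter are arbitrary choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def ufo_loss(frame_pred, frame_target, utt_pred, utt_target, lam=1.0):
    """Illustrative Utterance-Frame Objective (not EAT's real code).

    frame_pred/frame_target: (batch, n_masked, dim) predictions and
        teacher targets for masked frames (frame-level term).
    utt_pred/utt_target: (batch, dim) pooled clip-level representation
        and its target (utterance-level term).
    lam: hypothetical weight balancing the two terms.
    """
    frame_loss = F.mse_loss(frame_pred, frame_target)
    utt_loss = F.mse_loss(utt_pred, utt_target)
    return frame_loss + lam * utt_loss

# Toy usage with random tensors.
loss = ufo_loss(torch.randn(2, 20, 256), torch.randn(2, 20, 256),
                torch.randn(2, 256), torch.randn(2, 256))
print(loss.item())
```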
arXiv Detail & Related papers (2024-01-07T14:31:27Z)
- AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z)
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events of various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
Self-supervised learning (SSL) has seen rapid growth in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representations from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
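Schematically, the first such iteration might look like the toy example below, where a frozen random projection plays the role of the iteration-1 acoustic tokenizer and a small Transformer is trained with mask-and-label prediction. Every component here is a simplified stand-in, not the real BEATs code.

```python
import torch
import torch.nn as nn

# All sizes are illustrative.
FEAT_DIM, VOCAB, T = 128, 512, 100

# Iteration 1: a frozen random projection acts as the acoustic tokenizer.
tokenizer = nn.Linear(FEAT_DIM, VOCAB, bias=False)
tokenizer.requires_grad_(False)

# A small Transformer stands in for the audio SSL model.
ssl_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=FEAT_DIM, nhead=8, batch_first=True),
    num_layers=2,
)
head = nn.Linear(FEAT_DIM, VOCAB)  # predicts the tokenizer's labels
opt = torch.optim.Adam(list(ssl_model.parameters()) + list(head.parameters()))

frames = torch.randn(4, T, FEAT_DIM)          # toy audio features
labels = tokenizer(frames).argmax(dim=-1)     # (4, T) discrete targets

# Mask-and-label prediction: zero out 75% of frames and predict their labels.
mask = torch.rand(4, T) < 0.75
masked = frames.clone()
masked[mask] = 0.0

logits = head(ssl_model(masked))              # (4, T, VOCAB)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
loss.backward()
opt.step()
# In later iterations, a new tokenizer would be distilled from ssl_model.
```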
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
- Dual Normalization Multitasking for Audio-Visual Sounding Object Localization [0.0]
We propose a new concept, Sounding Object, to reduce the ambiguity of the visual location of sound.
To tackle this new AVSOL problem, we propose a novel multitask training strategy and architecture called Dual Normalization Multitasking.
arXiv Detail & Related papers (2021-06-01T02:02:52Z)
- Generating Visually Aligned Sound from Videos [83.89485254543888]
We focus on the task of generating sound from natural videos.
The sound should be both temporally and content-wise aligned with visual signals.
Some sounds generated outside the camera's field of view cannot be inferred from the video content.
arXiv Detail & Related papers (2020-07-14T07:51:06Z)