AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
- URL: http://arxiv.org/abs/2309.10787v2
- Date: Tue, 19 Mar 2024 08:51:51 GMT
- Title: AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
- Authors: Yuan Tseng, Layne Berry, Yi-Ting Chen, I-Hsiang Chiu, Hsuan-Hao Lin, Max Liu, Puyuan Peng, Yi-Jen Shih, Hung-Yu Wang, Haibin Wu, Po-Yao Huang, Chun-Mao Lai, Shang-Wen Li, David Harwath, Yu Tsao, Shinji Watanabe, Abdelrahman Mohamed, Chi-Luen Feng, Hung-yi Lee
- Abstract summary: We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task.
- Score: 92.92233932921741
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual representation learning aims to develop systems with human-like perception by utilizing correlations between auditory and visual information. However, current models often focus on a limited set of tasks, and the generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark, which enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of them generalizes to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning, and that audio event classification on AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.
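Benchmarks in the SUPERB family typically keep the pretrained model frozen and train only a lightweight task-specific head on its features. The sketch below illustrates that general recipe with a toy stand-in encoder; the `frozen_encoder` function and the closed-form ridge head are hypothetical illustrations, not AV-SUPERB's actual evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(x):
    """Stand-in for a frozen pretrained model: a fixed random tanh projection.
    Its weights are never updated during downstream evaluation."""
    W = np.random.default_rng(42).normal(size=(x.shape[1], 128))
    return np.tanh(x @ W)

# Toy downstream task: binary classification of 64-dim inputs.
X = rng.normal(size=(200, 64))
y = (X[:, 0] > 0).astype(int)

feats = frozen_encoder(X)                             # encoder stays frozen
feats = np.hstack([feats, np.ones((len(feats), 1))])  # bias column

# Train only the lightweight head: ridge-regularized least squares on +/-1 targets.
t = 2.0 * y - 1.0
w = np.linalg.solve(feats.T @ feats + 1e-3 * np.eye(feats.shape[1]), feats.T @ t)

# Probe quality of the representation via downstream accuracy (on the train set,
# for brevity); a real benchmark would use held-out splits per task.
acc = float(np.mean((feats @ w > 0).astype(int) == y))
```

Only `w` is learned per task, so the downstream score reflects what the frozen representation already encodes rather than the capacity of the head.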
Related papers
- AudioBench: A Universal Benchmark for Audio Large Language Models [41.46064884020139]
We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs)
It encompasses 8 distinct tasks and 26 datasets, among which, 7 are newly proposed datasets.
The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice (paralinguistic) understanding.
arXiv Detail & Related papers (2024-06-23T05:40:26Z)
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms models trained with external supervision on these benchmarks.
arXiv Detail & Related papers (2022-12-15T18:59:59Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE).
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
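Contrastive audio-visual models of this kind align paired audio and video clip embeddings while pushing mismatched pairs apart. The following is a hedged sketch of a symmetric InfoNCE objective, a standard form of that idea; CAV-MAE's actual loss and its masking details differ, and all names here are illustrative.

```python
import numpy as np

def info_nce(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/video embeddings.
    Row i of each array is assumed to come from the same clip."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (B, B) cosine-similarity matrix

    def xent(l):
        # Cross-entropy with the diagonal (true pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average audio-to-video and video-to-audio directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 32))
# Well-aligned pairs (same underlying clip, small modality noise) ...
aligned_loss = info_nce(shared + 0.01 * rng.normal(size=(8, 32)),
                        shared + 0.01 * rng.normal(size=(8, 32)))
# ... versus completely unrelated audio/video embeddings.
random_loss = info_nce(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
```

A lower loss on the aligned batch than on the random batch is exactly what the objective rewards: embeddings of the same clip become each other's nearest neighbors across modalities, which is what makes retrieval work.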
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
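Mean average precision (mAP) on a multi-label benchmark like AudioSet is per-class average precision, averaged over classes. A minimal sketch of that common definition (not AudioSet's exact evaluation code) on a toy score/label matrix:

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: precision at each positive, averaged, ranked by score."""
    order = np.argsort(-scores)          # sort examples by descending score
    labels = labels[order]
    cum_pos = np.cumsum(labels)          # positives retrieved up to rank k
    precision_at_k = cum_pos / (np.arange(len(labels)) + 1)
    return precision_at_k[labels == 1].mean()

def mean_average_precision(scores, labels):
    """mAP: AP averaged over classes (one column per class)."""
    return np.mean([average_precision(scores[:, c], labels[:, c])
                    for c in range(scores.shape[1])])

# 4 examples, 2 classes; entries are predicted scores / binary ground truth.
scores = np.array([[0.9, 0.2],
                   [0.6, 0.8],
                   [0.3, 0.5],
                   [0.1, 0.4]])
labels = np.array([[1, 0],
                   [0, 1],
                   [1, 0],
                   [0, 1]])
mAP = mean_average_precision(scores, labels)  # both classes give AP = 5/6
```

Here each class ranks one positive first and one at rank 3, so AP is (1 + 2/3)/2 = 5/6 per class, and the mAP is 5/6 as well; a score of 0.415 on AudioSet is this quantity averaged over its 527 classes.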
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.