Contrastive Learning of Global and Local Audio-Visual Representations
- URL: http://arxiv.org/abs/2104.05418v1
- Date: Wed, 7 Apr 2021 07:35:08 GMT
- Title: Contrastive Learning of Global and Local Audio-Visual Representations
- Authors: Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song
- Abstract summary: We propose a versatile self-supervised approach to learn audio-visual representations that generalizes to tasks that require global semantic information.
We show that our approach learns generalizable video representations on various downstream scenarios including action/sound classification, lip reading, deepfake detection, and sound source localization.
- Score: 25.557229705149577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has delivered impressive results in many audio-visual
representation learning scenarios. However, existing approaches optimize for
learning either \textit{global} representations useful for tasks such as
classification, or \textit{local} representations useful for tasks such as
audio-visual source localization and separation. While they produce
satisfactory results in their intended downstream scenarios, they often fail to
generalize to tasks that they were not originally designed for. In this work,
we propose a versatile self-supervised approach to learn audio-visual
representations that generalize to both the tasks which require global semantic
information (e.g., classification) and the tasks that require fine-grained
spatio-temporal information (e.g. localization). We achieve this by optimizing
two cross-modal contrastive objectives that together encourage our model to
learn discriminative global-local visual information given audio signals. To
show that our approach learns generalizable video representations, we evaluate
it on various downstream scenarios including action/sound classification, lip
reading, deepfake detection, and sound source localization.
Related papers
- Audio-visual Generalized Zero-shot Learning the Easy Way [20.60905505473906]
We introduce EZ-AVGZL, which aligns audio-visual embeddings with transformed text representations.
We conduct extensive experiments on VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL benchmarks.
arXiv Detail & Related papers (2024-07-18T01:57:16Z) - AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning and audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z) - Prompting Segmentation with Sound Is Generalizable Audio-Visual Source
Localizer [22.846623384472377]
We introduce the encoder-prompt-decoder paradigm to decode localization from the fused audio-visual feature.
Specifically, we first propose to construct Semantic-aware Audio Prompt (SAP) to help the visual foundation model focus on sounding objects.
We develop a Correlation Adapter (ColA) to keep minimal training efforts as well as maintain adequate knowledge of the visual foundation model.
arXiv Detail & Related papers (2023-09-13T05:43:35Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Audio-visual Generalised Zero-shot Learning with Cross-modal Attention
and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z) - Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z) - Discriminative Sounding Objects Localization via Self-supervised
Audiovisual Matching [87.42246194790467]
We propose a two-stage learning framework to perform self-supervised class-aware sounding object localization.
We show that our model is superior in filtering out silent objects and pointing out the location of sounding objects of different classes.
arXiv Detail & Related papers (2020-10-12T05:51:55Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - Self-Supervised Learning of Audio-Visual Objects from Video [108.77341357556668]
We introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time.
We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks.
arXiv Detail & Related papers (2020-08-10T16:18:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.