A-JEPA: Joint-Embedding Predictive Architecture Can Listen
- URL: http://arxiv.org/abs/2311.15830v3
- Date: Thu, 11 Jan 2024 13:16:43 GMT
- Title: A-JEPA: Joint-Embedding Predictive Architecture Can Listen
- Authors: Zhengcong Fei, Mingyuan Fan, Junshi Huang
- Abstract summary: We introduce Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from the audio spectrum.
A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via a context encoder, and predicts the representations of regions sampled at well-designed locations.
- Score: 35.308323314848735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper shows that the masked-modeling principle driving the success of
large foundational vision models can be effectively applied to audio by making
predictions in a latent space. We introduce the Audio-based Joint-Embedding
Predictive Architecture (A-JEPA), a simple extension method for self-supervised
learning from the audio spectrum. Following the design of I-JEPA, our A-JEPA
encodes visible audio spectrogram patches with a curriculum masking strategy
via a context encoder, and predicts the representations of regions sampled at
well-designed locations. The target representations of those regions are
extracted by an exponential moving average of the context encoder, i.e., the
target encoder, applied to the whole spectrogram. We find it beneficial to shift
from random block masking to time-frequency-aware masking in a curriculum
manner, given the strong local correlation along time and frequency in audio
spectrograms. To enhance contextual semantic understanding and robustness, we
fine-tune the encoder with regularized masking on target datasets, rather than
dropping or zeroing the inputs. Empirically, when built on a Vision Transformer
structure, we find A-JEPA to be highly scalable: it sets new state-of-the-art
performance on multiple audio and speech classification tasks, outperforming
other recent models that use externally supervised pre-training.
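To make the pipeline described in the abstract concrete, the snippet below is a minimal, illustrative PyTorch sketch of one A-JEPA-style training step: a context encoder over the visible spectrogram patches, a target encoder maintained as an exponential moving average of it over the whole spectrogram, and a predictor trained with a latent-space loss at masked time-frequency locations. The module sizes, the column/row masking routine standing in for the curriculum strategy, the pooled-context predictor, and the smooth-L1 loss are all assumptions made for brevity, not the authors' released configuration.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def time_frequency_mask(n_time, n_freq, ratio, device):
    """Hypothetical curriculum step: drop whole time columns and frequency rows
    instead of independent random blocks, reflecting the strong local
    correlation of spectrograms along time and frequency."""
    keep = torch.ones(n_time, n_freq, dtype=torch.bool, device=device)
    keep[torch.randperm(n_time, device=device)[: int(ratio * n_time)], :] = False
    keep[:, torch.randperm(n_freq, device=device)[: int(ratio * n_freq)]] = False
    return keep.flatten()  # visibility mask over the n_time * n_freq patches

class PatchEncoder(nn.Module):
    """Stand-in for the ViT-style context / target encoder."""
    def __init__(self, patch_dim=256, dim=192, depth=4, heads=3):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, patches):                  # (B, N, patch_dim)
        return self.blocks(self.proj(patches))   # (B, N, dim)

context_enc = PatchEncoder()
target_enc = copy.deepcopy(context_enc)          # EMA copy; never back-propagated
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(192, 192), nn.GELU(), nn.Linear(192, 192))
optimizer = torch.optim.AdamW(
    list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-4)

def train_step(patches, n_time, n_freq, mask_ratio, ema=0.996):
    visible = time_frequency_mask(n_time, n_freq, mask_ratio, patches.device)
    # Context encoder sees only the visible patches.
    ctx = context_enc(patches[:, visible])
    # Target encoder sees the whole spectrogram; targets are its outputs
    # at the masked (invisible) positions.
    with torch.no_grad():
        tgt = target_enc(patches)[:, ~visible]
    # Predict the target-region representations from the context. The paper
    # predicts per location with a dedicated predictor; here a pooled context
    # code is broadcast purely to keep the sketch short.
    pred = predictor(ctx.mean(dim=1, keepdim=True)).expand_as(tgt)
    loss = F.smooth_l1_loss(pred, tgt)           # loss lives in latent space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Target encoder follows the context encoder as an exponential moving average.
    with torch.no_grad():
        for pc, pt in zip(context_enc.parameters(), target_enc.parameters()):
            pt.mul_(ema).add_(pc, alpha=1 - ema)
    return loss.item()

# Example: a batch of spectrograms split into 64 time x 8 frequency patches.
patches = torch.randn(4, 64 * 8, 256)
print(train_step(patches, n_time=64, n_freq=8, mask_ratio=0.3))
```

The essential contrast with reconstruction-based pre-training such as Audio-MAE (listed below) is that the loss compares predicted and target representations; the spectrogram itself is never reconstructed.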
Related papers
- Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning [7.083341587100975]
Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to Masked Autoencoder (MAE).
IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space.
Our "conditional" encoders show performance gains on several image classification benchmark datasets.
arXiv Detail & Related papers (2024-10-14T17:46:24Z)
- How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks [14.338754598043968]
Two competing paradigms exist for self-supervised learning of data representations.
Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other.
arXiv Detail & Related papers (2024-07-03T19:43:12Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram (a minimal sketch contrasting this with A-JEPA's latent-space objective follows after this list).
arXiv Detail & Related papers (2022-07-13T17:59:55Z)
- Multitask AET with Orthogonal Tangent Regularity for Dark Object Detection [84.52197307286681]
We propose a novel multitask auto encoding transformation (MAET) model to enhance object detection in a dark environment.
In a self-supervision manner, the MAET learns the intrinsic visual structure by encoding and decoding the realistic illumination-degrading transformation.
We have achieved the state-of-the-art performance using synthetic and real-world datasets.
arXiv Detail & Related papers (2022-05-06T16:27:14Z)
- Automatic Audio Captioning using Attention weighted Event based Embeddings [25.258177951665594]
We propose an encoder-decoder architecture with lightweight (i.e., with fewer learnable parameters) Bi-LSTM recurrent layers for AAC.
Our results show that an efficient AED based embedding extractor combined with temporal attention and augmentation techniques is able to surpass existing literature.
arXiv Detail & Related papers (2022-01-28T05:54:19Z)
- Taming Visually Guided Sound Generation [21.397106355171946]
Recent advances in visually-induced audio generation are based on sampling short, low-fidelity, and one-class sounds.
We propose a single model capable of generating high-fidelity sounds prompted with a set of frames from open-domain videos, in less time than it takes to play the sound, on a single GPU.
arXiv Detail & Related papers (2021-10-17T11:14:00Z)
- Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets [13.558688470594674]
We address voice activity detection in acoustic environments of transients and stationary noises.
We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure.
A deep neural network is trained to separate speech from non-speech frames.
arXiv Detail & Related papers (2021-06-25T17:05:26Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
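For contrast with the latent-space objective sketched above, here is a compact, hypothetical sketch of the reconstruction-style pipeline summarized in the "Masked Autoencoders that Listen" entry: encode only the visible patches under a high masking ratio, re-insert mask tokens at the original positions, and decode back to spectrogram patches. The module sizes, the random masking routine, and the masked-only loss placement are illustrative assumptions rather than that paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAudioMAE(nn.Module):
    """Toy masked autoencoder over spectrogram patches (input-space target)."""
    def __init__(self, patch_dim=256, dim=192, mask_ratio=0.8):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        enc_layer = nn.TransformerEncoderLayer(dim, 3, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, 4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, 3, dim * 2, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 2)
        self.to_patches = nn.Linear(dim, patch_dim)

    def forward(self, patches):                      # (B, N, patch_dim)
        B, N, _ = patches.shape
        n_keep = int(N * (1 - self.mask_ratio))
        perm = torch.randperm(N, device=patches.device)
        keep, drop = perm[:n_keep], perm[n_keep:]
        # Only the visible (non-masked) tokens pass through the encoder.
        enc = self.encoder(self.embed(patches[:, keep]))
        # Restore the original ordering, filling masked slots with a mask token.
        full = self.mask_token.expand(B, N, -1).clone()
        full[:, keep] = enc
        recon = self.to_patches(self.decoder(full))
        # Reconstruction loss in input (spectrogram) space, on masked patches only.
        return F.mse_loss(recon[:, drop], patches[:, drop])

model = TinyAudioMAE()
loss = model(torch.randn(2, 512, 256))   # e.g. 64 x 8 patches per spectrogram
loss.backward()
```

Here the supervision signal lives in the input spectrogram itself, which is exactly what A-JEPA's latent-space prediction avoids.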