Towards Contrastive Learning in Music Video Domain
- URL: http://arxiv.org/abs/2309.00347v1
- Date: Fri, 1 Sep 2023 09:08:21 GMT
- Title: Towards Contrastive Learning in Music Video Domain
- Authors: Karel Veldkamp, Mariya Hendriksen, Zoltán Szlávik, Alexander Keijser
- Abstract summary: We create a dual encoder for the audio and video modalities and train it using a bidirectional contrastive loss.
For the experiments, we use an industry dataset containing 550,000 music videos as well as the public Million Song Dataset.
Our results indicate that pre-trained networks without contrastive fine-tuning outperform our contrastive learning approach when evaluated on both tasks.
- Score: 46.29203572184694
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive learning is a powerful way of learning multimodal representations
across various domains such as image-caption retrieval and audio-visual
representation learning. In this work, we investigate if these findings
generalize to the domain of music videos. Specifically, we create a dual
encoder for the audio and video modalities and train it using a bidirectional
contrastive loss. For the experiments, we use an industry dataset containing
550,000 music videos as well as the public Million Song Dataset, and evaluate
the quality of learned representations on the downstream tasks of music tagging
and genre classification. Our results indicate that pre-trained networks
without contrastive fine-tuning outperform our contrastive learning approach
when evaluated on both tasks. To gain a better understanding of the reasons
contrastive learning was not successful for music videos, we perform a
qualitative analysis of the learned representations, revealing why contrastive
learning might have difficulties uniting embeddings from two modalities. Based
on these findings, we outline possible directions for future work. To
facilitate the reproducibility of our results, we share our code and the
pre-trained model.
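
The abstract describes a dual encoder for audio and video trained with a bidirectional contrastive loss. Below is a minimal sketch of what such a symmetric (CLIP-style) contrastive objective can look like, assuming batch-aligned pairs of audio and video embeddings; the function name, embedding size, and temperature value are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a bidirectional (symmetric) contrastive loss over paired audio and
# video embeddings. Dimensions and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(audio_emb: torch.Tensor,
                                   video_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """audio_emb, video_emb: (batch, dim) embeddings of matching music-video pairs."""
    # L2-normalize so dot products equal cosine similarities.
    audio_emb = F.normalize(audio_emb, dim=-1)
    video_emb = F.normalize(video_emb, dim=-1)

    # Pairwise similarity matrix; diagonal entries are the positive pairs.
    logits = audio_emb @ video_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions (audio-to-video and video-to-audio), then average.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)

# Usage with stand-in embeddings:
# loss = bidirectional_contrastive_loss(torch.randn(32, 512), torch.randn(32, 512))
```

Each audio embedding is contrasted against every video embedding in the batch and vice versa, which is what makes the loss bidirectional.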
Related papers
- Self-Supervised Contrastive Learning for Robust Audio-Sheet Music
Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
In this work, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
arXiv Detail & Related papers (2023-09-21T14:54:48Z) - Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z) - Learning from Untrimmed Videos: Self-Supervised Video Representation
Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging more abundant information in unsupervised videos.
HiCo can generate stronger representations on untrimmed videos; it also improves representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z) - ASCNet: Self-supervised Video Representation Learning with
Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z) - Contrastive Learning of Image Representations with Cross-Video
Cycle-Consistency [13.19476138523546]
Cross-video relations have barely been explored for visual representation learning.
We propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.
We show significant improvement over state-of-the-art contrastive learning methods.
arXiv Detail & Related papers (2021-05-13T17:59:11Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)