It's Time for Artistic Correspondence in Music and Video
- URL: http://arxiv.org/abs/2206.07148v1
- Date: Tue, 14 Jun 2022 20:21:04 GMT
- Title: It's Time for Artistic Correspondence in Music and Video
- Authors: Didac Suris, Carl Vondrick, Bryan Russell, Justin Salamon
- Abstract summary: We present an approach for recommending a music track for a given video, and vice versa, based on both their temporal alignment and their correspondence at an artistic level.
We propose a self-supervised approach that learns this correspondence directly from data, without any need for human annotations.
Experiments show that this approach strongly outperforms alternatives that do not exploit the temporal context.
- Score: 32.31962546363909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an approach for recommending a music track for a given video, and
vice versa, based on both their temporal alignment and their correspondence at
an artistic level. We propose a self-supervised approach that learns this
correspondence directly from data, without any need for human annotations. In
order to capture the high-level concepts that are required to solve the task,
we propose modeling the long-term temporal context of both the video and the
music signals, using Transformer networks for each modality. Experiments show
that this approach strongly outperforms alternatives that do not exploit the
temporal context. The combination of our contributions improves retrieval
accuracy by up to 10x over the prior state of the art. This strong improvement allows
us to introduce a wide range of analyses and applications. For instance, we can
condition music retrieval based on visually defined attributes.
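The retrieval setup the abstract describes, self-supervised matching of per-modality embeddings, can be sketched in miniature. This is not the authors' implementation: the use of an InfoNCE-style symmetric contrastive objective, the embedding dimension, and the temperature are assumptions; the arrays stand in for pooled outputs of the per-modality Transformer encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Unit-normalize embeddings so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def infonce_loss(video_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (video, music) pairs.

    Row i of each matrix is assumed to be the pooled embedding of the
    i-th paired clip, so matching pairs lie on the diagonal of the
    similarity matrix.
    """
    v = l2_normalize(video_emb)
    m = l2_normalize(music_emb)
    logits = v @ m.T / temperature            # (B, B) similarity matrix
    labels = np.arange(len(logits))

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Cross-entropy in both retrieval directions (video->music, music->video).
    return 0.5 * (xent(logits) + xent(logits.T))

def retrieve_music(query_video_emb, music_bank):
    # Rank music tracks for one video query by cosine similarity.
    sims = l2_normalize(music_bank) @ l2_normalize(query_video_emb)
    return np.argsort(-sims)

# Toy data: aligned pairs share latent structure plus small noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(4, 32))
video = shared + 0.05 * rng.normal(size=(4, 32))
music = shared + 0.05 * rng.normal(size=(4, 32))

loss = infonce_loss(video, music)
ranking = retrieve_music(video[2], music)   # top hit should be track 2
```

In the paper's setting the embeddings would come from Transformer encoders over long temporal windows of each modality; the loss above only illustrates how aligned pairs can be learned without human labels.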
Related papers
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z)
- Towards Leveraging Contrastively Pretrained Neural Audio Embeddings for Recommender Tasks [18.95453617434051]
Music recommender systems frequently utilize network-based models to capture relationships between music pieces, artists, and users.
New music pieces or artists often face the cold-start problem due to insufficient initial information.
To address this, one can extract content-based information directly from the music to enhance collaborative-filtering-based methods.
arXiv Detail & Related papers (2024-09-13T17:53:06Z)
- Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- Language-Guided Music Recommendation for Video via Prompt Analogies [35.48998901411509]
We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language.
Existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.
arXiv Detail & Related papers (2023-06-15T17:58:01Z)
- Generative Disco: Text-to-Video Generation for Music Visualization [9.53563436241774]
We introduce Generative Disco, a generative AI system that helps generate music visualizations with large language models and text-to-video generation.
The system helps users visualize music in intervals by finding prompts that describe the images each interval starts and ends on, and interpolating between them to the beat of the music.
We introduce design patterns for improving these generated videos: transitions, which express shifts in color, time, subject, or style, and holds, which help focus the video on subjects.
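The beat-synchronized interpolation this summary describes can be sketched as a schedule of blend weights between an interval's start and end images. This is an illustrative assumption, not the Generative Disco implementation: it assumes a constant tempo (a real system would use a beat tracker) and quantizes the weight so it only advances on beats.

```python
import numpy as np

def beat_times(bpm, duration_s):
    # Beat timestamps for a constant-tempo track (an assumption).
    period = 60.0 / bpm
    return np.arange(0.0, duration_s, period)

def interpolation_weights(bpm, duration_s, fps):
    """Per-frame blend weight between an interval's start and end image.

    The weight rises from 0 to 1 across the interval, but only advances
    at beat boundaries, so visual changes land on the music's pulse.
    """
    n_frames = int(duration_s * fps)
    t = np.arange(n_frames) / fps
    beats = beat_times(bpm, duration_s)
    # Index of the most recent beat at each frame time.
    beat_idx = np.searchsorted(beats, t, side="right") - 1
    return beat_idx / max(len(beats) - 1, 1)

# 4 seconds at 120 BPM rendered at 8 fps: weights step up once per beat.
w = interpolation_weights(bpm=120, duration_s=4.0, fps=8)
```

Each weight `w[i]` would drive the latent interpolation between the start and end images for frame `i`; easing curves or per-beat transitions could replace the linear ramp.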
arXiv Detail & Related papers (2023-04-17T18:44:00Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Video Background Music Generation: Dataset, Method and Evaluation [31.15901120245794]
We introduce a complete recipe including dataset, benchmark model, and evaluation metric for video background music generation.
We present SymMV, a video and symbolic music dataset with various musical annotations.
We also propose a benchmark video background music generation framework named V-MusProd.
arXiv Detail & Related papers (2022-11-21T08:39:48Z)
- TräumerAI: Dreaming Music with StyleGAN [2.578242050187029]
We propose a neural music visualizer directly mapping deep music embeddings to style embeddings of StyleGAN.
An annotator listened to 100 music clips, each 10 seconds long, and selected an image that suits the music from among the StyleGAN-generated examples.
The generated examples show that the mapping between audio and video achieves a certain level of intra-segment similarity and inter-segment dissimilarity.
arXiv Detail & Related papers (2021-02-09T07:04:22Z)
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
- Blind Video Temporal Consistency via Deep Video Prior [61.062900556483164]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a single pair of original and processed videos.
We show that temporal consistency can be achieved by training a convolutional network on a video with the Deep Video Prior.
arXiv Detail & Related papers (2020-10-22T16:19:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.