InverseMV: Composing Piano Scores with a Convolutional Video-Music
Transformer
- URL: http://arxiv.org/abs/2112.15320v1
- Date: Fri, 31 Dec 2021 06:39:28 GMT
- Title: InverseMV: Composing Piano Scores with a Convolutional Video-Music
Transformer
- Authors: Chin-Tung Lin, Mu Yang
- Abstract summary: We propose a novel attention-based model VMT that automatically generates piano scores from video frames.
Using music generated from models also prevents potential copyright infringements.
We release a new dataset composed of over 7 hours of piano scores with fine alignment between pop music videos and MIDI files.
- Score: 2.157478102241537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many social media users prefer consuming content in the form of videos rather
than text. However, in order for content creators to produce videos with a high
click-through rate, much editing is needed to match the footage to the music.
This poses additional challenges for amateur video makers. Therefore, we
propose a novel attention-based model VMT (Video-Music Transformer) that
automatically generates piano scores from video frames. Using music generated
from models also prevents potential copyright infringements that often come with
using existing music. To the best of our knowledge, there is no work besides
the proposed VMT that aims to compose music for video. Additionally, no dataset
with aligned video and symbolic music exists. We release a new dataset
composed of over 7 hours of piano scores with fine alignment between pop music
videos and MIDI files. We conduct experiments with human evaluation on VMT,
a SeqSeq model (our baseline), and the original piano-version soundtrack. VMT
achieves consistent improvements over the baseline on music smoothness and
video relevance. In particular, the relevance scores and our case study show
that our model can use actors' frame-level movements as a multimodal cue for
music generation. Our VMT model, along with the new dataset, presents a
promising research direction toward composing matching soundtracks for
videos. We have released our code at
https://github.com/linchintung/VMT
Related papers
- MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization [52.498942604622165]
This paper presents MuVi, a framework to generate music that aligns with video content.
MuVi analyzes video content through a specially designed visual adaptor to extract contextually and temporally relevant features.
We show that MuVi demonstrates superior performance in both audio quality and temporal synchronization.
arXiv Detail & Related papers (2024-10-16T18:44:56Z)
- VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos [32.741262543860934]
We present a framework for learning to generate background music from video inputs.
We develop a generative video-music Transformer with a novel semantic video-music alignment scheme.
A new temporal video encoder architecture allows us to efficiently process videos consisting of many densely sampled frames.
arXiv Detail & Related papers (2024-09-11T17:56:48Z)
- VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [71.01050359126141]
We propose VidMuse, a framework for generating music aligned with video inputs.
VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z)
- Video2Music: Suitable Music Generation from Videos using an Affective Multimodal Transformer model [32.801213106782335]
We develop a generative music AI framework, Video2Music, that can match a provided video.
In a thorough experiment, we show that our proposed framework can generate music that matches the video content in terms of emotion.
arXiv Detail & Related papers (2023-11-02T03:33:00Z)
- V2Meow: Meowing to the Visual Beat via Video-to-Music Generation [47.076283429992664]
V2Meow is a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types.
It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames.
arXiv Detail & Related papers (2023-05-11T06:26:41Z)
- Video Background Music Generation: Dataset, Method and Evaluation [31.15901120245794]
We introduce a complete recipe including dataset, benchmark model, and evaluation metric for video background music generation.
We present SymMV, a video and symbolic music dataset with various musical annotations.
We also propose a benchmark video background music generation framework named V-MusProd.
arXiv Detail & Related papers (2022-11-21T08:39:48Z)
- Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer [66.56167074658697]
We present a method that builds on 3D-VQGAN and transformers to generate videos with thousands of frames.
Our evaluation shows that our model trained on 16-frame video clips can generate diverse, coherent, and high-quality long videos.
We also showcase conditional extensions of our approach for generating meaningful long videos by incorporating temporal information with text and audio.
arXiv Detail & Related papers (2022-04-07T17:59:02Z)
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)
- Foley Music: Learning to Generate Music from Videos [115.41099127291216]
Foley Music is a system that can synthesize plausible music for a silent video clip about people playing musical instruments.
We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings.
We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements.
arXiv Detail & Related papers (2020-07-21T17:59:06Z)
- Audeo: Audio Generation for a Silent Performance Video [17.705770346082023]
We present a novel system that takes as input video frames of a musician playing the piano and generates the music for that video.
Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events.
arXiv Detail & Related papers (2020-06-23T00:58:59Z)