Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
- URL: http://arxiv.org/abs/2505.20038v1
- Date: Mon, 26 May 2025 14:24:19 GMT
- Title: Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
- Authors: Chang Liu, Haomin Zhang, Shiyu Xia, Zihao Chen, Chaofan Ding, Xin Yue, Huizhe Chen, Xinhan Di
- Abstract summary: The Chain-of-Perform (CoP) benchmark is a fully open-sourced, multimodal benchmark for video-guided piano music generation. It offers detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio. The dataset is publicly available at https://github.com/acappemin/Video-to-Audio-and-Piano.
- Score: 6.895278984923356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating high-quality piano audio from video requires precise synchronization between visual cues and musical output, ensuring accurate semantic and temporal alignment. However, existing evaluation datasets do not fully capture the intricate synchronization required for piano music generation. A comprehensive benchmark is essential for two primary reasons: (1) existing metrics fail to reflect the complexity of video-to-piano music interactions, and (2) a dedicated benchmark dataset can provide valuable insights to accelerate progress in high-quality piano music generation. To address these challenges, we introduce the CoP Benchmark Dataset, a fully open-sourced, multimodal benchmark designed specifically for video-guided piano music generation. The proposed Chain-of-Perform (CoP) benchmark offers several compelling features: (1) detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio via step-by-step Chain-of-Perform guidance; (2) a versatile evaluation framework for rigorous assessment of both general-purpose and specialized video-to-piano generation tasks; and (3) full open-sourcing of the dataset, annotations, and evaluation protocols. The dataset is publicly available at https://github.com/acappemin/Video-to-Audio-and-Piano, with a continuously updated leaderboard to promote ongoing research in this domain.
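As a rough illustration of how such alignment annotations might be consumed, the sketch below loads a hypothetical CoP-style annotation file and flags clips whose visual cues and note onsets agree within 100 ms. The file name and JSON schema are assumptions made for illustration, not the released format.

```python
# Hypothetical sketch: the schema (visual_cues/notes with per-event times)
# and the file name are illustrative assumptions, not the released format.
import json

def load_annotations(path):
    # Each entry is assumed to pair a video segment with piano-note events.
    with open(path) as f:
        return json.load(f)

def max_onset_error(entry):
    # Largest gap (seconds) between an annotated visual cue and its note onset.
    return max(abs(cue["time_s"] - note["onset_s"])
               for cue, note in zip(entry["visual_cues"], entry["notes"]))

anns = load_annotations("cop_annotations.json")
well_aligned = [a for a in anns if max_onset_error(a) < 0.1]
print(f"{len(well_aligned)}/{len(anns)} clips aligned within 100 ms")
```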
Related papers
- VC-Bench: Pioneering the Video Connecting Benchmark with a Dataset and Evaluation Metrics [83.61875204972465]
We introduce Video Connecting, a task that aims to generate smooth intermediate video content between given start and end clips. To bridge this gap, we propose VC-Bench, a novel benchmark specifically designed for video connecting. VC-Bench focuses on three core aspects: Video Quality Score (VQS), Start-End Consistency Score (SECS), and Transition Smoothness Score (TSS).
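The summary names three component scores but not how they are combined; as a minimal sketch, the snippet below assumes a simple unweighted mean for leaderboard-style comparison.

```python
# Illustrative only: the abstract does not state how (or whether) the three
# VC-Bench scores are aggregated; an unweighted mean is assumed here.
def vc_bench_summary(vqs: float, secs: float, tss: float) -> float:
    # Combine Video Quality (VQS), Start-End Consistency (SECS), and
    # Transition Smoothness (TSS) scores into a single comparison number.
    return (vqs + secs + tss) / 3.0

print(vc_bench_summary(0.81, 0.74, 0.69))  # 0.746...
```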
arXiv Detail & Related papers (2026-01-27T06:15:12Z)
- Two Web Toolkits for Multimodal Piano Performance Dataset Acquisition and Fingering Annotation [56.318475235705954]
We present an integrated web toolkit comprising two graphical user interfaces (GUIs). PiaRec supports the synchronized acquisition of audio, video, MIDI, and performance metadata. ASDF enables the efficient annotation of performer fingering from the visual data.
arXiv Detail & Related papers (2025-09-18T17:59:24Z)
- PianoVAM: A Multimodal Piano Performance Dataset [56.318475235705954]
PianoVAM is a comprehensive piano performance dataset that includes videos, audio, MIDI, hand landmarks, fingering labels, and rich metadata. The dataset was recorded using a Disklavier piano, capturing audio and MIDI from amateur pianists during their daily practice sessions. Hand landmarks and fingering labels were extracted using a pretrained hand pose estimation model and a semi-automated fingering annotation algorithm.
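The abstract does not name the hand pose estimator; the sketch below shows per-frame landmark extraction with MediaPipe Hands, a common off-the-shelf choice, using a hypothetical video file name.

```python
# Sketch of per-frame hand landmark extraction; MediaPipe Hands is an
# assumed stand-in for the unnamed pretrained model in the abstract.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
cap = cv2.VideoCapture("practice_session.mp4")  # hypothetical file name

frame_landmarks = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        # 21 (x, y) keypoints per detected hand, normalized to [0, 1]
        frame_landmarks.append(
            [[(lm.x, lm.y) for lm in hand.landmark]
             for hand in result.multi_hand_landmarks])
cap.release()
```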
arXiv Detail & Related papers (2025-09-10T17:35:58Z)
- MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks [67.31276358668424]
We introduce a novel task named AV-HaystacksQA, where the goal is to identify salient segments across different videos in response to a query and link them together to generate the most informative answer. AVHaystacks is an audio-visual benchmark comprising 3100 annotated QA pairs designed to assess the capabilities of LMMs in multi-video retrieval and temporal grounding tasks. We propose a model-agnostic, multi-agent framework to address this challenge, achieving up to 89% and 65% relative improvements over baseline methods on BLEU@4 and GPT evaluation scores in the QA task on our proposed AVHaystacks.
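For readers unfamiliar with the metric, BLEU@4 is 4-gram BLEU; the sketch below computes it with NLTK on made-up strings, which may differ in detail from the paper's exact evaluation setup.

```python
# Toy BLEU@4 computation with NLTK; the reference/candidate strings are
# invented, and the paper's evaluation pipeline may differ in detail.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the pianist plays the opening theme in the second clip".split()]
candidate = "the pianist performs the opening theme in clip two".split()

score = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),  # uniform up to 4-grams
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU@4 = {score:.3f}")
```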
arXiv Detail & Related papers (2025-06-08T06:34:29Z)
- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets [62.280729345770936]
We introduce the task of Alignable Video Retrieval (AVR).
Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query.
Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach.
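To make the synchronization step concrete, here is a generic sketch (not the paper's actual method): given per-frame embeddings for the query and a retrieved clip, slide one sequence over the other and keep the temporal offset with the highest mean cosine similarity.

```python
# Generic temporal-alignment sketch; feature extraction is assumed to
# happen elsewhere, and AVR's real alignment procedure may differ.
import numpy as np

def best_offset(query: np.ndarray, clip: np.ndarray, max_shift: int) -> int:
    # query, clip: (T, D) arrays of L2-normalized per-frame features.
    # Returns the shift s that best aligns clip[t + s] with query[t].
    scores = {}
    for s in range(-max_shift, max_shift + 1):
        q = query[max(0, -s):query.shape[0] - max(0, s)]
        c = clip[max(0, s):clip.shape[0] - max(0, -s)]
        n = min(len(q), len(c))
        if n > 0:
            scores[s] = float((q[:n] * c[:n]).sum(axis=1).mean())
    return max(scores, key=scores.get)
```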
arXiv Detail & Related papers (2024-09-02T20:00:49Z)
- End-to-End Real-World Polyphonic Piano Audio-to-Score Transcription with Hierarchical Decoding [4.604877755214193]
Existing end-to-end piano A2S systems have been trained and evaluated with only synthetic data.
We propose a sequence-to-sequence (Seq2Seq) model with a hierarchical decoder that aligns with the hierarchical structure of musical scores.
We propose a two-stage training scheme, which involves pre-training the model using an expressive performance rendering system on synthetic audio, followed by fine-tuning the model using recordings of human performance.
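A minimal sketch of that two-stage scheme follows; the model, data loaders, and the lower fine-tuning learning rate are illustrative assumptions rather than the paper's exact configuration.

```python
# Two-stage training skeleton: pre-train on synthetic audio, then fine-tune
# on human recordings. Hyperparameters and loaders are assumed for
# illustration; only the staging pattern mirrors the description above.
import torch
import torch.nn.functional as F

def run_stage(model, loader, lr, epochs):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for audio, score_tokens in loader:
            opt.zero_grad()
            logits = model(audio, score_tokens[:, :-1])  # teacher forcing
            loss = F.cross_entropy(logits.flatten(0, 1),
                                   score_tokens[:, 1:].flatten())
            loss.backward()
            opt.step()

# Stage 1: synthetic audio rendered from scores; Stage 2: real performances.
# run_stage(model, synthetic_loader, lr=1e-4, epochs=50)
# run_stage(model, human_recording_loader, lr=1e-5, epochs=10)
```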
arXiv Detail & Related papers (2024-05-22T10:52:04Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standardized assessment of the representations from all open-source pre-trained models developed on music recordings as baselines.
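Benchmarks of this kind usually evaluate frozen representations with lightweight probes; the sketch below fits a linear probe on precomputed embeddings for one hypothetical downstream task, with random arrays standing in for real features.

```python
# Linear-probe sketch over frozen embeddings; the arrays are random
# placeholders and the probing setup is assumed, not MARBLE's exact recipe.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 512)), rng.integers(0, 10, 800)
X_test, y_test = rng.normal(size=(200, 512)), rng.integers(0, 10, 200)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```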
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
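The matching stage can be pictured as a nearest-neighbor search in feature space; the sketch below finds, for one query point's feature vector, the best-matching location in every frame by cosine similarity. TAPIR's actual model uses learned cost volumes plus the refinement stage, so this is only a schematic of stage (1).

```python
# Schematic of the per-frame matching stage only; the learned components
# of TAPIR (cost volumes, refinement) are not represented here.
import numpy as np

def match_query(query_feat: np.ndarray, frame_feats: np.ndarray) -> np.ndarray:
    # query_feat: (D,); frame_feats: (T, H, W, D), both L2-normalized.
    # Returns a (T, 2) array of (row, col) candidate positions per frame.
    sims = np.einsum("thwd,d->thw", frame_feats, query_feat)
    flat = sims.reshape(sims.shape[0], -1).argmax(axis=1)
    return np.stack(np.unravel_index(flat, sims.shape[1:]), axis=1)
```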
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- Long-Term Rhythmic Video Soundtracker [37.082768654951465]
We present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework to synthesize long-term conditional waveforms.
We extend our model's applicability from dances to multiple sports scenarios such as floor exercise and figure skating.
Our model generates long-term soundtracks with state-of-the-art musical quality and rhythmic correspondence.
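Rhythmic correspondence is often scored by how closely beats in the generated audio land on reference rhythmic events; the librosa-based sketch below is one such illustrative metric, not the evaluation protocol LORIS actually uses.

```python
# Illustrative beat-alignment metric; reference event times (e.g., motion
# peaks from the video) are assumed to be extracted elsewhere.
import numpy as np
import librosa

def beat_hit_rate(audio_path, ref_event_times, tol=0.1):
    # Fraction of reference events with a detected beat within `tol` seconds.
    y, sr = librosa.load(audio_path)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    if beat_times.size == 0:
        return 0.0
    hits = [np.abs(beat_times - t).min() < tol for t in ref_event_times]
    return float(np.mean(hits))
```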
arXiv Detail & Related papers (2023-05-02T10:58:29Z)
- Video Background Music Generation: Dataset, Method and Evaluation [31.15901120245794]
We introduce a complete recipe including dataset, benchmark model, and evaluation metric for video background music generation.
We present SymMV, a video and symbolic music dataset with various musical annotations.
We also propose a benchmark video background music generation framework named V-MusProd.
arXiv Detail & Related papers (2022-11-21T08:39:48Z)
- Audio-Visual Synchronisation in the wild [149.84890978170174]
We identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync.
We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length.
We set the first benchmark for general audio-visual synchronisation with over 160 diverse classes in the new VGG-Sound Sync video dataset.
arXiv Detail & Related papers (2021-12-08T17:50:26Z)
- Lets Play Music: Audio-driven Performance Video Generation [58.77609661515749]
We propose a new task named Audio-driven Performance Video Generation (APVG).
APVG aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip.
arXiv Detail & Related papers (2020-11-05T03:13:46Z)