Related papers: PianoMime: Learning a Generalist, Dexterous Piano Player from Internet Demonstrations

URL: http://arxiv.org/abs/2407.18178v1
Date: Thu, 25 Jul 2024 16:37:07 GMT
Title: PianoMime: Learning a Generalist, Dexterous Piano Player from Internet Demonstrations
Authors: Cheng Qian, Julen Urain, Kevin Zakka, Jan Peters
Abstract summary: We introduce PianoMime, a framework for training a piano-playing agent using internet demonstrations. In our work, we leverage these demonstrations to learn a generalist piano-playing agent capable of playing any arbitrary song. We show that we are able to learn a policy with up to 56% F1 score on unseen songs.
Score: 21.52466727496551
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we introduce PianoMime, a framework for training a piano-playing agent using internet demonstrations. The internet is a promising source of large-scale demonstrations for training our robot agents. In particular, for the case of piano-playing, YouTube is full of videos of professional pianists playing a wide variety of songs. In our work, we leverage these demonstrations to learn a generalist piano-playing agent capable of playing any arbitrary song. Our framework is divided into three parts: a data preparation phase to extract the informative features from the YouTube videos, a policy learning phase to train song-specific expert policies from the demonstrations and a policy distillation phase to distil the policies into a single generalist agent. We explore different policy designs to represent the agent and evaluate the influence of the amount of training data on the generalization capability of the agent to novel songs not available in the dataset. We show that we are able to learn a policy with up to 56% F1 score on unseen songs.
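
For reference, the F1 score quoted in the abstract is the harmonic mean of precision and recall. The sketch below assumes key presses are compared per timestep as sets of pressed key indices. The function name `key_press_f1` and this frame-wise setup are illustrative assumptions, not the paper's evaluation code.

```python
def key_press_f1(predicted, target):
    """Frame-wise F1 over pressed keys.

    predicted, target: lists of sets, one set of key indices per timestep.
    This representation is hypothetical and chosen only for illustration.
    """
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")

    tp = fp = fn = 0
    for pred, tgt in zip(predicted, target):
        tp += len(pred & tgt)   # keys pressed that should be pressed
        fp += len(pred - tgt)   # keys pressed that should not be pressed
        fn += len(tgt - pred)   # keys missed

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a score of 56% means the harmonic mean of precision and recall over key presses is 0.56; the paper may aggregate over timesteps or songs differently.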

Related papers

Learning to Play Piano in the Real World [3.824631943614614]
We develop the first piano-playing robotic system that makes use of learning approaches while also being deployed on a real-world dexterous robot. Specifically, we make use of Sim2Real to train a policy in simulation using reinforcement learning before deploying the learned policy on a real-world dexterous robot.
arXiv Detail & Related papers (2025-03-19T17:56:14Z)
PianoBART: Symbolic Piano Music Generation and Understanding with Large-Scale Pre-Training [8.484581633133542]
PianoBART is a pre-trained model that uses BART for both symbolic piano music generation and understanding. We devise a multi-level object selection strategy for different pre-training tasks of PianoBART, which can prevent information leakage or loss. Experiments demonstrate that PianoBART efficiently learns musical patterns and achieves outstanding performance in generating high-quality coherent pieces.
arXiv Detail & Related papers (2024-06-26T03:35:54Z)
Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame. ATM outperforms strong video pre-training baselines by 80% on average. We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
Learning to Act from Actionless Videos through Dense Correspondences [87.1243107115642]
We present an approach to construct a video-based robot policy capable of reliably executing diverse tasks across different robots and environments. Our method leverages images as a task-agnostic representation, encoding both the state and action information, and text as a general representation for specifying robot goals. We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks.
arXiv Detail & Related papers (2023-10-12T17:59:23Z)
RoboCLIP: One Demonstration is Enough to Learn Robot Policies [72.24495908759967]
RoboCLIP is an online imitation learning method that uses a single demonstration in the form of a video demonstration or a textual description of the task to generate rewards. RoboCLIP can also utilize out-of-domain demonstrations, like videos of humans solving the task for reward generation, circumventing the need to have the same demonstration and deployment domains.
arXiv Detail & Related papers (2023-10-11T21:10:21Z)
At Your Fingertips: Extracting Piano Fingering Instructions from Videos [45.643494669796866]
We consider the AI task of automating the extraction of fingering information from videos. We show how to perform this task with high accuracy using a combination of deep-learning modules. We run the resulting system on 90 videos, resulting in high-quality piano fingering information for 150K notes.
arXiv Detail & Related papers (2023-03-07T09:09:13Z)
Pop2Piano : Pop Audio-based Piano Cover Generation [14.901465561297178]
We present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to generate a piano cover directly from pop audio without using melody and chord extraction modules.
arXiv Detail & Related papers (2022-11-02T05:42:22Z)
Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track. We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
Towards Learning to Play Piano with Dexterous Hands and Touch [79.48656721563795]
We demonstrate how an agent can learn directly from machine-readable music score to play the piano with dexterous hands on a simulated piano. We achieve this by using a touch-augmented reward and a novel curriculum of tasks.
arXiv Detail & Related papers (2021-06-03T17:59:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.