Learning Music Sequence Representation from Text Supervision
- URL: http://arxiv.org/abs/2305.19602v1
- Date: Wed, 31 May 2023 07:15:06 GMT
- Title: Learning Music Sequence Representation from Text Supervision
- Authors: Tianyu Chen, Yuan Xie, Shuai Zhang, Shaohan Huang, Haoyi Zhou, Jianxin Li
- Abstract summary: Music representation learning is notoriously difficult because of the complex human-related concepts embedded in sequences of numerical signals.
We propose a novel text-supervision pre-training method, namely MUSER.
It requires only 0.056% of the pre-training data to achieve state-of-the-art performance.
- Score: 31.90882003611554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music representation learning is notoriously difficult because of the
complex human-related concepts embedded in sequences of numerical signals. To
excavate better MUsic SEquence Representation from labeled audio, we propose a
novel text-supervision pre-training method, namely MUSER. MUSER adopts an
audio-spectrum-text tri-modal contrastive learning framework, where the text
input can be any form of metadata with the help of text templates, while the
spectrum is derived from the audio sequence. Our experiments reveal that MUSER
adapts more flexibly to downstream tasks than current data-hungry pre-training
methods, and it requires only 0.056% of the pre-training data to achieve
state-of-the-art performance.
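A minimal sketch of the tri-modal objective described above, assuming a CLIP-style symmetric InfoNCE loss applied pairwise across audio, spectrum, and text embeddings; the encoders, embedding size, temperature, and the metadata template are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch of tri-modal contrastive pre-training; all specifics
# (temperature, dimension, template wording) are assumptions, not the paper's.
import torch
import torch.nn.functional as F

def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of paired embeddings."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def tri_modal_loss(audio_emb, spectrum_emb, text_emb):
    """Sum the pairwise contrastive losses over the three modalities."""
    return (info_nce(audio_emb, text_emb)
            + info_nce(spectrum_emb, text_emb)
            + info_nce(audio_emb, spectrum_emb))

def metadata_to_text(genre: str, artist: str) -> str:
    """Hypothetical text template turning raw metadata into a caption."""
    return f"A {genre} track performed by {artist}."

# Toy usage with random stand-ins for encoder outputs.
B, D = 32, 512
loss = tri_modal_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```

Summing the three pairwise losses is one plausible way to couple the modalities; the paper may weight or select the pairs differently.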
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Text Conditioned Symbolic Drumbeat Generation using Latent Diffusion Models [0.0]
This study introduces a text-conditioned approach to generating drumbeats with Latent Diffusion Models (LDMs).
By pretraining a text and drumbeat encoder through contrastive learning within a multimodal network, we align the modalities of text and music closely.
We show that the generated drumbeats are novel and apt to the prompt text, and comparable in quality to those created by human musicians.
arXiv Detail & Related papers (2024-07-26T07:30:41Z)
- Learning Robust Named Entity Recognizers From Noisy Data With Retrieval Augmentation [67.89838237013078]
Named entity recognition (NER) models often struggle with noisy inputs.
We propose a more realistic setting in which only noisy text and its NER labels are available.
We employ a multi-view training framework that improves robust NER without retrieving text during inference.
arXiv Detail & Related papers (2024-04-28T07:47:52Z)
- Semi-supervised Text-based Person Search [47.14739994781334]
Existing methods rely on massive annotated image-text data to achieve satisfactory performance in fully-supervised learning.
We present a two-stage basic solution based on generation-then-retrieval for semi-supervised TBPS.
We propose a noise-robust retrieval framework that enhances the ability of the retrieval model to handle noisy data.
arXiv Detail & Related papers (2024-04-28T07:47:52Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Self-supervised Context-aware Style Representation for Expressive Speech Synthesis [23.460258571431414]
We propose a novel framework for learning style representation from plain text in a self-supervised manner.
It leverages an emotion lexicon and uses contrastive learning and deep clustering.
Our method achieves improved results according to subjective evaluations on both in-domain and out-of-domain test sets in audiobook speech.
arXiv Detail & Related papers (2022-06-25T05:29:48Z)
- SVTS: Scalable Video-to-Speech Synthesis [105.29009019733803]
We introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder.
We are the first to show intelligible results on the challenging LRS3 dataset.
arXiv Detail & Related papers (2022-05-04T13:34:07Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
Our proposed system achieves a significant improvement on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank (a toy computation of these metrics appears after this list).
arXiv Detail & Related papers (2022-03-25T13:41:17Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- TEACHTEXT: CrossModal Generalized Distillation for Text-Video Retrieval [103.85002875155551]
We propose a novel generalized distillation method, TeachText, for exploiting large-scale language pretraining.
We extend our method to video side modalities and show that we can effectively reduce the number of used modalities at test time.
Our approach advances the state of the art on several video retrieval benchmarks by a significant margin and adds no computational overhead at test time.
arXiv Detail & Related papers (2021-04-16T17:55:28Z)
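Several of the retrieval entries above report recall together with median and mean rank. A minimal sketch of how these metrics fall out of a query-by-candidate similarity matrix, assuming matched pairs sit on the diagonal (all names here are illustrative, not taken from any of the papers):

```python
# Toy retrieval-metric computation; variable names are illustrative only.
import torch

def retrieval_metrics(sim: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """Compute recall@k, median rank, and mean rank from an (N, N) similarity matrix."""
    n = sim.size(0)
    order = sim.argsort(dim=1, descending=True)    # candidates sorted per query
    # Rank (1 = best) of the correct candidate i for each query i.
    ranks = (order == torch.arange(n).unsqueeze(1)).nonzero()[:, 1] + 1
    metrics = {f"recall@{k}": (ranks <= k).float().mean().item() for k in ks}
    metrics["median_rank"] = ranks.float().median().item()
    metrics["mean_rank"] = ranks.float().mean().item()
    return metrics

# Toy usage with random stand-ins for text and audio embeddings.
text_emb = torch.nn.functional.normalize(torch.randn(32, 512), dim=-1)
audio_emb = torch.nn.functional.normalize(torch.randn(32, 512), dim=-1)
print(retrieval_metrics(text_emb @ audio_emb.t()))  # text-to-audio direction
```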
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.