A General Framework for Learning Prosodic-Enhanced Representation of Rap Lyrics
- URL: http://arxiv.org/abs/2103.12615v1
- Date: Tue, 23 Mar 2021 15:13:21 GMT
- Title: A General Framework for Learning Prosodic-Enhanced Representation of Rap Lyrics
- Authors: Hongru Liang, Haozheng Wang, Qian Li, Jun Wang, Guandong Xu, Jiawei Chen, Jin-Mao Wei, Zhenglu Yang
- Abstract summary: Learning and analyzing rap lyrics is a significant basis for many web applications.
We propose a hierarchical attention variational autoencoder framework (HAVAE).
A feature aggregation strategy is proposed to appropriately integrate various features and generate a prosodic-enhanced representation.
- Score: 21.944835086749375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning and analyzing rap lyrics is a significant basis for many web
applications, such as music recommendation, automatic music categorization, and
music information retrieval, given the abundance of digital music on the World
Wide Web. Although numerous studies have explored the topic, knowledge in this
field remains far from satisfactory, because critical issues, such as prosodic
information and its effective representation, as well as the appropriate
integration of various features, are usually ignored. In this paper, we propose
a hierarchical attention variational autoencoder framework (HAVAE) that
simultaneously considers semantic and prosodic features for rap lyrics
representation learning. Specifically, the representation of the prosodic
features is encoded from phonetic transcriptions with a novel and effective
strategy (i.e., rhyme2vec). Moreover, a feature aggregation strategy is
proposed to appropriately integrate various features and generate a
prosodic-enhanced representation. A comprehensive empirical evaluation
demonstrates that the proposed framework outperforms state-of-the-art
approaches under various metrics on different rap lyrics learning tasks.
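The abstract only sketches the architecture at a high level: rhyme2vec encodes prosodic features from phonetic transcriptions, and an aggregation strategy fuses them with semantic features inside a variational autoencoder. The PyTorch sketch below illustrates that general shape under stated assumptions; the layer sizes, the two-view attention fusion, and all names (rhyme_tail, ProsodicEnhancedVAE) are hypothetical illustrations, not the paper's actual design.

```python
import torch
import torch.nn as nn

def rhyme_tail(phoneme_lines, n=2):
    # Hypothetical prosodic cue: the last n phonemes of each lyric line,
    # where end rhymes typically live (a crude stand-in for rhyme2vec input).
    return [line[-n:] for line in phoneme_lines]

class ProsodicEnhancedVAE(nn.Module):
    """Fuses a semantic view and a prosodic view of a lyric with attention,
    then encodes the fused vector as a latent Gaussian (standard VAE)."""

    def __init__(self, sem_dim=300, pro_dim=100, hid_dim=128, lat_dim=64):
        super().__init__()
        self.attn = nn.Linear(sem_dim + pro_dim, 2)  # one weight per view
        self.proj_sem = nn.Linear(sem_dim, hid_dim)
        self.proj_pro = nn.Linear(pro_dim, hid_dim)
        self.mu = nn.Linear(hid_dim, lat_dim)
        self.logvar = nn.Linear(hid_dim, lat_dim)
        self.decoder = nn.Linear(lat_dim, sem_dim + pro_dim)

    def forward(self, sem, pro):
        # Attention-weighted fusion of the two feature views.
        w = torch.softmax(self.attn(torch.cat([sem, pro], dim=-1)), dim=-1)
        fused = w[:, :1] * self.proj_sem(sem) + w[:, 1:] * self.proj_pro(pro)
        mu, logvar = self.mu(fused), self.logvar(fused)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.decoder(z), mu, logvar

# Usage on a batch of 4 lyrics with random stand-in features.
sem, pro = torch.randn(4, 300), torch.randn(4, 100)
model = ProsodicEnhancedVAE()
recon, mu, logvar = model(sem, pro)
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
loss = nn.functional.mse_loss(recon, torch.cat([sem, pro], dim=-1)) + kl
```

The reparameterization trick and KL term are standard VAE machinery; the attention weights let the model decide, per lyric, how much the prosodic view contributes relative to the semantic one.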
Related papers
- VyAnG-Net: A Novel Multi-Modal Sarcasm Recognition Model by Uncovering Visual, Acoustic and Glossary Features [13.922091192207718]
Sarcasm recognition aims to identify hidden sarcastic, criticizing, and metaphorical information embedded in everyday dialogue.
We propose a novel approach that combines a lightweight depth attention module with a self-regulated ConvNet to concentrate on the most crucial features of visual data.
We have also conducted a cross-dataset analysis to test the adaptability of VyAnG-Net with unseen samples of another dataset, MUStARD++.
arXiv Detail & Related papers (2024-08-05T15:36:52Z)
- Detecting Synthetic Lyrics with Few-Shot Inference [5.448536338411993]
We have curated the first dataset of high-quality synthetic lyrics.
Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods.
This study emphasizes the need for further research on creative content detection.
arXiv Detail & Related papers (2024-06-21T15:19:21Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Language Guided Domain Generalized Medical Image Segmentation [68.93124785575739]
Single-source domain generalization holds promise for more reliable and consistent image segmentation across real-world clinical settings.
We propose an approach that explicitly leverages textual information by incorporating a contrastive learning mechanism guided by the text encoder features.
Our approach achieves favorable performance against existing methods in the literature.
arXiv Detail & Related papers (2024-04-01T17:48:15Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- A Phoneme-Informed Neural Network Model for Note-Level Singing Transcription [11.951441023641975]
We propose a method of finding note onsets of the singing voice more accurately by leveraging the linguistic characteristics of singing.
Our approach substantially improves the performance of singing transcription and emphasizes the importance of linguistic features in singing analysis.
arXiv Detail & Related papers (2023-04-12T15:36:01Z)
- The Music Annotation Pattern [1.2043574473965315]
We introduce the Music Annotation Pattern, an Ontology Design Pattern (ODP) to homogenise different annotation systems and to represent several types of musical objects.
Our ODP accounts for multi-modality upfront, to describe annotations derived from different sources, and it is the first to enable the integration of music datasets at a large scale.
arXiv Detail & Related papers (2023-03-30T11:13:59Z)
- Knowledge Graph Augmented Network Towards Multiview Representation Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge with explicitly syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax-, and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
arXiv Detail & Related papers (2022-01-13T08:25:53Z)
- High-dimensional distributed semantic spaces for utterances [0.2907403645801429]
This paper describes a model of high-dimensional representation for utterance- and text-level data.
It is based on a mathematically principled and behaviourally plausible approach to representing linguistic information.
The paper shows how the implemented model is able to represent a broad range of linguistic features in a common integral framework of fixed dimensionality.
arXiv Detail & Related papers (2021-04-01T12:09:47Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)