Exploring Latent Spaces of Tonal Music using Variational Autoencoders
- URL: http://arxiv.org/abs/2311.03621v1
- Date: Tue, 7 Nov 2023 00:15:29 GMT
- Title: Exploring Latent Spaces of Tonal Music using Variational Autoencoders
- Authors: Nádia Carvalho, Gilberto Bernardes
- Abstract summary: Variational Autoencoders (VAEs) have proven to be effective models for producing latent representations of cognitive and semantic value.
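We assess the degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach chorales define latent spaces representative of the circle of fifths and the hierarchical relation of each key component pitch as drawn in music cognition.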
- Score: 0.9065034043031668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Variational Autoencoders (VAEs) have proven to be effective models for
producing latent representations of cognitive and semantic value. We assess the
degree to which VAEs trained on a prototypical tonal music corpus of 371 Bach
chorales define latent spaces representative of the circle of fifths and the
hierarchical relation of each key component pitch as drawn in music cognition.
In detail, we compare the latent space of different VAE corpus encodings --
Piano roll, MIDI, ABC, Tonnetz, DFT of pitch, and pitch class distributions --
in providing a pitch space for key relations that align with cognitive
distances. We evaluate the model performance of these encodings using objective
metrics to capture accuracy, mean square error (MSE), KL-divergence, and
computational cost. The ABC encoding performs the best in reconstructing the
original data, while the Pitch DFT seems to capture more information from the
latent space. Furthermore, an objective evaluation of 12 major or minor
transpositions per piece is adopted to quantify the alignment of 1) intra- and
inter-segment distances per key and 2) the key distances to cognitive pitch
spaces. Our results show that Pitch DFT VAE latent spaces align best with
cognitive spaces and provide a common-tone space where overlapping objects
within a key are fuzzy clusters, which impose a well-defined order of
structural significance or stability -- i.e., a tonal hierarchy. Tonal
hierarchies of different keys can be used to measure key distances and the
relationships of their in-key components at multiple hierarchies (e.g., notes
and chords). The implementation of our VAE and the encoding framework is
available online.
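The abstract compares encodings based on pitch class distributions and their DFT across 12 transpositions per piece. The sketch below illustrates one general property that makes a DFT encoding attractive for key relations. It is not the authors' implementation. It assumes the DFT is taken over a 12-bin pitch class vector, which the abstract does not specify, and all function names and the example notes are hypothetical. Transposing a segment circularly shifts its pitch class vector, so the DFT magnitudes stay the same and only the phases rotate. For the fifth coefficient, transposition by a perfect fifth (7 semitones) rotates the phase by 2π/12, one step on the circle of fifths.

```python
import cmath
import math


def pitch_class_distribution(midi_pitches):
    """Return a 12-bin histogram of pitch classes, normalised to sum to 1."""
    counts = [0.0] * 12
    for pitch in midi_pitches:
        counts[pitch % 12] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def dft(vector):
    """Plain discrete Fourier transform of a real-valued sequence."""
    n = len(vector)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(vector))
        for k in range(n)
    ]


def transpose(midi_pitches, semitones):
    """Shift every note by the same number of semitones."""
    return [pitch + semitones for pitch in midi_pitches]


# Hypothetical segment: a C major scale with the tonic triad doubled.
c_major = [60, 62, 64, 65, 67, 69, 71, 60, 64, 67]
g_major = transpose(c_major, 7)  # up a perfect fifth

spectrum_c = dft(pitch_class_distribution(c_major))
spectrum_g = dft(pitch_class_distribution(g_major))

# Magnitudes are unchanged by transposition.
mags_c = [abs(z) for z in spectrum_c]
mags_g = [abs(z) for z in spectrum_g]

# The phase of coefficient 5 rotates by 2*pi/12 for a transposition by a fifth.
phase_shift = (cmath.phase(spectrum_g[5]) - cmath.phase(spectrum_c[5])) % (2 * math.pi)

print("max magnitude difference:", max(abs(a - b) for a, b in zip(mags_c, mags_g)))
print("phase shift of coefficient 5 (radians):", phase_shift)
```

Under these assumptions, closely related keys sit at neighbouring phases of the fifth coefficient. This is one plausible reason a DFT-based encoding could yield latent spaces that align with the circle of fifths.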
Related papers
- Frequency-Spatial Entanglement Learning for Camouflaged Object Detection [34.426297468968485]
Existing methods attempt to reduce the impact of pixel similarity by maximizing the distinguishing ability of spatial features with complicated design.
We propose a new approach to address this issue by jointly exploring the representation in the frequency and spatial domains, introducing the Frequency-Spatial Entanglement Learning (FSEL) method.
Our experiments demonstrate the superiority of our FSEL over 21 state-of-the-art methods, through comprehensive quantitative and qualitative comparisons in three widely-used datasets.
(2024-09-03T07:58:47Z)
- Free-text Keystroke Authentication using Transformers: A Comparative Study of Architectures and Loss Functions [1.0152838128195467]
Keystroke biometrics is a promising approach for user identification and verification, leveraging the unique patterns in individuals' typing behavior.
We propose a Transformer-based network that employs self-attention to extract informative features from keystroke sequences.
Our model surpasses the previous state-of-the-art in free-text keystroke authentication.
(2023-10-18T00:34:26Z)
- Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching [74.75284453828017]
Open-Vocabulary Keypoint Detection (OVKD) task is innovatively designed to use text prompts for identifying arbitrary keypoints across any species.
We have developed a novel framework named Open-Vocabulary Keypoint Detection with Semantic-feature Matching (KDSM).
This framework combines vision and language models, creating an interplay between language features and local keypoint visual features.
(2023-10-08T07:42:41Z)
- DEFT: A new distance-based feature set for keystroke dynamics [1.8796659304823702]
We propose a new set of features based on the distance between keys on the keyboard, a concept that has not been considered before in keystroke dynamics.
We build a DEFT model by combining DEFT features with other previously used keystroke dynamic features.
The DEFT model is designed to be device-agnostic, allowing us to evaluate its effectiveness across three commonly used devices.
(2023-10-06T07:26:40Z)
- Bridging the Domain Gaps in Context Representations for k-Nearest Neighbor Neural Machine Translation [57.49095610777317]
k-Nearest neighbor machine translation (kNN-MT) has attracted increasing attention due to its ability to non-parametrically adapt to new translation domains.
We propose a novel approach to boost the datastore retrieval of kNN-MT by reconstructing the original datastore.
Our method can effectively boost the datastore retrieval and translation quality of kNN-MT.
(2023-05-26T03:04:42Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
(2023-05-17T14:30:11Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
(2022-12-08T18:59:57Z)
- Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
Our experiments consistently and significantly outperform all the state-of-the-art methods on multiple standard video-based re-ID datasets.
(2021-03-16T12:22:08Z)
- Suppress and Balance: A Simple Gated Network for Salient Object Detection [89.88222217065858]
We propose a simple gated network (GateNet) to solve both issues at once.
With the help of multilevel gate units, the valuable context information from the encoder can be optimally transmitted to the decoder.
In addition, we adopt the atrous spatial pyramid pooling based on the proposed "Fold" operation (Fold-ASPP) to accurately localize salient objects of various scales.
(2020-07-16T02:00:53Z)
- Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders [9.923470453197657]
We focus on leveraging adversarial regularization as a flexible and natural means to imbue variational autoencoders with context information.
We introduce the first Music Adversarial Autoencoder (MusAE).
Our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders.
(2020-01-15T18:07:20Z)