The artificial synesthete: Image-melody translations with variational
autoencoders
- URL: http://arxiv.org/abs/2112.02953v1
- Date: Mon, 6 Dec 2021 11:54:13 GMT
- Title: The artificial synesthete: Image-melody translations with variational
autoencoders
- Authors: Karl Wienand, Wolfgang M. Heckl
- Abstract summary: A network learns a set of correspondences between musical and visual concepts from repeated joint exposure.
The resulting "artificial synesthete" generates simple melodies inspired by images, and images from music.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This project presents a system of neural networks to translate
between images and melodies. Autoencoders compress the information in samples
to abstract representations. A translation network learns a set of
correspondences between musical and visual concepts from repeated joint
exposure. The resulting "artificial synesthete" generates simple melodies
inspired by images, and images from music. These are novel interpretations
(not transposed data), expressing the machine's perception and understanding.
Observing the work, one explores the machine's perception and thus, by
contrast, one's own.
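As a rough illustration of the described architecture, here is a minimal
PyTorch sketch: one variational autoencoder per modality, plus a small
translation network mapping between their latent spaces. All module names,
dimensions, and data shapes are hypothetical stand-ins, not the authors'
implementation.

```python
# Minimal sketch of the VAE + translation-network architecture (assumptions:
# flattened image/melody vectors, hypothetical sizes; not the paper's code).
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Compresses a sample to a latent code and reconstructs it."""
    def __init__(self, in_dim, latent_dim):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim), nn.Sigmoid())

    def encode(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize

    def decode(self, z):
        return self.dec(z)

# Hypothetical sizes: flattened 64x64 images, 512-dim melody encodings.
image_vae = VAE(in_dim=64 * 64, latent_dim=16)
melody_vae = VAE(in_dim=512, latent_dim=16)

# The translator maps image latents to melody latents; a twin network
# (omitted) would map the other way.
img_to_mel = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))

# Image -> melody: encode an image, translate its code, decode as a melody.
img = torch.rand(1, 64 * 64)
melody = melody_vae.decode(img_to_mel(image_vae.encode(img)))
```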
Related papers
- Interpreting Graphic Notation with MusicLDM: An AI Improvisation of Cornelius Cardew's Treatise [4.9485163144728235]
This work presents a novel method for composing and improvising music inspired by Cornelius Cardew's Treatise.
By leveraging OpenAI's ChatGPT to interpret the abstract visual elements of Treatise, we convert these graphical images into descriptive textual prompts.
These prompts are then input into MusicLDM, a pre-trained latent diffusion model designed for music generation.
arXiv Detail & Related papers (2024-12-12T05:08:36Z)
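A hedged sketch of this two-stage pipeline, assuming the MusicLDM
implementation in Hugging Face diffusers (checkpoint ucsd-reach/musicldm);
the ChatGPT step is replaced by a fixed example prompt, since the paper's
actual prompting of the Treatise graphics is not reproduced here.

```python
# Sketch only: a text prompt (in the paper, produced by ChatGPT from a page
# of Cardew's Treatise) is fed to a pretrained MusicLDM diffusion model.
import torch
from diffusers import MusicLDMPipeline  # assumes diffusers >= 0.21

pipe = MusicLDMPipeline.from_pretrained(
    "ucsd-reach/musicldm", torch_dtype=torch.float16
).to("cuda")

# Hypothetical stand-in for ChatGPT's description of the graphic notation.
prompt = "sparse pointillistic piano gestures, long silences, slow glissandi"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]
```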
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- Multimodal Neurons in Pretrained Text-Only Transformers [52.20828443544296]
We identify "multimodal neurons" that convert visual representations into corresponding text.
We show that multimodal neurons operate on specific visual concepts across inputs, and have a systematic causal effect on image captioning.
arXiv Detail & Related papers (2023-08-03T05:27:12Z)
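As a simplified, text-only illustration of one analysis step from this line
of work (not the authors' full attribution pipeline), the sketch below
decodes a single GPT-2 MLP neuron by projecting its output weights through
the unembedding matrix to read off the tokens it promotes. The layer and
neuron indices are arbitrary examples.

```python
# Decode one MLP neuron's output direction into vocabulary space.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")

layer, neuron = 8, 1234                    # arbitrary indices for the demo
# GPT-2's MLP projection weight has shape (intermediate, hidden), so row
# `neuron` is that neuron's contribution to the residual stream.
w_out = model.transformer.h[layer].mlp.c_proj.weight[neuron]
logits = model.lm_head.weight @ w_out      # project into vocabulary space
top = torch.topk(logits, 10).indices
print([tok.decode([i]) for i in top.tolist()])  # tokens this neuron promotes
```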
- Advances in Neural Rendering [115.05042097988768]
This report focuses on methods that combine classical rendering with learned 3D scene representations.
A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene.
In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects.
arXiv Detail & Related papers (2021-11-10T18:57:01Z)
- Controlled Caption Generation for Images Through Adversarial Attacks [85.66266989600572]
We study adversarial examples for vision-and-language models, which typically adopt a Convolutional Neural Network (CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation.
In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network.
We propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN.
arXiv Detail & Related papers (2021-07-07T07:22:41Z)
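The paper's algorithm is GAN-based; as a simpler sketch of the same
underlying objective (perturbing an image so the CNN encoder's internal
representation mimics that of a target), here is a hypothetical PGD-style
attack. The encoder choice and hyperparameters are illustrative assumptions.

```python
# PGD-style feature-mimicking attack (a sketch, not the paper's GAN method).
import torch
import torch.nn.functional as F
import torchvision.models as models

encoder = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()  # expose pooled features fed to a captioner
encoder.eval()

def feature_mimic_attack(x, x_target, eps=8/255, alpha=1/255, steps=40):
    with torch.no_grad():
        target_feat = encoder(x_target)   # internal representation to mimic
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.mse_loss(encoder(x + delta), target_feat)
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()  # descend: match the features
            delta.clamp_(-eps, eps)             # keep perturbation bounded
            delta.grad.zero_()
    return (x + delta).clamp(0, 1).detach()

x = torch.rand(1, 3, 224, 224)       # stand-in "clean" image
x_tgt = torch.rand(1, 3, 224, 224)   # stand-in target whose features we mimic
x_adv = feature_mimic_attack(x, x_tgt)
```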
- Analogical Reasoning for Visually Grounded Language Acquisition [55.14286413675306]
Children acquire language subconsciously by observing the surrounding world and listening to descriptions.
In this paper, we bring this ability to AI, by studying the task of Visually grounded Language Acquisition.
We propose a multimodal transformer model augmented with a novel mechanism for analogical reasoning.
arXiv Detail & Related papers (2020-07-22T20:51:58Z)
- Embeddings as representation for symbolic music [0.0]
A representation technique that allows encoding music in a way that contains musical meaning would improve the results of any model trained for computer music tasks.
In this paper, we experiment with embeddings to represent musical notes from 3 different variations of a dataset and analyze whether the model can capture useful musical patterns.
arXiv Detail & Related papers (2020-05-19T13:04:02Z)
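As an illustration of the general idea behind this last entry (not the
paper's exact setup), here is a minimal skip-gram-style sketch that learns
embeddings for MIDI note numbers from co-occurring notes; the vocabulary,
dimensions, and toy training step are assumptions.

```python
# Skip-gram-style note embeddings: predict a context note from a center note.
import torch
import torch.nn as nn

VOCAB = 128  # MIDI pitches 0-127

class NoteSkipGram(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, dim)  # learned note embeddings
        self.out = nn.Linear(dim, VOCAB)     # scores over context notes

    def forward(self, center):
        return self.out(self.emb(center))    # logits over the note vocabulary

model = NoteSkipGram()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One toy step: center note 60 (middle C) with context note 64 (E),
# as would co-occur in a C-major triad.
center, context = torch.tensor([60]), torch.tensor([64])
loss = loss_fn(model(center), context)
loss.backward()
opt.step()
```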
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.