Audio-to-Image Cross-Modal Generation
- URL: http://arxiv.org/abs/2109.13354v1
- Date: Mon, 27 Sep 2021 21:25:31 GMT
- Title: Audio-to-Image Cross-Modal Generation
- Authors: Maciej Żelaszczyk and Jacek Mańdziuk
- Abstract summary: Cross-modal representation learning makes it possible to integrate information from different modalities into one representation.
We train variational autoencoders (VAEs) to reconstruct image archetypes from audio data.
Our results suggest that even when the generated images are relatively inconsistent (diverse), features that are critical for proper image classification are preserved.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-modal representation learning makes it possible to integrate information from different modalities into one representation. At the same time, research on generative models tends to focus on the visual domain, with less emphasis on other domains such as audio or text, potentially missing the benefits of shared representations. Studies successfully linking more than one modality in the generative setting are rare. In this context, we verify the possibility of training variational autoencoders (VAEs) to reconstruct image archetypes from audio data. Specifically, we consider VAEs in an adversarial training framework in order to ensure more variability in the generated data, and find that there is a trade-off between the consistency and diversity of the generated images; this trade-off can be governed by scaling the reconstruction loss up or down, respectively. Our results further suggest that even when the generated images are relatively inconsistent (diverse), features that are critical for proper image classification are preserved.
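The trade-off described above can be made concrete in a short sketch of the training objective. The snippet below is a minimal, hypothetical PyTorch formulation of a VAE trained in an adversarial framework; the weight `lambda_rec` (an assumed name, not from the paper) scales the reconstruction term, with larger values favoring consistency with the image archetype and smaller values favoring diversity.

```python
import torch
import torch.nn.functional as F

def vae_gan_losses(decoder_out, target_img, mu, logvar,
                   disc_fake_logits, lambda_rec=1.0):
    """Hypothetical VAE-GAN objective: lambda_rec scales the reconstruction
    term, trading consistency (high lambda_rec) for diversity (low)."""
    # Reconstruction: how closely the decoded image matches the archetype.
    recon = F.mse_loss(decoder_out, target_img, reduction="mean")
    # KL divergence of the approximate posterior N(mu, sigma^2) from N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Non-saturating generator loss: fool the discriminator on generated data.
    adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return lambda_rec * recon + kl + adv
```

In practice the discriminator would be updated with the usual real/fake objective in an alternating step; only the generator-side loss is shown here.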
Related papers
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy for bolstering image classification performance is to augment the training set with synthetic images generated by T2I models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
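As a rough illustration of inter-class augmentation, the sketch below translates each image toward a randomly chosen different class and assigns a correspondingly softened label. `translate_fn` is a placeholder for the paper's diffusion-based translation model and `strength` is an assumed knob; this is a simplified stand-in, not Diff-Mix itself.

```python
import random
import torch

def inter_class_augment(images, labels, translate_fn, num_classes,
                        strength=0.7):
    """Sketch of inter-class data augmentation: each image is translated
    toward a random other class; labels are softened to match."""
    new_images, new_labels = [], []
    for img, y in zip(images, labels):
        target = random.choice([c for c in range(num_classes) if c != int(y)])
        new_images.append(translate_fn(img, target_class=target,
                                       strength=strength))
        # Soft label reflecting the partial translation between two classes.
        soft = torch.zeros(num_classes)
        soft[int(y)] = 1.0 - strength
        soft[target] = strength
        new_labels.append(soft)
    return torch.stack(new_images), torch.stack(new_labels)
```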
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
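A minimal sketch of the batch-level attention idea, assuming per-image token features of shape `(B, N, D)`: all tokens in the batch are folded into one sequence so that images can attend to each other. Layer sizes and the residual arrangement are illustrative assumptions, not CricaVPR's exact design.

```python
import torch
import torch.nn as nn

class CrossImageAttention(nn.Module):
    """Tokens from every image in a batch attend to each other, so each
    global descriptor is informed by the other images in the batch."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                 # feats: (B, N, D)
        b, n, d = feats.shape
        tokens = feats.reshape(1, b * n, d)   # fold batch into one sequence
        out, _ = self.attn(tokens, tokens, tokens)
        out = self.norm(out + tokens)         # residual + norm
        return out.reshape(b, n, d)
```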
arXiv Detail & Related papers (2024-02-29T15:05:11Z)
- InVA: Integrative Variational Autoencoder for Harmonization of Multi-modal Neuroimaging Data [3.792342522967013]
This article proposes a novel approach, referred to as the Integrative Variational Autoencoder (InVA), which borrows information from multiple images obtained from different sources to draw predictive inference about an image.
Numerical results demonstrate substantial advantages of InVA over VAEs, which typically do not allow borrowing information between input images.
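A rough sketch of the information-borrowing idea, assuming flattened image features: encodings from several sources are pooled into a single posterior over a shared latent. This is a simplified stand-in for InVA with illustrative layer sizes, not the paper's exact model.

```python
import torch
import torch.nn as nn

class MultiSourceVAEEncoder(nn.Module):
    """Pools per-source encodings into one posterior over a shared latent,
    so every input source informs the inference for the target image."""
    def __init__(self, in_dim=1024, latent_dim=64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)

    def forward(self, source_imgs):           # list of (B, in_dim) tensors
        # Average per-source encodings to borrow information across inputs.
        h = torch.stack([self.encode(x) for x in source_imgs]).mean(dim=0)
        return self.to_mu(h), self.to_logvar(h)
```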
arXiv Detail & Related papers (2024-02-05T05:26:17Z)
- Unlocking Pre-trained Image Backbones for Semantic Image Synthesis [29.688029979801577]
We propose a new class of GAN discriminators for semantic image synthesis, built on pre-trained image backbones, that enables the generation of highly realistic images.
Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes.
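A hedged sketch of a discriminator built on a frozen pre-trained backbone: the choice of ResNet-50 and the 1x1 per-pixel head are assumptions for illustration, not the paper's exact architecture.

```python
import torch.nn as nn
import torchvision.models as tvm

class BackboneDiscriminator(nn.Module):
    """Discriminator reusing frozen pre-trained features, with a small
    trainable head producing per-pixel real/fake-per-class logits."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = tvm.resnet50(weights=tvm.ResNet50_Weights.DEFAULT)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        for p in self.features.parameters():
            p.requires_grad = False           # keep pre-trained features frozen
        self.head = nn.Conv2d(2048, num_classes + 1, kernel_size=1)

    def forward(self, img):
        return self.head(self.features(img))  # (B, C+1, H/32, W/32)
```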
arXiv Detail & Related papers (2023-12-20T09:39:19Z)
- Traditional Classification Neural Networks are Good Generators: They are Competitive with DDPMs and GANs [104.72108627191041]
We show that conventional neural network classifiers can generate high-quality images comparable to state-of-the-art generative models.
We propose a mask-based reconstruction module that exploits semantic gradients to synthesize plausible images.
We show that our method also extends to text-to-image generation when combined with image-text foundation models.
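One common way to realize "classifiers as generators" is activation maximization, sketched below: pixels are optimized by gradient ascent on a target-class logit. The paper's mask-based reconstruction module is omitted; this is an assumed simplification, not the authors' exact method.

```python
import torch

def synthesize_from_classifier(classifier, target_class, steps=200, lr=0.1,
                               shape=(1, 3, 224, 224)):
    """Optimize input pixels so the classifier's target-class logit rises,
    turning classifier gradients into an image synthesizer."""
    img = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = classifier(img)
        loss = -logits[0, target_class]   # ascend the class logit
        loss.backward()                   # semantic gradients w.r.t. pixels
        opt.step()
    return img.detach().clamp(0, 1)
```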
arXiv Detail & Related papers (2022-11-27T11:25:35Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
We present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a model that grounds text-to-image generation in retrieved image-text pairs to improve fidelity for rare or unseen entities.
arXiv Detail & Related papers (2022-09-29T00:57:28Z)
- Reinforcing Generated Images via Meta-learning for One-Shot Fine-Grained Visual Recognition [36.02360322125622]
We propose a meta-learning framework to combine generated images with original images, so that the resulting "hybrid" training images improve one-shot learning.
Our experiments demonstrate consistent improvement over baselines on one-shot fine-grained image classification benchmarks.
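A minimal sketch of the hybrid-image idea: generated and original images are blended with a learnable weight that an outer meta-learning loop would tune against one-shot accuracy. The blending form and parameterization are assumptions for illustration, not the paper's framework.

```python
import torch
import torch.nn as nn

class HybridImageBlender(nn.Module):
    """Blends a generated image with the original shot via a learnable
    weight; a meta-learning outer loop would optimize this weight."""
    def __init__(self):
        super().__init__()
        self.logit_w = nn.Parameter(torch.zeros(1))  # pre-sigmoid mixing weight

    def forward(self, original, generated):
        w = torch.sigmoid(self.logit_w)              # keep weight in (0, 1)
        return w * generated + (1.0 - w) * original
```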
arXiv Detail & Related papers (2022-04-22T13:11:05Z)
- Cross-Modality Sub-Image Retrieval using Contrastive Multimodal Image Representations [3.3754780158324564]
Cross-modality image retrieval is challenging, since images of similar (or even the same) content captured by different modalities might share few common structures.
We propose a new application-independent content-based image retrieval system for reverse (sub-)image search across modalities.
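Retrieval in such a system reduces to nearest-neighbor search in the shared contrastive embedding space. The sketch below assumes pre-computed embeddings for both modalities; the contrastively trained encoders themselves are omitted.

```python
import torch
import torch.nn.functional as F

def cross_modal_retrieve(query_feats, gallery_feats, k=5):
    """Match sub-images across modalities by cosine similarity in the
    shared embedding space learned contrastively."""
    q = F.normalize(query_feats, dim=-1)    # (Q, D) query-modality embeddings
    g = F.normalize(gallery_feats, dim=-1)  # (G, D) gallery-modality embeddings
    sims = q @ g.T                          # cosine similarity matrix
    return sims.topk(k, dim=-1).indices     # top-k gallery matches per query
```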
arXiv Detail & Related papers (2022-01-10T19:04:28Z)
- Ensembling with Deep Generative Views [72.70801582346344]
Generative models can synthesize "views" of artificial images that mimic real-world variations, such as changes in color or pose.
Here, we investigate whether such views can be applied to real images to benefit downstream analysis tasks such as image classification.
We use StyleGAN2 as the source of generative augmentations and investigate this setup on classification tasks involving facial attributes, cat faces, and cars.
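A minimal sketch of the ensembling step, assuming a `view_fn` that produces generator-derived variants (e.g., decoded from perturbed StyleGAN2 latents; the interface is an assumption): predictions over the views are simply averaged.

```python
import torch

@torch.no_grad()
def ensemble_over_views(classifier, image, view_fn, num_views=8):
    """Average class probabilities over the original image and several
    generator-derived views of it."""
    views = [image] + [view_fn(image) for _ in range(num_views)]
    probs = torch.stack([classifier(v).softmax(dim=-1) for v in views])
    return probs.mean(dim=0)   # ensembled class probabilities
```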
arXiv Detail & Related papers (2021-04-29T17:58:35Z)
- Generating Annotated High-Fidelity Images Containing Multiple Coherent Objects [10.783993190686132]
We propose a multi-object generation framework that can synthesize images with multiple objects without explicitly requiring contextual information.
We demonstrate how coherency and fidelity are preserved with our method through experiments on the Multi-MNIST and CLEVR datasets.
arXiv Detail & Related papers (2020-06-22T11:33:55Z)
- Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently used VGG feature-matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
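A rough sketch of a dense combination of dilated convolutions: parallel branches with increasing dilation rates enlarge the receptive field and are fused back together. Channel counts and rates are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DenseDilatedBlock(nn.Module):
    """Parallel dilated convolutions aggregate multi-scale context for
    inpainting; outputs are fused with a 1x1 conv and a residual add."""
    def __init__(self, channels=64, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in rates)
        self.fuse = nn.Conv2d(channels * len(rates), channels, 1)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return x + self.fuse(out)   # residual fusion of multi-scale context
```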
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.