Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for
Speech-to-Image Generation
- URL: http://arxiv.org/abs/2305.10126v1
- Date: Wed, 17 May 2023 11:12:07 GMT
- Title: Fusion-S2iGan: An Efficient and Effective Single-Stage Framework for
Speech-to-Image Generation
- Authors: Zhenxing Zhang and Lambert Schomaker
- Abstract summary: The goal of a speech-to-image transform is to produce a photo-realistic picture directly from a speech signal.
We propose a single-stage framework called Fusion-S2iGan to yield perceptually plausible and semantically consistent image samples.
- Score: 8.26410341981427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of a speech-to-image transform is to produce a photo-realistic
picture directly from a speech signal. Recently, various studies have focused
on this task and have achieved promising performance. However, current
speech-to-image approaches are based on a stacked modular framework that
suffers from three vital issues: 1) Training separate networks is
time-consuming and inefficient, and the convergence of the final
generative model depends strongly on the previous generators; 2) The quality of
precursor images is ignored by this architecture; 3) Multiple discriminator
networks need to be trained. To this end, we propose an efficient and
effective single-stage framework called Fusion-S2iGan to yield perceptually
plausible and semantically consistent image samples on the basis of given
spoken descriptions. Fusion-S2iGan introduces a visual+speech fusion module
(VSFM), constructed with a pixel-attention module (PAM), a speech-modulation
module (SMM) and a weighted-fusion module (WFM), to inject the speech embedding
from a speech encoder into the generator while improving the quality of
synthesized pictures. Fusion-S2iGan spreads the bimodal information over all
layers of the generator network to reinforce the visual feature maps at various
hierarchical levels in the architecture. We conduct a series of experiments on
four benchmark data sets, i.e., CUB birds, Oxford-102, Flickr8k and
Places-subset. The experimental results demonstrate the superiority of the
presented Fusion-S2iGan over state-of-the-art models with a multi-stage
architecture, and a performance level close to that of traditional
text-to-image approaches.
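The abstract describes the architecture in words only, so below is a minimal PyTorch sketch of how a visual+speech fusion module of this kind could inject a speech embedding into generator feature maps. The module names (PAM, SMM, WFM) follow the abstract, but every layer definition here is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch only: layer sizes and exact operations are assumptions,
# not the published Fusion-S2iGan implementation.
import torch
import torch.nn as nn

class PixelAttention(nn.Module):          # PAM: weight each spatial location
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feat):
        attn = torch.sigmoid(self.conv(feat))                  # (B, 1, H, W)
        return feat * attn

class SpeechModulation(nn.Module):        # SMM: modulate features with the speech embedding
    def __init__(self, channels, speech_dim):
        super().__init__()
        self.gamma = nn.Linear(speech_dim, channels)
        self.beta = nn.Linear(speech_dim, channels)

    def forward(self, feat, speech_emb):
        g = self.gamma(speech_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        b = self.beta(speech_emb).unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + g) + b

class WeightedFusion(nn.Module):          # WFM: learn how much of each branch to keep
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, attended, modulated):
        return self.alpha * attended + (1 - self.alpha) * modulated

class VSFM(nn.Module):
    """One visual+speech fusion block; in a single-stage generator a block like
    this would be applied at every resolution level of the feature hierarchy,
    spreading the bimodal information over all layers as the abstract describes."""
    def __init__(self, channels, speech_dim):
        super().__init__()
        self.pam = PixelAttention(channels)
        self.smm = SpeechModulation(channels, speech_dim)
        self.wfm = WeightedFusion()

    def forward(self, feat, speech_emb):
        return self.wfm(self.pam(feat), self.smm(feat, speech_emb))

# Example: fuse a 256-dim speech embedding into a 64-channel feature map.
feat = torch.randn(2, 64, 32, 32)
speech_emb = torch.randn(2, 256)
out = VSFM(64, speech_dim=256)(feat, speech_emb)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```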
Related papers
- DiM-Gestor: Co-Speech Gesture Generation with Adaptive Layer Normalization Mamba-2 [6.6954598568836925]
DiM-Gestor is an end-to-end generative model leveraging the Mamba-2 architecture.
A fuzzy feature extractor and a speech-to-gesture mapping module are built on the Mamba-2 architecture.
Our approach delivers competitive results, reduces memory usage by approximately 2.4 times, and improves inference speed by 2 to 4 times.
arXiv Detail & Related papers (2024-11-23T08:02:03Z) - Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy for bolstering image classification performance is to augment the training set with synthetic images generated by text-to-image (T2I) models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
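For intuition, here is a hedged sketch of the general idea behind inter-class diffusion augmentation: translate a training image of one class toward another class with an off-the-shelf image-to-image diffusion pipeline. This is not the Diff-Mix implementation; the checkpoint, prompt template, and strength value are assumptions.

```python
# Sketch of inter-class augmentation via img2img diffusion (illustrative only;
# checkpoint, prompt, and strength are assumptions, not Diff-Mix's settings).
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("cardinal.jpg").convert("RGB")   # training image from class A
target_class = "blue jay"                            # translate toward class B

augmented = pipe(
    prompt=f"a photo of a {target_class}",
    image=source,
    strength=0.6,        # how far to move away from the source image
    guidance_scale=7.5,
).images[0]
augmented.save("cardinal_to_blue_jay.png")
```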
arXiv Detail & Related papers (2024-03-28T17:23:45Z) - Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and
Latent Diffusion [50.59261592343479]
We present Kandinsky, a novel exploration of the latent diffusion architecture.
The image prior model is trained separately to map text embeddings to CLIP image embeddings.
We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting.
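As a rough illustration of the image-prior idea (not Kandinsky's actual diffusion-based prior), a small network can be trained to map text embeddings to the corresponding CLIP image embeddings; the dimensions and MSE objective below are assumptions.

```python
# Toy sketch of a text-to-image-embedding prior; architecture and dimensions
# are assumptions, not Kandinsky's prior.
import torch
import torch.nn as nn

class TextToImagePrior(nn.Module):
    def __init__(self, text_dim=768, image_dim=768, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, image_dim),
        )

    def forward(self, text_emb):
        return self.net(text_emb)

prior = TextToImagePrior()
text_emb = torch.randn(4, 768)    # CLIP text embeddings of captions
image_emb = torch.randn(4, 768)   # CLIP image embeddings of the paired images
loss = nn.functional.mse_loss(prior(text_emb), image_emb)
loss.backward()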
arXiv Detail & Related papers (2023-10-05T12:29:41Z) - Unified Frequency-Assisted Transformer Framework for Detecting and
Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
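A brief sketch of the frequency-decomposition step: a 2-D discrete wavelet transform splits an image into LL/LH/HL/HH sub-bands, which an attention-based frequency encoder could then process within and across bands. The wavelet choice and single-level decomposition below are assumptions, not UFAFormer's exact configuration.

```python
# One-level 2-D DWT into four frequency sub-bands (illustrative settings).
import numpy as np
import pywt

image = np.random.rand(256, 256).astype(np.float32)   # stand-in grayscale face crop
LL, (LH, HL, HH) = pywt.dwt2(image, "haar")

# Stack sub-bands as channels/tokens for a downstream attention encoder.
sub_bands = np.stack([LL, LH, HL, HH])                 # (4, 128, 128)
print(sub_bands.shape)
```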
arXiv Detail & Related papers (2023-09-18T11:06:42Z) - DiffDis: Empowering Generative Diffusion Model with Cross-Modal
Discrimination Capability [75.9781362556431]
We propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.
We show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks.
arXiv Detail & Related papers (2023-08-18T05:03:48Z) - Image as a Foreign Language: BEiT Pretraining for All Vision and
Vision-Language Tasks [87.6494641931349]
We introduce a general-purpose multimodal foundation model BEiT-3.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z) - Semantic Image Synthesis via Diffusion Models [159.4285444680301]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks.
Recent work on semantic image synthesis mainly follows the de facto Generative Adversarial Nets (GANs).
arXiv Detail & Related papers (2022-06-30T18:31:51Z) - DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image
Generation [8.26410341981427]
The Dual Attention Generative Adversarial Network (DTGAN) can synthesize high-quality and semantically consistent images.
The proposed model introduces channel-aware and pixel-aware attention modules that can guide the generator to focus on text-relevant channels and pixels.
A new type of visual loss is utilized to enhance the image resolution by ensuring vivid shape and perceptually uniform color distributions of generated images.
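A minimal sketch of text-conditioned channel-aware and pixel-aware attention in the spirit described above; the exact formulations are assumptions, not DTGAN's published modules.

```python
# Illustrative text-conditioned channel and pixel attention (assumed forms).
import torch
import torch.nn as nn

class ChannelAwareAttention(nn.Module):
    def __init__(self, channels, text_dim):
        super().__init__()
        self.fc = nn.Linear(text_dim, channels)

    def forward(self, feat, text_emb):
        w = torch.sigmoid(self.fc(text_emb)).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return feat * w        # emphasise text-relevant channels

class PixelAwareAttention(nn.Module):
    def __init__(self, channels, text_dim):
        super().__init__()
        self.proj = nn.Conv2d(channels + text_dim, 1, kernel_size=1)

    def forward(self, feat, text_emb):
        b, _, h, w = feat.shape
        text_map = text_emb.unsqueeze(-1).unsqueeze(-1).expand(b, -1, h, w)
        attn = torch.sigmoid(self.proj(torch.cat([feat, text_map], dim=1)))  # (B, 1, H, W)
        return feat * attn     # emphasise text-relevant pixels

feat = torch.randn(2, 64, 16, 16)
text_emb = torch.randn(2, 256)
out = PixelAwareAttention(64, 256)(ChannelAwareAttention(64, 256)(feat, text_emb), text_emb)
print(out.shape)  # torch.Size([2, 64, 16, 16])
```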
arXiv Detail & Related papers (2020-11-05T08:57:15Z) - Efficient and Model-Based Infrared and Visible Image Fusion Via
Algorithm Unrolling [24.83209572888164]
Infrared and visible image fusion (IVIF) aims to obtain images that retain thermal radiation information from infrared images and texture details from visible images.
A model-based convolutional neural network (CNN) is proposed to overcome the shortcomings of traditional CNN-based IVIF models.
arXiv Detail & Related papers (2020-05-12T16:15:56Z)