Speak the Art: A Direct Speech to Image Generation Framework
- URL: http://arxiv.org/abs/2601.00827v1
- Date: Wed, 24 Dec 2025 10:49:00 GMT
- Title: Speak the Art: A Direct Speech to Image Generation Framework
- Authors: Mariam Saeed, Manar Amr, Farida Adel, Nada Hassan, Nour Walid, Eman Mohamed, Mohamed Hussein, Marwan Torki,
- Abstract summary: We introduce a framework called Speak the Art (STA), which consists of a speech encoding network and a VQ-Diffusion network conditioned on speech embeddings. To improve the speech embeddings, the speech encoding network is supervised by a large pre-trained image-text model during training. As a proof of concept, we trained our framework on two languages: English and Arabic.
- Score: 3.372751145910977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direct speech-to-image generation has recently shown promising results. However, compared to text-to-image generation, there is still a large gap to close. Current approaches tackle this task in two stages: a speech encoding network and an image generative adversarial network (GAN). The speech encoding networks in these approaches produce embeddings that do not capture sufficient linguistic information to semantically represent the input speech. GANs suffer from issues such as non-convergence, mode collapse, and diminished gradients, which result in unstable model parameters, limited sample diversity, and ineffective generator learning, respectively. To address these weaknesses, we introduce a framework called \textbf{Speak the Art (STA)}, which consists of a speech encoding network and a VQ-Diffusion network conditioned on speech embeddings. To improve the speech embeddings, the speech encoding network is supervised by a large pre-trained image-text model during training. Replacing GANs with diffusion leads to more stable training and the generation of diverse images. Additionally, we investigate the feasibility of extending our framework to be multilingual. As a proof of concept, we trained our framework on two languages: English and Arabic. Finally, we show that our results surpass state-of-the-art models by a large margin.
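The abstract says the speech encoder is supervised by a large pre-trained image-text model. The paper does not give the loss, so the following is only a minimal sketch of one plausible form of that supervision: a symmetric InfoNCE (CLIP-style) contrastive loss that pulls each speech embedding toward the frozen text-encoder embedding of its paired caption. The function name and temperature value are assumptions, not the authors' code.

```python
import numpy as np

def clip_supervision_loss(speech_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning speech embeddings with frozen
    image-text-model embeddings (a hypothetical CLIP-style supervision).
    speech_emb, text_emb: (batch, dim) arrays; row i of each is a pair."""
    # L2-normalize so the dot product is cosine similarity
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature          # (batch, batch) similarity matrix
    labels = np.arange(len(s))              # matching pairs lie on the diagonal

    def xent(l):
        # numerically stable cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the speech->text and text->speech directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Under this sketch, perfectly aligned speech/text pairs drive the loss toward zero, while mismatched pairs are penalized, which is the behavior one would want from the supervision step the abstract describes.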
Related papers
- Rhetorical Text-to-Image Generation via Two-layer Diffusion Policy Optimization [6.169979847774362]
Rhet2Pix is a framework that formulates rhetorical text-to-image generation as a multi-step policy optimization problem. Our model outperforms SOTA MLLMs such as GPT-4o and Grok-3, as well as leading academic baselines, across both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2025-05-28T19:03:37Z)
- Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach [3.89476785897726]
We introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features.
Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted.
arXiv Detail & Related papers (2024-06-02T23:51:43Z)
- SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z)
- Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens [87.52235889917223]
We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model.
With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases.
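The discretized speech units mentioned above are commonly obtained by k-means-quantizing the continuous features of a self-supervised speech model. A minimal sketch of that lookup step, assuming a pre-trained k-means codebook is already available (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def quantize_speech_features(features, codebook):
    """Map continuous self-supervised speech features to discrete unit ids
    by nearest-neighbour lookup in a k-means codebook.
    features: (T, dim) frame features; codebook: (K, dim) cluster centroids.
    Returns a (T,) array of integer unit ids."""
    # squared Euclidean distance from every frame to every code vector
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```

Each frame is replaced by the index of its nearest centroid, turning a variable-length feature sequence into a token sequence that a language-model-style decoder can emit.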
arXiv Detail & Related papers (2023-09-15T16:48:34Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
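The momentum model in the summary above is maintained as an exponential moving average (EMA) of the trained model's parameters, so the pseudo-targets it produces evolve smoothly. A minimal sketch of that update, with hypothetical parameter lists standing in for real model weights:

```python
import numpy as np

def momentum_update(student_params, teacher_params, m=0.995):
    """EMA update used in momentum distillation: the teacher that produces
    pseudo-targets is an exponential moving average of the student.
    Each argument is a list of same-shaped parameter arrays; returns the
    updated teacher parameters without modifying the inputs."""
    return [m * t + (1.0 - m) * s
            for s, t in zip(student_params, teacher_params)]
```

With `m` close to 1, the teacher tracks the student slowly, which keeps its pseudo-targets from oscillating with each gradient step.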
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- TIME: Text and Image Mutual-Translation Adversarial Networks [55.1298552773457]
We propose Text and Image Mutual-Translation Adversarial Networks (TIME).
TIME learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework.
In experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB and MS-COCO datasets.
arXiv Detail & Related papers (2020-05-27T06:40:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.