Testing chatbots on the creation of encoders for audio conditioned image generation
- URL: http://arxiv.org/abs/2509.09717v1
- Date: Tue, 09 Sep 2025 18:57:10 GMT
- Title: Testing chatbots on the creation of encoders for audio conditioned image generation
- Authors: Jorge E. León, Miguel Carrasco,
- Abstract summary: We explore whether state-of-the-art conversational agents can design effective audio encoders to replace the CLIP text encoder.<n>We created a public demo so everyone could study and try out these audio encoders.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On one hand, recent advances in chatbots has led to a rising popularity in using these models for coding tasks. On the other hand, modern generative image models primarily rely on text encoders to translate semantic concepts into visual representations, even when there is clear evidence that audio can be employed as input as well. Given the previous, in this work, we explore whether state-of-the-art conversational agents can design effective audio encoders to replace the CLIP text encoder from Stable Diffusion 1.5, enabling image synthesis directly from sound. We prompted five publicly available chatbots to propose neural architectures to work as these audio encoders, with a set of well-explained shared conditions. Each valid suggested encoder was trained on over two million context related audio-image-text observations, and evaluated on held-out validation and test sets using various metrics, together with a qualitative analysis of their generated images. Although almost all chatbots generated valid model designs, none achieved satisfactory results, indicating that their audio embeddings failed to align reliably with those of the original text encoder. Among the proposals, the Gemini audio encoder showed the best quantitative metrics, while the Grok audio encoder produced more coherent images (particularly, when paired with the text encoder). Our findings reveal a shared architectural bias across chatbots and underscore the remaining coding gap that needs to be bridged in future versions of these models. We also created a public demo so everyone could study and try out these audio encoders. Finally, we propose research questions that should be tackled in the future, and encourage other researchers to perform more focused and highly specialized tasks like this one, so the respective chatbots cannot make use of well-known solutions and their creativity/reasoning is fully tested.
Related papers
- FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks [12.446324804274628]
FocalCodec is an efficient low-bitrate based on focal modulation that utilizes a single binary codebook to compress speech.<n>Demo samples, code and checkpoints are available at https://lucadellalib.io/focalcodec-web/.
arXiv Detail & Related papers (2025-02-06T19:24:50Z) - LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z) - Large-scale unsupervised audio pre-training for video-to-speech
synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - Decoder-Only or Encoder-Decoder? Interpreting Language Model as a
Regularized Encoder-Decoder [75.03283861464365]
The seq2seq task aims at generating the target sequence based on the given input source sequence.
Traditionally, most of the seq2seq task is resolved by an encoder to encode the source sequence and a decoder to generate the target text.
Recently, a bunch of new approaches have emerged that apply decoder-only language models directly to the seq2seq task.
arXiv Detail & Related papers (2023-04-08T15:44:29Z) - Combining Automatic Speaker Verification and Prosody Analysis for
Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder.
arXiv Detail & Related papers (2022-10-31T11:03:03Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transfer the text features extracted from the text encoder to a mel-spectrogram with the help of VQ-VAE, and then the vocoder is used to transform the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs)
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Distilling Knowledge for Fast Retrieval-based Chat-bots [6.284464997330884]
We propose a new cross-encoder architecture and transfer knowledge from this model to a bi-encoder model using distillation.
This effectively boosts bi-encoder performance at no cost during inference time.
arXiv Detail & Related papers (2020-04-23T09:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.