Indonesian Text-to-Image Synthesis with Sentence-BERT and FastGAN
- URL: http://arxiv.org/abs/2303.14517v1
- Date: Sat, 25 Mar 2023 16:54:22 GMT
- Title: Indonesian Text-to-Image Synthesis with Sentence-BERT and FastGAN
- Authors: Made Raharja Surya Mahadi and Nugraha Priya Utama
- Abstract summary: We use Sentence BERT as the text encoder and FastGAN as the image generator.
We translate the CUB dataset into Bahasa Indonesia using Google Translate and manual human translation.
FastGAN uses many skip-layer excitation modules and an auto-encoder to generate images at 512x512x3 resolution, twice as large as that of the current state-of-the-art model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current text-to-image synthesis uses a text encoder and image generator
architecture. Research on this topic is challenging because of the domain gap
between natural language and vision. Most current research focuses only on
producing photo-realistic images, while the other domain, the language,
receives less attention. Much of the current research uses English as the
input text, although there are many other languages around the world. Bahasa
Indonesia, the official language of Indonesia, is quite popular and has been
taught in the Philippines, Australia, and Japan. Translating a dataset into
another language, or recreating it, with good quality is costly. Research in
this domain is necessary because we need to examine how the image generator
performs in other languages, in addition to how well it generates
photo-realistic images. To achieve this, we translate the CUB dataset into
Bahasa Indonesia using Google Translate and manual human translation. We use
Sentence-BERT as the text encoder and FastGAN as the image generator. FastGAN
uses many skip-layer excitation modules and an auto-encoder to generate images
at 512x512x3 resolution, twice as large as that of the current state-of-the-art
model (Zhang, Xu, Li, Zhang, Wang, Huang and Metaxas, 2019). We obtain an
Inception Score of 4.76 ± 0.43 and a Fréchet Inception Distance of 46.401,
which are comparable with current English text-to-image generation models. The
mean opinion score is 3.22 out of 5, which means the generated images are
acceptable to humans. Link to source code:
https://github.com/share424/Indonesian-Text-to-Image-synthesis-with-Sentence-BERT-and-FastGAN
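The abstract reports an Inception Score as one of its quality metrics. As a rough illustration of what that number measures, here is a minimal sketch of the standard formula, IS = exp(mean over images of KL(p(y|x) || p(y))). It is not taken from the authors' repository. The function name `inception_score` and the toy probability tables are hypothetical. In a real evaluation the rows would be softmax outputs of a pretrained Inception classifier on generated images, and the score is commonly averaged over several splits, which is presumably where a spread such as ± 0.43 comes from.

```python
import math


def inception_score(class_probs):
    """Inception Score from per-image class probabilities p(y|x).

    class_probs: list of rows, each a probability distribution over classes.
    Returns exp(mean_x KL(p(y|x) || p(y))), where p(y) is the average row.
    """
    n = len(class_probs)
    num_classes = len(class_probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(row[k] for row in class_probs) / n for k in range(num_classes)]
    total_kl = 0.0
    for row in class_probs:
        # KL divergence of this image's prediction from the marginal;
        # classes with p(y|x) = 0 contribute nothing to the sum.
        total_kl += sum(p * math.log(p / q) for p, q in zip(row, marginal) if p > 0)
    return math.exp(total_kl / n)


# Toy data (assumed, for illustration only): three images, three classes.
confident_and_diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uninformative = [[1 / 3, 1 / 3, 1 / 3]] * 3

print(inception_score(confident_and_diverse))
print(inception_score(uninformative))
```

The score is bounded below by 1 (every image gets the same prediction) and above by the number of classes (each image is classified confidently and the classes are covered evenly), so higher values indicate images that are both recognizable and varied.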
Related papers
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale than comparable datasets while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding [57.22231959529641]
Hunyuan-DiT is a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese.
For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images.
arXiv Detail & Related papers (2024-05-14T16:33:25Z)
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to do the task.
We conduct a human evaluation of translated images to assess for cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z)
- Learning to Imagine: Visually-Augmented Natural Language Generation [73.65760028876943]
We propose a method to make pre-trained language models (PLMs) Learn to Imagine for Visually-augmented natural language gEneration.
We use a diffusion model to synthesize high-quality images conditioned on the input texts.
We conduct synthesis for each sentence rather than generate only one image for an entire paragraph.
arXiv Detail & Related papers (2023-05-26T13:59:45Z)
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z)
- Text to Image Generation: Leaving no Language Behind [6.243995448840211]
We study how the performance of three popular text-to-image generators depends on the language.
The results show that there is a significant performance degradation when using languages other than English.
This is fundamental to ensure that this new technology can be used by non-native English speakers.
arXiv Detail & Related papers (2022-08-19T13:24:56Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs [8.26410341981427]
We study how to ensure that generated samples are believable, realistic or natural.
We present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture.
arXiv Detail & Related papers (2022-02-25T20:00:33Z)
- Fine-Grained Image Generation from Bangla Text Description using Attentional Generative Adversarial Network [0.0]
We propose Bangla Attentional Generative Adversarial Network (AttnGAN) that allows intensified, multi-stage processing for high-resolution Bangla text-to-image generation.
For the first time, a fine-grained image is generated from Bangla text using attentional GAN.
arXiv Detail & Related papers (2021-09-24T05:31:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.