Fine-Grained Image Generation from Bangla Text Description using
Attentional Generative Adversarial Network
- URL: http://arxiv.org/abs/2109.11749v1
- Date: Fri, 24 Sep 2021 05:31:01 GMT
- Title: Fine-Grained Image Generation from Bangla Text Description using
Attentional Generative Adversarial Network
- Authors: Md Aminul Haque Palash, Md Abdullah Al Nasim, Aditi Dhali, Faria Afrin
- Abstract summary: We propose Bangla Attentional Generative Adversarial Network (AttnGAN) that allows intensified, multi-stage processing for high-resolution Bangla text-to-image generation.
For the first time, a fine-grained image is generated from Bangla text using attentional GAN.
- Score: 0.0
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Generating fine-grained, realistic images from text has many applications in
the visual and semantic realm. Considering that, we propose a Bangla Attentional
Generative Adversarial Network (AttnGAN) that allows intensified, multi-stage
processing for high-resolution Bangla text-to-image generation. Our model can
integrate the most specific details at different sub-regions of the image by
concentrating on the relevant words in the natural language description. This
framework achieves a better inception score on the CUB dataset. For the first
time, a fine-grained image is generated from Bangla text using an attentional
GAN. Bangla ranks 7th among the 100 most spoken languages, which motivates us to
focus explicitly on this language and serve the needs of its many speakers.
Moreover, Bangla has a more complex syntactic structure and fewer natural
language processing resources, which further validates our work.
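As a rough illustration of the word-level attention at the core of AttnGAN, the following PyTorch sketch computes a word-context vector for every image sub-region. The tensor shapes and the `word_attention` helper are illustrative assumptions, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

def word_attention(region_feats: torch.Tensor, word_embs: torch.Tensor) -> torch.Tensor:
    """Minimal sketch of AttnGAN-style word-level attention.

    region_feats: (B, D, N) image sub-region features (N = H * W sub-regions)
    word_embs:    (B, D, T) word embeddings of the (Bangla) caption (T words)
    Returns a word-context vector for every sub-region: (B, D, N).
    """
    # Similarity of every word to every sub-region: (B, T, N)
    scores = torch.bmm(word_embs.transpose(1, 2), region_feats)
    # For each sub-region, attend over the words of the description.
    attn = F.softmax(scores, dim=1)
    # Weighted sum of word embeddings per sub-region: (B, D, T) x (B, T, N) -> (B, D, N)
    return torch.bmm(word_embs, attn)

ctx = word_attention(torch.randn(2, 256, 64), torch.randn(2, 256, 12))
print(ctx.shape)  # torch.Size([2, 256, 64])
```

In the multi-stage setup the abstract describes, a context vector of this kind conditions each successive generator stage, so different sub-regions can draw on different words of the description.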
Related papers
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to do the task.
We conduct a human evaluation of translated images to assess for cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z)
- LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation [121.45667242282721]
We propose a coarse-to-fine paradigm to achieve layout planning and image generation.
Our proposed method outperforms the state-of-the-art models in terms of photorealistic layout and image generation.
arXiv Detail & Related papers (2023-08-09T17:45:04Z)
- PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z)
- GlyphDiffusion: Text Generation as Image Generation [100.98428068214736]
We propose GlyphDiffusion, a novel diffusion approach for text generation via text-guided image generation.
Our key idea is to render the target text as a glyph image containing visual language content.
Our model also makes significant improvements compared to recent diffusion models.
arXiv Detail & Related papers (2023-04-25T02:14:44Z)
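GlyphDiffusion's key idea, rendering the target text as a glyph image, can be illustrated with a minimal Pillow sketch. The canvas size, font handling, and the `render_glyph_image` helper are assumptions for illustration, not the paper's pipeline.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_glyph_image(text: str, size=(256, 256)) -> Image.Image:
    """Render target text as a black-on-white glyph image (hypothetical helper)."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a proper TrueType font would be used in practice
    wrapped = textwrap.fill(text, width=28)  # crude wrap to fit the canvas
    draw.multiline_text((8, 8), wrapped, fill="black", font=font)
    return img

glyph = render_glyph_image("a small bird with yellow wings and a short beak")
glyph.save("glyph.png")
```

A diffusion model would then be trained to produce such glyph images conditioned on the input, turning text generation into image generation.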
- Indonesian Text-to-Image Synthesis with Sentence-BERT and FastGAN [0.0]
We use Sentence-BERT as the text encoder and FastGAN as the image generator.
We translate the CUB dataset into Bahasa Indonesia using Google Translate and manually with human translators.
FastGAN uses many skip-layer excitation modules and an auto-encoder to generate images at a resolution of 512x512x3, twice as large as the current state-of-the-art model.
arXiv Detail & Related papers (2023-03-25T16:54:22Z)
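A minimal sketch of the text-encoding half of that pipeline, using the sentence-transformers library; the multilingual checkpoint name is an assumption, and the FastGAN generator itself is omitted.

```python
from sentence_transformers import SentenceTransformer

# Hypothetical checkpoint choice; the paper's exact Sentence-BERT model may differ.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

caption = "seekor burung kecil dengan sayap kuning"  # an Indonesian CUB-style caption
embedding = encoder.encode(caption)  # fixed-size sentence vector fed to the generator
print(embedding.shape)
```

The sentence embedding serves as the conditioning vector that the GAN generator maps to an image.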
- Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z)
- Incongruity Detection between Bangla News Headline and Body Content through Graph Neural Network [0.0]
Incongruity between news headlines and body content is a common method of deception used to attract readers.
We propose a graph-based hierarchical dual encoder model that learns the content similarity and contradiction between Bangla news headlines and content paragraphs effectively.
The proposed Bangla graph-based neural network model achieves above 90% accuracy on various Bangla news datasets.
arXiv Detail & Related papers (2022-10-26T20:57:45Z)
- BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [0.5893124686141781]
Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset BAN-Cap following the widely used Flickr8k dataset, where we collect Bangla captions of the images provided by qualified annotators.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
arXiv Detail & Related papers (2022-05-28T15:39:09Z)
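A hedged sketch of how Contextualized Word Replacement might be implemented with a masked language model; the model choice and the `cwr_augment` helper are assumptions, since BAN-Cap's exact procedure is not given here.

```python
import random
from transformers import pipeline

# Hypothetical setup: a multilingual masked LM (covers Bangla) proposes
# in-context substitutes for a randomly masked word in the caption.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

def cwr_augment(caption: str) -> str:
    """Replace one random word with a contextually plausible alternative."""
    tokens = caption.split()
    i = random.randrange(len(tokens))
    original = tokens[i]
    tokens[i] = fill.tokenizer.mask_token
    # Pick the top candidate that differs from the original word.
    for cand in fill(" ".join(tokens)):
        if cand["token_str"].strip() != original:
            tokens[i] = cand["token_str"].strip()
            return " ".join(tokens)
    tokens[i] = original  # no distinct candidate; keep the caption unchanged
    return " ".join(tokens)
```

Augmented captions produced this way can be mixed into training data to make a captioning model more robust to wording variation.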
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Visually Grounded Reasoning across Languages and Cultures [27.31020761908739]
We develop a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures.
We focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish.
We create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images.
arXiv Detail & Related papers (2021-09-28T16:51:38Z)
- TextMage: The Automated Bangla Caption Generator Based On Deep Learning [1.2330326247154968]
TextMage is a system that can understand visual scenes belonging to the Bangladeshi geographical context.
Its dataset contains 9,154 images along with two annotations for each image.
arXiv Detail & Related papers (2020-10-15T23:24:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.