TextMage: The Automated Bangla Caption Generator Based On Deep Learning
- URL: http://arxiv.org/abs/2010.08066v1
- Date: Thu, 15 Oct 2020 23:24:15 GMT
- Title: TextMage: The Automated Bangla Caption Generator Based On Deep Learning
- Authors: Abrar Hasin Kamal, Md. Asifuzzaman Jishan, and Nafees Mansoor
- Abstract summary: TextMage is a system that is capable of understanding visual scenes that belong to the Bangladeshi geographical context.
This dataset contains 9,154 images along with two annotations for each image.
- Score: 1.2330326247154968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks and deep learning have seen an upsurge of research in the
past decade due to their improved results. Generating text from a given image is
a crucial task that requires the combination of two fields, computer
vision and natural language processing, in order to understand an image and
represent it using natural language. However, existing works have all been
done on a particular linguistic domain and on the same sets of data. As a result,
the systems developed perform poorly on images that belong to other
locales' geographical contexts. TextMage is a system that is capable of
understanding visual scenes that belong to the Bangladeshi geographical context
and uses this knowledge to describe what it understands in Bengali. Hence, we
have trained a model on our previously developed and published dataset named
BanglaLekhaImageCaptions. This dataset contains 9,154 images along with two
annotations for each image. In order to assess performance, the proposed model
has been implemented and evaluated.
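The abstract does not spell out TextMage's architecture, but Bangla captioning systems of this period typically pair a pretrained CNN encoder with an LSTM decoder. Below is a minimal sketch of that encoder-decoder pattern in PyTorch; all class names, layer sizes, and the vocabulary size are illustrative assumptions, not details from the paper.

```python
# Hypothetical CNN-encoder / LSTM-decoder captioner (not TextMage's exact design).
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionGenerator(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Pretrained CNN, used as a frozen visual feature extractor.
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop fc head
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.img_proj = nn.Linear(2048, embed_dim)  # image feature -> word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) Bangla token ids
        feats = self.encoder(images).flatten(1)       # (B, 2048)
        img_emb = self.img_proj(feats).unsqueeze(1)   # (B, 1, E)
        word_emb = self.embed(captions)               # (B, T, E)
        # Feed the image embedding as the first "token" of the sequence.
        seq = torch.cat([img_emb, word_emb], dim=1)   # (B, T+1, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                       # (B, T+1, vocab) logits

model = CaptionGenerator(vocab_size=8000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 8000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 8000])
```

Training would minimize cross-entropy between these next-token logits and the annotated Bangla captions, one of the two annotations per image serving as the target sequence.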
Related papers
- ProGEO: Generating Prompts through Image-Text Contrastive Learning for Visual Geo-localization [0.0]
We propose a two-stage training method to enhance visual performance and use contrastive learning to mine challenging samples.
We validate the effectiveness of the proposed strategy on several large-scale visual geo-localization datasets.
arXiv Detail & Related papers (2024-06-04T02:28:51Z)
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to do the task.
We conduct a human evaluation of translated images to assess for cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z)
- BAN-Cap: A Multi-Purpose English-Bangla Image Descriptions Dataset [0.5893124686141781]
Resource-constrained languages like Bangla remain out of focus, predominantly due to a lack of standard datasets.
We present a new dataset BAN-Cap following the widely used Flickr8k dataset, where we collect Bangla captions of the images provided by qualified annotators.
We investigate the effect of text augmentation and demonstrate that an adaptive attention-based model combined with text augmentation using Contextualized Word Replacement (CWR) outperforms all state-of-the-art models for Bangla image captioning.
arXiv Detail & Related papers (2022-05-28T15:39:09Z)
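The BAN-Cap summary names Contextualized Word Replacement (CWR) as its augmentation technique but gives no recipe. A common way to realize contextualized replacement is to mask one word of a caption and let a pretrained masked language model propose an in-context substitute; the sketch below takes that approach. The model choice (bert-base-multilingual-cased) and the pick-first-differing-candidate rule are assumptions for illustration, not BAN-Cap's exact method.

```python
# Hedged sketch of Contextualized Word Replacement (CWR) for caption
# augmentation: mask one word, let a masked LM fill it back in context.
import random
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

def cwr_augment(caption: str, top_k: int = 5) -> str:
    words = caption.split()
    i = random.randrange(len(words))  # position to replace, chosen at random
    masked = " ".join(words[:i] + [fill.tokenizer.mask_token] + words[i + 1:])
    # Keep the highest-scoring candidate that differs from the original word.
    for pred in fill(masked, top_k=top_k):
        candidate = pred["token_str"].strip()
        if candidate and candidate != words[i]:
            words[i] = candidate
            break
    return " ".join(words)

# "A boy is playing football in the field", with one word swapped in context.
print(cwr_augment("একটি ছেলে মাঠে ফুটবল খেলছে"))
```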
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
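To make the "constituency parse trees for encoding structured input" idea concrete, here is a small sketch that linearizes a parse tree into a bracketed token sequence, which an ordinary sequence encoder can then consume. The hand-written parse and the linearization scheme are illustrative assumptions; the paper's actual encoder is not described in this summary.

```python
# Sketch: linearize a constituency parse tree into a token sequence so a
# standard sequence encoder can see the syntactic structure. The parse string
# and linearization scheme are illustrative, not the paper's exact method.
from nltk import Tree

def linearize(tree) -> list[str]:
    """Depth-first traversal emitting brackets, phrase labels, and words."""
    if isinstance(tree, str):          # leaf: an actual word
        return [tree]
    tokens = ["(", tree.label()]
    for child in tree:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens

parse = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBZ chases) (NP (DT a) (NN ball))))"
)
print(linearize(parse))
# ['(', 'S', '(', 'NP', '(', 'DT', 'the', ')', ... ')', ')']
```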
- Fine-Grained Image Generation from Bangla Text Description using Attentional Generative Adversarial Network [0.0]
We propose Bangla Attentional Generative Adversarial Network (AttnGAN) that allows intensified, multi-stage processing for high-resolution Bangla text-to-image generation.
For the first time, a fine-grained image is generated from Bangla text using an attentional GAN.
arXiv Detail & Related papers (2021-09-24T05:31:01Z)
- Context-Aware Image Inpainting with Learned Semantic Priors [100.99543516733341]
We introduce pretext tasks that are semantically meaningful for estimating the missing contents.
We propose a context-aware image inpainting model, which adaptively integrates global semantics and local features.
arXiv Detail & Related papers (2021-06-14T08:09:43Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
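The dual-encoder-plus-contrastive-loss setup described in that summary can be written down compactly. The sketch below shows the symmetric image-to-text / text-to-image contrastive (InfoNCE-style) objective over a batch of paired embeddings; the encoders themselves are stubbed out with random features, and the temperature value is an assumption.

```python
# Symmetric contrastive loss for aligning a dual encoder: matched image/text
# pairs sit on the diagonal of the similarity matrix and are the positive
# class in both directions. The temperature value is an assumption.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)  # unit-length embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(len(img_emb))          # pair i matches pair i
    # Average of image-to-text and text-to-image cross-entropy losses.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for the outputs of the image and text towers.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```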
- Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
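As a concrete reference for the GCN-based reasoning step described above, here is a single graph-convolution layer in the standard Kipf-and-Welling form, applied to a toy graph whose nodes mix visual-object and scene-text features. The node counts, feature sizes, and fully connected adjacency are assumptions for illustration.

```python
# One graph-convolution layer: H' = relu(A_hat @ H @ W), where A_hat is the
# symmetrically normalized adjacency with self-loops. Node features here
# stand in for salient-object and scene-text (OCR) embeddings.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_hat = A + torch.eye(A.size(0))            # add self-loops
        deg = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))      # D^{-1/2}
        A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt    # symmetric normalization
        return torch.relu(A_norm @ self.weight(H))  # aggregate, then project

# Toy graph: 4 object nodes + 3 OCR-text nodes in a shared 256-d space,
# fully connected so every node can exchange information with every other.
H = torch.randn(7, 256)
A = torch.ones(7, 7) - torch.eye(7)
layer = GCNLayer(256, 128)
print(layer(H, A).shape)  # torch.Size([7, 128])
```

Stacking two or three such layers lets information propagate between objects and text regions, yielding the relationship-enhanced features the summary describes.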
- PhraseCut: Language-based Image Segmentation in the Wild [62.643450401286]
We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset poses significant challenges to the existing state-of-the-art.
arXiv Detail & Related papers (2020-08-03T20:58:53Z)