Towards Accurate Text-based Image Captioning with Content Diversity Exploration
- URL: http://arxiv.org/abs/2105.03236v1
- Date: Fri, 23 Apr 2021 08:57:47 GMT
- Title: Towards Accurate Text-based Image Captioning with Content Diversity Exploration
- Authors: Guanghui Xu, Shuaicheng Niu, Mingkui Tan, Yucheng Luo, Qing Du, Qi Wu
- Abstract summary: Text-based image captioning (TextCap), which aims to read and reason about images containing text, is crucial for a machine to understand a detailed and complex scene environment.
Existing methods attempt to extend traditional image captioning methods, which focus on describing the overall scene with one global caption, to solve this task.
This is infeasible because the complex textual and visual information cannot be described well within a single caption.
- Score: 46.061291298616354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based image captioning (TextCap), which aims to read and reason about images
containing text, is crucial for a machine to understand a detailed and complex scene
environment, considering that text is omnipresent in daily life. This task,
however, is very challenging because an image often contains complex textual and
visual information that is hard to describe comprehensively. Existing
methods attempt to extend traditional image captioning methods, which focus on
describing the overall scene with one global caption, to solve this task. This is
infeasible because the complex textual and visual information cannot be described
well within a single caption. To resolve this difficulty, we seek to generate
multiple captions that accurately describe different parts of an image in detail.
Achieving this goal raises three key challenges: 1) it is hard to decide which
parts of the texts in an image to copy or paraphrase; 2) it is non-trivial to
capture the complex relationships among the diverse texts in an image; 3) how to
generate multiple captions with diverse content is still an open problem. To
address these challenges, we propose a novel Anchor-Captioner method.
Specifically, we first find the important tokens that deserve more attention and
treat them as anchors. Then, for each chosen anchor, we group its relevant texts
to construct the corresponding anchor-centred graph (ACG). Finally, based on the
different ACGs, we conduct multi-view caption generation to improve the content
diversity of the generated captions. Experimental results show that our method
not only achieves state-of-the-art performance but also generates diverse
captions to describe images.
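The abstract sketches a three-step pipeline: pick important OCR tokens as anchors, group each anchor's related tokens into an anchor-centred graph (ACG), and decode one caption per ACG. The snippet below is a minimal, illustrative Python sketch of that control flow only; the token-scoring heuristic, the proximity-based grouping, and all names (OcrToken, pick_anchors, build_acg, caption_from_acg) are assumptions for illustration, not the authors' implementation, which learns anchor scores, ACG construction, and caption decoding end to end.

```python
# Minimal sketch of the Anchor-Captioner control flow described in the abstract.
# All names and heuristics here are illustrative assumptions; the actual model
# learns anchor selection, ACG construction, and caption decoding end to end.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OcrToken:
    text: str          # OCR string detected in the image
    x: float           # token position, used below as a toy relatedness signal
    y: float


def pick_anchors(tokens: List[OcrToken], k: int = 3) -> List[OcrToken]:
    """Step 1: choose the k most 'important' tokens as anchors.
    Toy heuristic (longer text = more salient); the paper learns these scores."""
    return sorted(tokens, key=lambda t: len(t.text), reverse=True)[:k]


def build_acg(anchor: OcrToken, tokens: List[OcrToken], radius: float = 100.0) -> Dict:
    """Step 2: group tokens related to the anchor into an anchor-centred graph (ACG).
    Spatial proximity stands in for the learned relations used in the paper."""
    neighbours = [t for t in tokens
                  if t is not anchor
                  and abs(t.x - anchor.x) + abs(t.y - anchor.y) < radius]
    return {"anchor": anchor, "neighbours": neighbours}


def caption_from_acg(acg: Dict) -> str:
    """Step 3: decode one caption per ACG (a trained decoder in the real model)."""
    words = [acg["anchor"].text] + [t.text for t in acg["neighbours"]]
    return "A sign that reads: " + " ".join(words)


def anchor_captioner(tokens: List[OcrToken], k: int = 3) -> List[str]:
    """One caption per ACG yields multiple, content-diverse captions per image."""
    return [caption_from_acg(build_acg(a, tokens)) for a in pick_anchors(tokens, k)]


if __name__ == "__main__":
    ocr = [OcrToken("STARBUCKS", 10, 10), OcrToken("COFFEE", 30, 25),
           OcrToken("EXIT", 400, 300), OcrToken("12", 410, 320)]
    for cap in anchor_captioner(ocr, k=2):
        print(cap)
```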
Related papers
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance [17.251982243534144]
LAR-Gen is a novel approach that enables seamless inpainting of masked scene images.
Our approach adopts a coarse-to-fine manner to ensure subject identity preservation and local semantic coherence.
Experiments and varied application scenarios demonstrate the superiority of LAR-Gen in terms of both identity preservation and text semantic consistency.
arXiv Detail & Related papers (2024-03-28T16:07:55Z)
- Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z)
- Generating image captions with external encyclopedic knowledge [1.452875650827562]
We create an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data.
Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base.
Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions.
arXiv Detail & Related papers (2022-10-10T16:09:21Z)
- Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story with a global consistency across dynamic scenes and characters.
Current works still struggle with output images' quality and consistency, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
arXiv Detail & Related papers (2022-08-03T21:01:47Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z)
- MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning [46.4308182215488]
A text-based image intuitively contains abundant and complex multimodal relational content.
We propose the Multimodal relAtional Graph adversarIal inferenCe framework for diverse and unpaired TextCap.
We validate the effectiveness of MAGIC in generating diverse captions from different relational information items of an image.
arXiv Detail & Related papers (2021-12-13T11:00:49Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)