HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
- URL: http://arxiv.org/abs/2305.16295v1
- Date: Thu, 25 May 2023 17:50:17 GMT
- Title: HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
- Authors: Chia-Wen Kuo and Zsolt Kira
- Abstract summary: We propose to regard the encodings as augmented views of the input image.
The image captioning model efficiently encodes each view independently with a shared encoder.
We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art.
- Score: 25.728621355173626
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A great deal of progress has been made in image captioning, driven by
research into how to encode the image using pre-trained models. This includes
visual encodings (e.g. image grid features or detected objects) and more
recently textual encodings (e.g. image tags or text descriptions of image
regions). As more advanced encodings are available and incorporated, it is
natural to ask: how to efficiently and effectively leverage the heterogeneous
set of encodings? In this paper, we propose to regard the encodings as
augmented views of the input image. The image captioning model efficiently
encodes each view independently with a shared encoder, and a contrastive loss is
incorporated across the encoded views in a novel way to improve their
representation quality and the model's data efficiency. Our proposed
hierarchical decoder then adaptively weighs the encoded views according to
their effectiveness for caption generation by first aggregating within each
view at the token level, and then across views at the view level. We
demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and
+12.9% CIDEr on Flickr30k compared to the state of the art, and conduct rigorous
analyses to demonstrate the importance of each part of our design.
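To make the two-level design concrete, here is a minimal PyTorch sketch of hierarchical aggregation over heterogeneous views: a shared encoder processes each view independently, attention pools tokens within each view, and learned weights fuse the pooled summaries across views. All module names, sizes, and the pooling scheme are illustrative assumptions, not the paper's exact implementation (the contrastive loss is omitted).

```python
import torch
import torch.nn as nn

class HierarchicalAggregator(nn.Module):
    """Two-level aggregation over encoded views (illustrative sketch).

    Level 1 pools tokens *within* each view; level 2 adaptively
    weighs the pooled summaries *across* views.
    """
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.shared_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.token_query = nn.Parameter(torch.randn(1, 1, d_model))  # within-view pooling query
        self.token_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.view_scorer = nn.Linear(d_model, 1)  # across-view weighting

    def forward(self, views: list) -> torch.Tensor:
        # views: list of (batch, num_tokens_v, d_model) tensors, one per augmented view.
        summaries = []
        for v in views:
            h = self.shared_encoder(v)                       # shared weights across all views
            q = self.token_query.expand(h.size(0), -1, -1)
            pooled, _ = self.token_attn(q, h, h)             # token-level aggregation
            summaries.append(pooled.squeeze(1))
        s = torch.stack(summaries, dim=1)                    # (batch, num_views, d_model)
        w = torch.softmax(self.view_scorer(s), dim=1)        # view-level weights
        return (w * s).sum(dim=1)                            # fused representation

grid = torch.randn(2, 49, 512)   # e.g. image grid features
tags = torch.randn(2, 10, 512)   # e.g. embedded image tags
fused = HierarchicalAggregator()(views=[grid, tags])
print(fused.shape)  # torch.Size([2, 512])
```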
Related papers
- FE-LWS: Refined Image-Text Representations via Decoder Stacking and Fused Encodings for Remote Sensing Image Captioning [0.15346678870160887]
This paper introduces a novel approach that integrates features from two distinct CNN-based encoders.
We also propose a weighted averaging technique to combine the outputs of all GRUs in the stacked decoder.
The results demonstrate that our fusion-based approach, along with the enhanced stacked decoder, significantly outperforms both the transformer-based state-of-the-art model and other LSTM-based baselines.
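A minimal sketch of the weighted-averaging idea over a stacked GRU decoder (the layer count, sizes, and softmax-normalized fusion weights are assumptions for illustration; the paper's exact fusion may differ):

```python
import torch
import torch.nn as nn

class WeightedStackedGRUDecoder(nn.Module):
    """Stacked GRUs whose per-layer outputs are fused by a learned
    weighted average (illustrative; not the paper's exact decoder)."""
    def __init__(self, d: int = 256, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.GRU(d, d, batch_first=True) for _ in range(num_layers)]
        )
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # learned fusion weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outs = []
        for gru in self.layers:
            x, _ = gru(x)          # feed each layer's output to the next
            outs.append(x)
        w = torch.softmax(self.layer_logits, dim=0)          # (num_layers,)
        stacked = torch.stack(outs, dim=0)                   # (L, batch, T, d)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)    # weighted average

dec = WeightedStackedGRUDecoder()
print(dec(torch.randn(2, 7, 256)).shape)  # torch.Size([2, 7, 256])
```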
arXiv Detail & Related papers (2025-02-13T12:54:13Z)
- CAT: Content-Adaptive Image Tokenization [92.2116487267877]
We introduce Content-Adaptive Tokenizer (CAT), which adjusts representation capacity based on the image content and encodes simpler images into fewer tokens.
We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image.
By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts inference throughput by 18.5%.
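A toy sketch of content-adaptive token allocation: a complexity score (standing in here for the LLM's caption-based prediction) selects a compression ratio and hence a token budget. The thresholds and ratios are invented for illustration.

```python
# Illustrative token-budget selection in the spirit of CAT. The complexity
# score stands in for the LLM's caption-based prediction; the ratios and
# thresholds below are placeholders, not the paper's values.
def tokens_for_image(complexity: float, side: int = 256, patch: int = 16) -> int:
    base_tokens = (side // patch) ** 2           # fixed-ratio baseline, e.g. 256 tokens
    if complexity < 0.33:                        # simple image -> compress harder
        ratio = 4
    elif complexity < 0.66:
        ratio = 2
    else:                                        # complex image -> keep more tokens
        ratio = 1
    return base_tokens // ratio

for c in (0.1, 0.5, 0.9):
    print(c, tokens_for_image(c))  # 0.1 -> 64, 0.5 -> 128, 0.9 -> 256
```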
arXiv Detail & Related papers (2025-01-06T16:28:47Z)
- A TextGCN-Based Decoding Approach for Improving Remote Sensing Image Captioning [0.15346678870160887]
We propose a novel encoder-decoder setup that deploys a Text Graph Convolutional Network (TextGCN) and multi-layer LSTMs.
The embeddings generated by TextGCN enhance the decoder's understanding by capturing the semantic relationships among words at both the sentence and corpus levels.
We present an extensive evaluation of our approach against various other state-of-the-art encoder-decoder frameworks.
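A minimal sketch of the graph-convolution step that produces such word embeddings: one normalized propagation over a word graph, whose output would feed the LSTM decoder. The toy graph and sizes are assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph-convolution layer (normalized A @ X, then a linear map),
    in the spirit of TextGCN's word-graph embeddings."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.lin((adj / deg) @ x))  # mean-normalized propagation

vocab, d = 5, 16
adj = torch.eye(vocab) + torch.rand(vocab, vocab).round()   # toy word graph, self-loops added
adj = ((adj + adj.T) > 0).float()                           # symmetrize
word_emb = GraphConv(d, d)(torch.randn(vocab, d), adj)      # embeddings for the LSTM decoder
print(word_emb.shape)  # torch.Size([5, 16])
```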
arXiv Detail & Related papers (2024-09-27T06:12:31Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions.
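A minimal sketch of the kNN lookup: retrieve the captions of the memory images most visually similar to the query, to be handed to the generator as extra context. The memory contents and the cosine-similarity choice are illustrative assumptions.

```python
import torch

# kNN retrieval over an external memory of (image feature, caption) pairs,
# keyed by visual similarity. Memory contents here are dummy data.
def knn_captions(query_feat, memory_feats, memory_caps, k=2):
    q = query_feat / query_feat.norm()
    m = memory_feats / memory_feats.norm(dim=1, keepdim=True)
    sims = m @ q                                  # cosine similarity to every memory entry
    idx = sims.topk(k).indices
    return [memory_caps[i] for i in idx.tolist()]

memory_feats = torch.randn(100, 512)              # features of memory images
memory_caps = [f"caption {i}" for i in range(100)]
retrieved = knn_captions(torch.randn(512), memory_feats, memory_caps)
print(retrieved)  # e.g. ['caption 42', 'caption 7'], passed to the decoder as context
```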
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
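One way to implement such an objective is to penalize overlap between the attention distributions of different heads; a hedged sketch follows (the penalty form and shapes are assumptions, not the paper's exact loss):

```python
import torch
import torch.nn as nn

# Diversity objective: discourage attention heads from focusing on the same
# tokens by penalizing pairwise overlap between their attention distributions.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(2, 10, 64)
_, attn_weights = attn(x, x, x, average_attn_weights=False)  # (batch, heads, 10, 10)

p = attn_weights.mean(dim=2)                     # each head's attention over tokens
overlap = p @ p.transpose(1, 2)                  # (batch, heads, heads) pairwise dot products
off_diag = overlap - torch.diag_embed(torch.diagonal(overlap, dim1=1, dim2=2))
diversity_loss = off_diag.sum(dim=(1, 2)).mean() # added to the matching loss, suitably scaled
print(diversity_loss.item())
```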
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization [73.52943587514386]
Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm.
We propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE), which encodes image regions into variable-length codes based on their information densities for accurate representation.
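A toy sketch of density-dependent code allocation: score each region's information density (here, local variance as a stand-in) and assign more codes to denser regions. Budgets and thresholds are invented for illustration.

```python
import torch

# Variable-length code assignment in the spirit of DQ-VAE. The density
# measure (per-region variance) and the budgets are placeholders.
def codes_per_region(image: torch.Tensor, grid: int = 4, budgets=(1, 4, 16)):
    c, h, w = image.shape
    gh, gw = h // grid, w // grid
    regions = image.unfold(1, gh, gh).unfold(2, gw, gw)      # (c, grid, grid, gh, gw)
    density = regions.var(dim=(0, 3, 4))                     # (grid, grid) information proxy
    lo, hi = density.quantile(0.33), density.quantile(0.66)
    lengths = torch.full_like(density, float(budgets[1]))
    lengths[density < lo] = budgets[0]                       # simple region: few codes
    lengths[density >= hi] = budgets[2]                      # dense region: many codes
    return lengths.long()

print(codes_per_region(torch.rand(3, 64, 64)))  # 4x4 grid of per-region code lengths
```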
arXiv Detail & Related papers (2023-05-19T14:56:05Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
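A loose sketch of one reflective decode step, read as "attend over past decoder states": the current LSTM state queries the history of previous states. This is an illustrative interpretation with placeholder sizes; the paper's module differs in detail.

```python
import torch
import torch.nn as nn

class ReflectiveStep(nn.Module):
    """One decode step that 'reflects' on the decoder's own history:
    the current LSTM hidden state attends over all previous hidden states."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.cell = nn.LSTMCell(d, d)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    def forward(self, x, state, past):
        h, c = self.cell(x, state)                          # (batch, d)
        past = torch.cat([past, h.unsqueeze(1)], dim=1)     # grow the history
        refined, _ = self.attn(h.unsqueeze(1), past, past)  # attend over history
        return refined.squeeze(1), (h, c), past

step = ReflectiveStep()
b, d = 2, 256
state = (torch.zeros(b, d), torch.zeros(b, d))
past = torch.zeros(b, 0, d)                                 # empty history
out, state, past = step(torch.randn(b, d), state, past)
print(out.shape, past.shape)  # torch.Size([2, 256]) torch.Size([2, 1, 256])
```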
arXiv Detail & Related papers (2022-06-16T07:56:28Z)
- MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining [68.05105411320842]
We propose a novel approach MaskOCR to unify vision and language pre-training in the classical encoder-decoder recognition framework.
We adopt the masked image modeling approach to pre-train the feature encoder using a large set of unlabeled real text images.
We transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder.
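A minimal masked-image-modeling step in this spirit: mask a subset of patch embeddings, encode, and regress the originals at the masked positions. The encoder, masking scheme, and loss are simplified placeholders, not MaskOCR's exact recipe.

```python
import torch
import torch.nn as nn

# Mask random patch embeddings of unlabeled text images and reconstruct them.
d, n_patches, mask_ratio = 128, 32, 0.5
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
mask_token = nn.Parameter(torch.zeros(1, 1, d))             # learnable [MASK] embedding

patches = torch.randn(8, n_patches, d)                      # unlabeled text-image patches
mask = torch.rand(8, n_patches) < mask_ratio                # random patch mask
corrupted = torch.where(mask.unsqueeze(-1), mask_token.expand_as(patches), patches)
recon = encoder(corrupted)
loss = ((recon - patches)[mask] ** 2).mean()                # reconstruct masked patches only
print(loss.item())
```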
arXiv Detail & Related papers (2022-06-01T08:27:19Z)
- Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation [1.2183405753834562]
We propose to use Inception-ResNet Convolutional Neural Network as encoder to extract features from images.
We also use Hierarchical Context-based Word Embeddings for word representations and a Deep Stacked Long Short-Term Memory (LSTM) network as decoder.
We evaluate our proposed methods with two image captioning frameworks: Decoder and Soft Attention.
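A minimal sketch of the encoder-to-decoder handoff: pooled CNN features initialize a deep LSTM stack that decodes over word embeddings. The 1536-dimensional feature size matches Inception-ResNet-v2's pooled output; all other sizes are illustrative.

```python
import torch
import torch.nn as nn

# CNN features condition a deep stack of LSTMs over word embeddings.
d, vocab, layers = 256, 1000, 3
img_proj = nn.Linear(1536, d)           # 1536 = Inception-ResNet-v2 pooled feature size
embed = nn.Embedding(vocab, d)          # stand-in for contextual word embeddings
lstm = nn.LSTM(d, d, num_layers=layers, batch_first=True)
head = nn.Linear(d, vocab)

feats = torch.randn(4, 1536)                       # pooled CNN image features
h0 = img_proj(feats).unsqueeze(0).repeat(layers, 1, 1)
c0 = torch.zeros_like(h0)
tokens = torch.randint(vocab, (4, 12))             # teacher-forced caption prefix
out, _ = lstm(embed(tokens), (h0, c0))
logits = head(out)                                 # next-word distribution per step
print(logits.shape)  # torch.Size([4, 12, 1000])
```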
arXiv Detail & Related papers (2021-02-22T18:15:39Z)