Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual
Context for Image Captioning
- URL: http://arxiv.org/abs/2205.04363v1
- Date: Mon, 9 May 2022 15:05:24 GMT
- Title: Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual
Context for Image Captioning
- Authors: Chia-Wen Kuo, Zsolt Kira
- Abstract summary: A key limitation of current methods is that the output of the model is conditioned only on the object detector's outputs.
We propose to add an auxiliary input to represent missing information such as object relationships.
We validate our method on image captioning, perform thorough analyses of each component and importance of the pre-trained multi-modal model, and demonstrate significant improvements over the current state of the art.
- Score: 25.728621355173626
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Significant progress has been made on visual captioning, largely relying on
pre-trained features and later fixed object detectors that serve as rich inputs
to auto-regressive models. A key limitation of such methods, however, is that
the output of the model is conditioned only on the object detector's outputs.
The assumption that such outputs can represent all necessary information is
unrealistic, especially when the detector is transferred across datasets. In
this work, we reason about the graphical model induced by this assumption, and
propose to add an auxiliary input to represent missing information such as
object relationships. We specifically propose to mine attributes and
relationships from the Visual Genome dataset and condition the captioning model
on them. Crucially, we propose (and show to be important) the use of a
multi-modal pre-trained model (CLIP) to retrieve such contextual descriptions.
Further, object detector models are frozen and do not have sufficient richness
to allow the captioning model to properly ground them. As a result, we propose
to condition both the detector and description outputs on the image, and show
qualitatively and quantitatively that this can improve grounding. We validate
our method on image captioning, perform thorough analyses of each component and
importance of the pre-trained multi-modal model, and demonstrate significant
improvements over the current state of the art, specifically +7.5% in CIDEr and
+1.3% in BLEU-4 metrics.
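As a rough, unofficial sketch of the retrieval step described above, the snippet below scores a small bank of attribute/relationship phrases (the kind that could be mined offline from Visual Genome) against an image with CLIP and keeps the top-k matches as auxiliary textual context for a captioning model. The phrase bank, function name, and choice of k are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code): retrieve contextual descriptions with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical examples of attribute/relationship phrases mined from Visual Genome.
phrase_bank = [
    "a dog chasing a frisbee",
    "a wooden bench in a park",
    "a person wearing a red jacket",
]

@torch.no_grad()
def retrieve_context(image: Image.Image, k: int = 2) -> list:
    """Return the k phrases whose CLIP text embeddings best match the image."""
    inputs = processor(text=phrase_bank, images=image,
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # cosine similarity needs
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)   # unit-norm embeddings
    scores = (img_emb @ txt_emb.T).squeeze(0)                # (len(phrase_bank),)
    return [phrase_bank[i] for i in scores.topk(k).indices.tolist()]
```

The retrieved phrases would then be encoded and fed to the captioning model as an auxiliary input alongside the frozen detector's region features, which is the conditioning structure the abstract argues for.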
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Learning Embeddings with Centroid Triplet Loss for Object Identification in Robotic Grasping [14.958823096408175]
Foundation models are a strong trend in deep learning and computer vision.
Here, we focus on training an object identification model for robotic grasping.
The key to training such a model is the centroid triplet loss (CTL), which aggregates image features into per-identity centroids (a hedged sketch follows this entry).
arXiv Detail & Related papers (2024-04-09T13:01:26Z)
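For readers unfamiliar with the term used in the entry above, the toy sketch below shows one way a centroid-style triplet loss can be written: embeddings sharing an identity are averaged into a centroid, and each embedding is pulled toward its own centroid and pushed away from the nearest other centroid by a margin. Batch construction, hard-example mining, and other details of the paper's actual CTL are omitted; the function name and margin value are assumptions.

```python
# Toy centroid-style triplet loss over a mini-batch (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def centroid_triplet_loss(embeddings: torch.Tensor,
                          labels: torch.Tensor,
                          margin: float = 0.3) -> torch.Tensor:
    classes = labels.unique()
    # One centroid per object identity present in the batch.
    centroids = torch.stack([embeddings[labels == c].mean(dim=0) for c in classes])
    losses = []
    for i, c in enumerate(classes):
        anchors = embeddings[labels == c]                              # (n_c, D)
        d_pos = torch.cdist(anchors, centroids[i:i + 1]).squeeze(1)    # to own centroid
        neg_centroids = torch.cat([centroids[:i], centroids[i + 1:]])
        if neg_centroids.numel() == 0:                                 # single-identity batch
            continue
        d_neg = torch.cdist(anchors, neg_centroids).min(dim=1).values  # nearest other centroid
        losses.append(F.relu(d_pos - d_neg + margin))
    return torch.cat(losses).mean() if losses else embeddings.new_zeros(())
```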
- Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-awareness video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- FuseCap: Leveraging Large Language Models for Enriched Fused Image Captions [11.274127953112574]
We propose an automated approach to augmenting existing captions with visual details using "frozen" vision experts.
Our proposed method, FuseCap, fuses the outputs of such vision experts with the original captions using a large language model.
We release this large-scale dataset of enriched image-caption pairs for the community.
arXiv Detail & Related papers (2023-05-28T13:16:03Z)
- Few-shot Domain-Adaptive Visually-fused Event Detection from Text [13.189886554546929]
We present a novel domain-adaptive visually-fused event detection approach that can be trained on a few labelled image-text paired data points.
Specifically, we introduce a visual imaginator method that synthesises images from text in the absence of visual context.
Our model can leverage the capabilities of pre-trained vision-language models and can be trained in a few-shot setting.
arXiv Detail & Related papers (2023-05-04T00:10:57Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- A Comprehensive Study of Image Classification Model Sensitivity to Foregrounds, Backgrounds, and Visual Attributes [58.633364000258645]
We call this dataset RIVAL10; it consists of roughly 26k instances over 10 classes.
We evaluate the sensitivity of a broad set of models to noise corruptions in foregrounds, backgrounds and attributes.
In our analysis, we consider diverse state-of-the-art architectures (ResNets, Transformers) and training procedures (CLIP, SimCLR, DeiT, Adversarial Training).
arXiv Detail & Related papers (2022-01-26T06:31:28Z)
- MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding [40.24656027709833]
We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model (a rough sketch of such early fusion follows this entry).
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
arXiv Detail & Related papers (2021-04-26T17:55:33Z)
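The MDETR entry above attributes its gains to fusing the two modalities early in the model; the sketch below illustrates the general idea of such early fusion (project image features and text token embeddings to a shared width, concatenate, and run a single transformer encoder over the joint sequence). Module names and dimensions are assumptions, and this is not MDETR's actual implementation, which additionally predicts boxes and text-span alignments from the joint representation.

```python
# Rough early-fusion sketch: one encoder attends over concatenated image and text tokens.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, d_model=256, nhead=8, num_layers=6):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, d_model)   # project CNN region/grid features
        self.txt_proj = nn.Linear(txt_dim, d_model)   # project text token embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_img, img_dim), txt_feats: (B, N_txt, txt_dim)
        fused = torch.cat([self.img_proj(img_feats), self.txt_proj(txt_feats)], dim=1)
        return self.encoder(fused)  # joint sequence; both modalities attend to each other

# e.g. EarlyFusionEncoder()(torch.randn(2, 49, 2048), torch.randn(2, 12, 768)) -> (2, 61, 256)
```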
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to that of a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.