IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
- URL: http://arxiv.org/abs/2409.18046v1
- Date: Thu, 26 Sep 2024 16:47:32 GMT
- Title: IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
- Authors: Soeun Lee, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim
- Abstract summary: We propose a novel approach called Image-like Retrieval, which aligns text features with visually relevant features to mitigate the modality gap.
Our method further enhances the accuracy of generated captions by designing a Fusion Module that integrates retrieved captions with input features.
- Score: 3.8265756182141346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in image captioning have explored text-only training
methods to overcome the limitations of paired image-text data. However,
existing text-only training methods often overlook the modality gap between
using text data during training and employing images during inference. To
address this issue, we propose a novel approach called Image-like Retrieval,
which aligns text features with visually relevant features to mitigate the
modality gap. Our method further enhances the accuracy of generated captions by
designing a Fusion Module that integrates retrieved captions with input
features. Additionally, we introduce a Frequency-based Entity Filtering
technique that significantly improves caption quality. We integrate these
methods into a unified framework, which we refer to as IFCap
($\textbf{I}$mage-like Retrieval and $\textbf{F}$requency-based Entity
Filtering for Zero-shot $\textbf{Cap}$tioning). Through extensive
experiments, our straightforward yet powerful approach demonstrates its
efficacy, outperforming state-of-the-art text-only-trained zero-shot
captioning methods by a significant margin on both image captioning and
video captioning.
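To make the abstract's components concrete, the sketch below illustrates one plausible reading of Image-like Retrieval and Frequency-based Entity Filtering; it is not the authors' implementation. It assumes OpenAI CLIP (ViT-B/32) is available, and the function names, the Gaussian noise scale used to push text queries toward image-like features, the stopword-based entity heuristic, and the frequency threshold are all illustrative assumptions; the Fusion Module and the caption decoder are omitted.

```python
# Illustrative sketch only (not the IFCap code): retrieval with noise-perturbed
# text queries plus frequency-based entity filtering. The model choice,
# noise_std, min_count, and all function names are assumptions.
# Requires: pip install torch git+https://github.com/openai/CLIP.git
import re
from collections import Counter

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)


@torch.no_grad()
def embed_texts(texts):
    """Return L2-normalized CLIP text embeddings for a list of strings."""
    tokens = clip.tokenize(texts, truncate=True).to(device)
    feats = model.encode_text(tokens).float()
    return feats / feats.norm(dim=-1, keepdim=True)


@torch.no_grad()
def image_like_retrieve(queries, caption_bank, bank_feats, k=5, noise_std=0.04):
    """Retrieve top-k captions for each text query after perturbing the query
    with Gaussian noise, so it behaves more like an image embedding at
    inference time (one plausible reading of Image-like Retrieval)."""
    q = embed_texts(queries)
    q = q + noise_std * torch.randn_like(q)   # push the query off the pure text manifold
    q = q / q.norm(dim=-1, keepdim=True)
    sims = q @ bank_feats.T                   # cosine similarity against the caption bank
    topk = sims.topk(k, dim=-1).indices
    return [[caption_bank[j] for j in row.tolist()] for row in topk]


STOPWORDS = {"a", "an", "the", "of", "on", "in", "with", "and", "to", "at",
             "is", "are", "for", "his", "her", "their", "two", "other", "each"}


def frequency_entity_filter(retrieved_captions, min_count=2):
    """Keep candidate entity words that occur in at least `min_count` of the
    retrieved captions; rarely mentioned words are treated as noise and dropped.
    (A POS tagger or noun-phrase chunker would normally isolate nouns here;
    a stopword filter keeps this sketch dependency-free.)"""
    counts = Counter()
    for cap in retrieved_captions:
        words = set(re.findall(r"[a-z]+", cap.lower())) - STOPWORDS
        counts.update(words)
    return [w for w, c in counts.items() if c >= min_count]


if __name__ == "__main__":
    bank = [
        "a dog runs along a sandy beach",
        "a man rides a surfboard on a wave",
        "a dog plays with a ball in the sand",
        "two dogs chase each other on the shore",
    ]
    retrieved = image_like_retrieve(["a dog on the beach"], bank, embed_texts(bank), k=3)
    print(frequency_entity_filter(retrieved[0]))  # words mentioned by at least two retrieved captions
```

In this reading, the noise injection narrows the train/inference modality gap because retrieval queries no longer sit exactly on the text manifold, and the frequency threshold discards entities that only a single retrieved caption mentions, which matches the abstract's intuition of filtering out spurious entities before caption generation.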
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- MeaCap: Memory-Augmented Zero-shot Image Captioning [11.817667500151687]
We propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap)
MeaCap can generate concept-centered captions with fewer hallucinations and more world knowledge.
arXiv Detail & Related papers (2024-03-06T14:00:31Z)
- Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training [14.340740609933437]
We propose a novel zero-shot image captioning framework with text-only training to reduce the modality gap.
In particular, we introduce a subregion feature aggregation to leverage local region information.
We extend our framework to build a zero-shot VQA pipeline, demonstrating its generality.
arXiv Detail & Related papers (2024-01-04T16:43:46Z)
- Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning [13.357749288588039]
Previous works leverage CLIP's cross-modal association ability for image captioning, relying solely on textual information under unsupervised settings.
This paper proposes a novel method to address these issues by incorporating synthetic image-text pairs.
A pre-trained text-to-image model is deployed to obtain images that correspond to textual data, and the pseudo features of generated images are optimized toward the real ones in the CLIP embedding space.
arXiv Detail & Related papers (2023-12-14T12:39:29Z)
- Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework that exploits the knowledge of multimodal captioning models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate cross-modal and cross-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features [9.140696309404133]
We present a simple yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings.
We show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised models.
arXiv Detail & Related papers (2022-12-20T19:12:58Z)
- Partially-supervised novel object captioning leveraging context from paired data [11.215352918313577]
We create synthetic paired captioning data for novel objects by leveraging context from existing image-caption pairs.
We further re-use these partially paired images with novel objects to create pseudo-label captions.
Our approach achieves state-of-the-art results on the held-out MS COCO out-of-domain test split.
arXiv Detail & Related papers (2021-09-10T21:31:42Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)