Focus! Relevant and Sufficient Context Selection for News Image Captioning
- URL: http://arxiv.org/abs/2212.00843v1
- Date: Thu, 1 Dec 2022 20:00:27 GMT
- Title: Focus! Relevant and Sufficient Context Selection for News Image Captioning
- Authors: Mingyang Zhou, Grace Luo, Anna Rohrbach, Zhou Yu
- Abstract summary: News Image Captioning requires describing an image by leveraging additional context from a news article.
We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article.
Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models.
- Score: 69.36678144800936
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: News Image Captioning requires describing an image by leveraging additional
context from a news article. Previous works only coarsely leverage the article
to extract the necessary context, which makes it challenging for models to
identify relevant events and named entities. In our paper, we first demonstrate
that by combining more fine-grained context that captures the key named
entities (obtained via an oracle) and the global context that summarizes the
news, we can dramatically improve the model's ability to generate accurate news
captions. This raises the question: how can we automatically extract such key
entities from an image? We propose to use the pre-trained vision and language
retrieval model CLIP to localize the visually grounded entities in the news
article and then capture the non-visual entities via an open relation
extraction model. Our experiments demonstrate that by simply selecting a better
context from the article, we can significantly improve the performance of
existing models and achieve new state-of-the-art performance on multiple
benchmarks.
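As a concrete illustration of the visually grounded selection step, the sketch below is not the authors' released code: it assumes the HuggingFace `transformers` library and the `openai/clip-vit-base-patch32` checkpoint, and shows one way to rank article sentences by CLIP image-text similarity and keep only the top-scoring ones as captioning context.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# score each article sentence against the news image with CLIP and
# keep the k best-matching sentences as context for a caption model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(MODEL_NAME).eval()
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def select_context(image_path: str, sentences: list[str], k: int = 4) -> list[str]:
    """Return the k article sentences that CLIP scores as most relevant to the image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=sentences, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image[0]  # one similarity score per sentence
    top = scores.topk(min(k, len(sentences))).indices.tolist()
    return [sentences[i] for i in sorted(top)]  # preserve article order
```

In the paper's framing, sentences selected this way would be combined with a global summary of the article and fed as textual context to an existing captioning model; non-visual entities are handled separately via open relation extraction, which is not sketched here.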
Related papers
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification [9.996589403019675]
Person re-identification (ReID) has recently benefited from large pretrained vision-language models such as Contrastive Language-Image Pre-Training (CLIP).
We propose one straightforward solution by leveraging existing image captioning models to generate pseudo captions for person images.
We introduce CLIP-SCGI, a framework that leverages synthesized captions to guide the learning of discriminative and robust representations.
arXiv Detail & Related papers (2024-10-12T06:24:33Z)
- Towards Retrieval-Augmented Architectures for Image Captioning [81.11529834508424]
This work presents a novel approach towards developing image captioning models that utilize an external kNN memory to improve the generation process.
Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities.
We experimentally validate our approach on the COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions (a minimal kNN retrieval sketch in this spirit appears after this list).
arXiv Detail & Related papers (2024-05-21T18:02:07Z)
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in the input image and then combines the entities relevant to the text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed existing methods for the same task, as well as recent open-vocabulary segmentation models, on all benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- COSA: Concatenated Sample Pretrained Vision-Language Foundation Model [78.32081709802873]
Most vision-language foundation models employ image-text datasets for pretraining.
We propose COSA, a COncatenated SAmple pretrained vision-language foundation model.
We achieve this by sequentially concatenating multiple image-text pairs as inputs for pretraining.
This transformation effectively converts existing image-text corpora into a pseudo long-form video-paragraph corpus.
arXiv Detail & Related papers (2023-06-15T12:29:42Z)
- "Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning [40.01197694624958]
We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model.
Our approach aims to overcome the context-independent nature of existing approaches, in which image and text are treated independently.
Our system achieves state-of-the-art results with an improvement of up to 8.34 CIDEr score on the benchmark news image captioning datasets.
arXiv Detail & Related papers (2023-06-01T17:34:25Z)
- Visual News: Benchmark and Challenges in News Image Captioning [18.865262609683676]
We propose Visual News Captioner, an entity-aware model for the task of news image captioning.
We also introduce Visual News, a large-scale benchmark consisting of more than one million news images.
arXiv Detail & Related papers (2020-10-08T03:07:00Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
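For the retrieval-augmented captioning entry above, the following is a minimal sketch of the retrieval step only, assuming precomputed image embeddings (e.g., CLIP features) and scikit-learn's NearestNeighbors; it illustrates kNN caption retrieval by visual similarity, not the cited paper's architecture.

```python
# Minimal sketch (assumptions: precomputed image embeddings, one per memory image;
# not the cited paper's code): retrieve captions of the k most visually similar
# images from an external memory to condition caption generation.
import numpy as np
from sklearn.neighbors import NearestNeighbors

class CaptionMemory:
    def __init__(self, image_embeddings: np.ndarray, captions: list[str]):
        # image_embeddings: (N, D) array of visual features for the memory images
        self.captions = captions
        self.index = NearestNeighbors(metric="cosine").fit(image_embeddings)

    def retrieve(self, query_embedding: np.ndarray, k: int = 5) -> list[str]:
        """Return the captions attached to the k memory images nearest to the query."""
        _, idx = self.index.kneighbors(query_embedding.reshape(1, -1), n_neighbors=k)
        return [self.captions[i] for i in idx[0]]
```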
This list is automatically generated from the titles and abstracts of the papers on this site.