Visually-Aware Context Modeling for News Image Captioning
- URL: http://arxiv.org/abs/2308.08325v2
- Date: Thu, 21 Mar 2024 14:31:56 GMT
- Title: Visually-Aware Context Modeling for News Image Captioning
- Authors: Tingyu Qu, Tinne Tuytelaars, Marie-Francine Moens
- Abstract summary: News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
- Score: 54.31708859631821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: News Image Captioning aims to create captions from news articles and images, emphasizing the connection between textual context and visual elements. Recognizing the significance of human faces in news images and the face-name co-occurrence pattern in existing datasets, we propose a face-naming module for learning better name embeddings. Apart from names, which can be directly linked to an image area (faces), news image captions mostly contain context information that can only be found in the article. We design a retrieval strategy using CLIP to retrieve sentences that are semantically close to the image, mimicking the human thought process of linking articles to images. Furthermore, to tackle the imbalanced proportion of article context and image context in captions, we introduce a simple yet effective method, Contrasting with Language Model backbone (CoLaM), to the training pipeline. We conduct extensive experiments to demonstrate the efficacy of our framework. We outperform the previous state-of-the-art (without external data) by 7.97/5.80 CIDEr scores on GoodNews/NYTimes800k. Our code is available at https://github.com/tingyu215/VACNIC.
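The snippet below is a minimal sketch (not the authors' released code) of the CLIP-based retrieval step described in the abstract: every sentence in the article is scored against the image, and the top-k most similar sentences are kept as additional context. The checkpoint name and the top_k value are illustrative assumptions.

```python
# Minimal sketch of CLIP-based sentence retrieval for News Image Captioning.
# Assumptions (not from the paper): openai/clip-vit-base-patch32 checkpoint, top_k=4.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_sentences(image_path: str, sentences: list[str], top_k: int = 4) -> list[str]:
    """Return the article sentences that are most semantically similar to the image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=sentences, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, num_sentences): image-to-sentence similarity scores
    scores = outputs.logits_per_image.squeeze(0)
    top_idx = scores.topk(min(top_k, len(sentences))).indices.tolist()
    return [sentences[i] for i in top_idx]
```

The retrieved sentences would then be concatenated with the rest of the model input as the image-grounded context; how they are fused with the caption decoder is specific to the paper's framework and not shown here.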
Related papers
- Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability [5.111382868644429]
We focus on whether a news image represents the actors discussed in the news text.
We introduce NewsTT, a dataset of 1000 news thumbnail images and text pairs.
We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to the hypothesis.
arXiv Detail & Related papers (2024-02-17T01:27:29Z)
- Visual Semantic Relatedness Dataset for Image Captioning [27.788077963411624]
We propose a textual visual context dataset for captioning, in which the dataset COCO Captions has been extended with information about the scene.
This information can be used to leverage any NLP task, such as text similarity or semantic relation methods, into captioning systems.
arXiv Detail & Related papers (2023-01-20T20:04:35Z)
- ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions [6.066100464517522]
Real-world image-caption pairs present in domains such as news data do not use simple and directly descriptive captions.
We launch ANNA, an Abstractive News captioNs dAtaset extracted from online news articles in a variety of different contexts.
We show that techniques such as transfer learning achieve only limited success in understanding abstractive captions and fail to consistently learn the relationships between content and context features.
arXiv Detail & Related papers (2023-01-05T17:19:01Z)
- Focus! Relevant and Sufficient Context Selection for News Image Captioning [69.36678144800936]
News Image Captioning requires describing an image by leveraging additional context from a news article.
We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article.
Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models.
arXiv Detail & Related papers (2022-12-01T20:00:27Z)
- I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification [123.90912800376039]
Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
arXiv Detail & Related papers (2022-09-21T12:18:31Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs [82.93345261434943]
Given an input image, and nothing else, our method returns the bounding boxes of objects in the image and phrases that describe the objects.
This is achieved within an open world paradigm, in which the objects in the input image may not have been encountered during the training of the localization mechanism.
Our work generalizes weakly supervised segmentation and phrase grounding and is shown empirically to outperform the state of the art in both domains.
arXiv Detail & Related papers (2022-06-19T09:07:30Z)
- ICECAP: Information Concentrated Entity-aware Image Captioning [41.53906032024941]
We propose an entity-aware news image captioning task to generate informative captions.
Our model first creates coarse concentration on relevant sentences using a cross-modality retrieval model.
Experiments on both BreakingNews and GoodNews datasets demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2021-08-04T13:27:51Z)
- Transform and Tell: Entity-Aware News Image Captioning [77.4898875082832]
We propose an end-to-end model which generates captions for images embedded in news articles.
We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism.
We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair encoding to generate captions as a sequence of word parts (a brief illustration follows this entry).
arXiv Detail & Related papers (2020-04-17T05:44:37Z)
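The Transform and Tell entry above generates captions as sequences of word parts via byte-pair encoding, so rare entity names never fall out of the vocabulary. The snippet below is a minimal illustration of that sub-word splitting with an off-the-shelf BPE tokenizer; the roberta-base checkpoint is an assumption for illustration, not necessarily the tokenizer used in that paper.

```python
# Minimal illustration of byte-pair encoding for caption generation.
# Assumption: roberta-base BPE tokenizer, chosen only for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

caption = "Chancellor Angela Merkel visited the Reichstag on Tuesday."
pieces = tokenizer.tokenize(caption)   # BPE pieces; rare names split into known sub-word units
ids = tokenizer.encode(caption)        # token ids a caption decoder would generate one by one
print(pieces)
print(tokenizer.decode(ids, skip_special_tokens=True))  # round-trips to the original caption
```

Because every caption can be expressed as a sequence of such known pieces, the decoder never needs an out-of-vocabulary token even for previously unseen names.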