Visual News: Benchmark and Challenges in News Image Captioning
- URL: http://arxiv.org/abs/2010.03743v3
- Date: Mon, 13 Sep 2021 18:53:35 GMT
- Title: Visual News: Benchmark and Challenges in News Image Captioning
- Authors: Fuxiao Liu and Yinghan Wang and Tianlu Wang and Vicente Ordonez
- Abstract summary: We propose Visual News Captioner, an entity-aware model for the task of news image captioning.
We also introduce Visual News, a large-scale benchmark consisting of more than one million news images.
- Score: 18.865262609683676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Visual News Captioner, an entity-aware model for the task of news
image captioning. We also introduce Visual News, a large-scale benchmark
consisting of more than one million news images along with associated news
articles, image captions, author information, and other metadata. Unlike the
standard image captioning task, news images depict situations where people,
locations, and events are of paramount importance. Our proposed method can
effectively combine visual and textual features to generate captions with
richer information such as events and entities. More specifically, built upon
the Transformer architecture, our model is further equipped with novel
multi-modal feature fusion techniques and attention mechanisms, which are
designed to generate named entities more accurately. Our method uses far
fewer parameters while achieving slightly better prediction results than
competing methods. Our larger and more diverse Visual News dataset further
highlights the remaining challenges in captioning news images.
Related papers
- Image Captioning in news report scenario [12.42658463552019]
We explore the realm of image captioning specifically tailored for celebrity photographs.
This exploration aims to augment automated news content generation, thereby facilitating a more nuanced dissemination of information.
arXiv Detail & Related papers (2024-03-24T16:08:10Z)
- Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability [5.111382868644429]
We focus on whether a news image represents the actors discussed in the news text.
We introduce NewsTT, a dataset of 1,000 news thumbnail image and text pairs.
We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to the hypothesis.
arXiv Detail & Related papers (2024-02-17T01:27:29Z)
- Video Summarization: Towards Entity-Aware Captions [73.28063602552741]
We propose the task of summarizing news video directly to entity-aware captions.
We show that our approach generalizes to an existing news image captioning dataset.
arXiv Detail & Related papers (2023-12-01T23:56:00Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Focus! Relevant and Sufficient Context Selection for News Image Captioning [69.36678144800936]
News Image Captioning requires describing an image by leveraging additional context from a news article.
We propose to use the pre-trained vision and language retrieval model CLIP to localize the visually grounded entities in the news article.
Our experiments demonstrate that by simply selecting a better context from the article, we can significantly improve the performance of existing models.
arXiv Detail & Related papers (2022-12-01T20:00:27Z)
- Word-Level Fine-Grained Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images that narrates each sentence of a multi-sentence story, with global consistency across dynamic scenes and characters.
Current works still struggle with output images' quality and consistency, and rely on additional semantic information or auxiliary captioning networks.
We first introduce a new sentence representation, which incorporates word information from all story sentences to mitigate the inconsistency problem.
Then, we propose a new discriminator with fusion features to improve image quality and story consistency.
arXiv Detail & Related papers (2022-08-03T21:01:47Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Journalistic Guidelines Aware News Image Captioning [8.295819830685536]
News article image captioning aims to generate descriptive and informative captions for news article images.
Unlike conventional image captions that simply describe the content of the image in general terms, news image captions rely heavily on named entities to describe the image content.
We propose a new approach to this task, motivated by caption guidelines that journalists follow.
arXiv Detail & Related papers (2021-09-07T04:49:50Z)
- Transform and Tell: Entity-Aware News Image Captioning [77.4898875082832]
We propose an end-to-end model which generates captions for images embedded in news articles.
We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism.
We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts.
arXiv Detail & Related papers (2020-04-17T05:44:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.