NewsStories: Illustrating articles with visual summaries
- URL: http://arxiv.org/abs/2207.13061v1
- Date: Tue, 26 Jul 2022 17:34:11 GMT
- Title: NewsStories: Illustrating articles with visual summaries
- Authors: Reuben Tan, Bryan A. Plummer, Kate Saenko, JP Lewis, Avneesh Sud,
Thomas Leung
- Abstract summary: We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
- Score: 49.924916589209374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent self-supervised approaches have used large-scale image-text datasets
to learn powerful representations that transfer to many tasks without
finetuning. These methods often assume that there is a one-to-one correspondence
between images and their (short) captions. However, many tasks require
reasoning about multiple images and long text narratives, such as describing
news articles with visual summaries. Thus, we explore a novel setting where the
goal is to learn a self-supervised visual-language representation that is
robust to varying text length and the number of images. In addition, unlike
prior work which assumed captions have a literal relation to the image, we
assume images only contain loose illustrative correspondence with the text. To
explore this problem, we introduce a large-scale multimodal dataset containing
over 31M articles, 22M images and 1M videos. We show that state-of-the-art
image-text alignment methods are not robust to longer narratives with multiple
images. Finally, we introduce an intuitive baseline that outperforms these
methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
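The abstract does not spell out the baseline, but the zero-shot image-set retrieval setting itself is easy to illustrate. The sketch below scores candidate image sets against an article by aggregating per-image similarities from a generic dual encoder; the encode_text / encode_images stand-ins and the mean/max aggregation are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of zero-shot image-set retrieval: rank candidate image sets
# for an article by aggregating per-image similarities. The encoders below are
# random stand-ins; in practice they would come from a pretrained dual encoder
# (e.g. a CLIP-style model).
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode_text(article: str) -> np.ndarray:
    # Placeholder: a real system would embed the full article text.
    return normalize(rng.standard_normal(512))

def encode_images(image_set) -> np.ndarray:
    # Placeholder: one embedding per image in the set.
    return normalize(rng.standard_normal((len(image_set), 512)))

def set_score(text_emb, image_embs, agg="mean"):
    sims = image_embs @ text_emb          # cosine similarity per image
    return sims.mean() if agg == "mean" else sims.max()

article_emb = encode_text("Full news article text ...")
candidate_sets = [["a.jpg", "b.jpg"], ["c.jpg", "d.jpg", "e.jpg"]]
scores = [set_score(article_emb, encode_images(s)) for s in candidate_sets]
best = int(np.argmax(scores))
print(f"retrieved image set #{best} with score {scores[best]:.3f}")
```

Mean pooling treats every image in a set equally, while max pooling favors sets containing at least one strongly matching image; either choice is a design decision rather than something specified by the abstract.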
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model designed for tasks that involve multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- DreamLIP: Language-Image Pre-training with Long Captions [42.4063624671045]
We re-caption 30M images with detailed descriptions using a pre-trained Multi-modality Large Language Model (MLLM).
Building on these long captions, we propose to dynamically sample sub-captions from the text to construct multiple positive pairs.
Notably, on image-text retrieval and semantic segmentation, our model trained on 30M image-text pairs performs on par with or better than CLIP trained on 400M pairs.
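As a rough illustration of the sub-caption idea summarized above (not DreamLIP's exact recipe), one can split a long caption into sentences and sample several of them as positives for the same image; the sentence-splitting rule and sampling policy here are assumptions for the sketch.

```python
# Minimal sketch of sub-caption sampling: split a long, detailed caption into
# sentences and draw several of them as positives for the same image, so one
# image contributes multiple image-text pairs to a contrastive batch.
import random
import re

def sample_subcaptions(long_caption: str, k: int = 3, seed: int = 0):
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", long_caption)
                 if s.strip()]
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))

caption = ("A cyclist rides along a coastal road at sunset. "
           "Waves crash against the rocks below. "
           "A lighthouse stands in the distance.")
positives = sample_subcaptions(caption, k=2)
# Each (image, sub-caption) pair would act as a positive in the contrastive
# loss; other images in the batch serve as negatives.
print(positives)
```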
arXiv Detail & Related papers (2024-03-25T17:59:42Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding [6.4901484665257545]
We propose a novel multi-head self-attention network that captures various components of visual and textual data by attending to their important parts.
Our approach achieves new state-of-the-art results on image-text retrieval on the MS-COCO and Flickr30K datasets.
arXiv Detail & Related papers (2020-01-11T05:50:19Z)
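To make the multi-head attention idea concrete, here is a minimal sketch of attention pooling over region (or word) features in the spirit of the MHSAN summary above; the layer sizes, random stand-in weights, and function names are assumptions, not the published architecture.

```python
# Minimal sketch of multi-head attention pooling: each head learns its own
# attention distribution over region features and yields one pooled vector,
# so the final embedding captures several distinct components of the input.
# The weights here are random stand-ins for learned parameters.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_pool(features, num_heads=4):
    # features: (n_regions, dim) region or token features
    n, dim = features.shape
    w1 = rng.standard_normal((dim, 128)) * 0.02        # hidden projection
    w2 = rng.standard_normal((128, num_heads)) * 0.02  # one score per head
    attn = softmax(np.tanh(features @ w1) @ w2, axis=0)  # (n_regions, num_heads)
    pooled = attn.T @ features                          # (num_heads, dim)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

regions = rng.standard_normal((36, 512))           # e.g. 36 detected regions
embedding = multi_head_attention_pool(regions)     # (4, 512) multi-view embedding
print(embedding.shape)
```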
This list is automatically generated from the titles and abstracts of the papers on this site.