Multi-Image Summarization: Textual Summary from a Set of Cohesive Images
- URL: http://arxiv.org/abs/2006.08686v1
- Date: Mon, 15 Jun 2020 18:45:35 GMT
- Title: Multi-Image Summarization: Textual Summary from a Set of Cohesive Images
- Authors: Nicholas Trieu, Sebastian Goodman, Pradyumna Narayana, Kazoo Sone, Radu Soricut
- Abstract summary: This paper proposes the new task of multi-image summarization.
It aims to generate a concise and descriptive textual summary given a coherent set of input images.
A dense average image feature aggregation network allows the model to focus on a coherent subset of attributes.
- Score: 17.688344968462275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-sentence summarization is a well-studied problem in NLP, while
generating image descriptions for a single image is a well-studied problem in
Computer Vision. However, for applications such as image cluster labeling or
web page summarization, summarizing a set of images is also a useful and
challenging task. This paper proposes the new task of multi-image
summarization, which aims to generate a concise and descriptive textual summary
given a coherent set of input images. We propose a model that extends the
Transformer-based single-image captioning architecture to multiple input
images. A dense average image feature aggregation network allows the model
to focus on a coherent subset of attributes across the input images. We explore
various input representations to the Transformer network and empirically show
that aggregated image features are superior to individual image embeddings. We
additionally show that the performance of the model is further improved by
pretraining the model parameters on a single-image captioning task, which
appears to be particularly effective in eliminating hallucinations in the
output.
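
The aggregation idea can be made concrete with a short sketch. Below is a minimal, illustrative PyTorch implementation, not the authors' released code: per-image feature vectors are projected and mean-pooled into a single aggregated vector, which a Transformer decoder attends to while generating the summary. All names, dimensions, and hyperparameters (MultiImageSummarizer, feat_dim=2048, d_model=512, etc.) are assumptions for illustration; positional encodings and padding masks are omitted for brevity.

```python
import torch
import torch.nn as nn


class MultiImageSummarizer(nn.Module):
    """Sketch of multi-image summarization via dense average feature aggregation."""

    def __init__(self, feat_dim=2048, d_model=512, vocab_size=30000,
                 num_layers=4, num_heads=8):
        super().__init__()
        # Project pre-extracted image features (e.g. pooled CNN features)
        # into the decoder's embedding space.
        self.proj = nn.Linear(feat_dim, d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, summary_tokens):
        # image_feats: (batch, num_images, feat_dim), one vector per image.
        # "Dense average" aggregation: mean-pool over the image axis so the
        # decoder attends to a single vector shared by the whole image set.
        memory = self.proj(image_feats).mean(dim=1, keepdim=True)   # (B, 1, d_model)

        # summary_tokens: (batch, seq_len) token ids of the target summary.
        tgt = self.token_emb(summary_tokens)                        # (B, T, d_model)
        seq_len = summary_tokens.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal_mask)    # (B, T, d_model)
        return self.lm_head(hidden)                                 # (B, T, vocab_size)


# Toy usage: 2 image sets of 5 images each, 12-token target summaries.
model = MultiImageSummarizer()
feats = torch.randn(2, 5, 2048)
tokens = torch.randint(0, 30000, (2, 12))
logits = model(feats, tokens)   # -> shape (2, 12, 30000)
```

Per the abstract, passing the mean-pooled memory rather than the full sequence of individual per-image embeddings is what improves results, and initializing the decoder from a single-image captioning model further reduces hallucinations; the individual-embedding baseline would simply skip the `.mean(dim=1, keepdim=True)` step.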
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model designed for tasks that involve multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- Many-to-many Image Generation with Auto-regressive Diffusion Models [59.5041405824704]
This paper introduces a domain-general framework for many-to-many image generation, capable of producing interrelated image series from a given set of images.
We present MIS, a novel large-scale multi-image dataset, containing 12M synthetic multi-image samples, each with 25 interconnected images.
We learn M2M, an autoregressive model for many-to-many generation, where each image is modeled within a diffusion framework.
arXiv Detail & Related papers (2024-04-03T23:20:40Z)
- Towards Better Multi-modal Keyphrase Generation via Visual Entity Enhancement and Multi-granularity Image Noise Filtering [79.44443231700201]
Multi-modal keyphrase generation aims to produce a set of keyphrases that represent the core points of the input text-image pair.
The input text and image are often not perfectly matched, and thus the image may introduce noise into the model.
We propose a novel multi-modal keyphrase generation model, which not only enriches the model input with external knowledge, but also effectively filters image noise.
arXiv Detail & Related papers (2023-09-09T09:41:36Z)
- MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning [46.4308182215488]
A text-based image intuitively contains abundant and complex multimodal relational content.
We propose the Multimodal relAtional Graph adversarIal inferenCe framework for diverse and unpaired TextCap.
We validate the effectiveness of MAGIC in generating diverse captions from different relational information items of an image.
arXiv Detail & Related papers (2021-12-13T11:00:49Z)
- Meta Internal Learning [88.68276505511922]
Internal learning for single-image generation is a framework in which a generator is trained to produce novel images based on a single image.
We propose a meta-learning approach that enables training over a collection of images, in order to model the internal statistics of the sample image more effectively.
Our results show that the models obtained are as suitable as single-image GANs for many common image applications.
arXiv Detail & Related papers (2021-10-06T16:27:38Z)
- UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation [43.15662489492694]
We propose UniMS, a Unified framework for Multimodal Summarization grounded on BART.
We adopt knowledge distillation from a vision-language pretrained model to improve image selection.
Our best model achieves a new state-of-the-art result on a large-scale benchmark dataset.
arXiv Detail & Related papers (2021-09-13T09:36:04Z)
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z)
- Fine-grained Semantic Constraint in Image Synthesis [8.22379888383833]
We propose a multi-stage and high-resolution model for image synthesis that uses fine-grained attributes and masks as input.
With the mask as a prior, the model is constrained so that the generated images conform to visual common sense.
This paper also proposes a scheme to improve the discriminator of the generative adversarial network by simultaneously discriminating the total image and sub-regions of the image.
arXiv Detail & Related papers (2021-01-12T15:51:49Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The quality of the information is not guaranteed, and the site accepts no responsibility for any consequences arising from its use.