UniMS: A Unified Framework for Multimodal Summarization with Knowledge
Distillation
- URL: http://arxiv.org/abs/2109.05812v1
- Date: Mon, 13 Sep 2021 09:36:04 GMT
- Title: UniMS: A Unified Framework for Multimodal Summarization with Knowledge
Distillation
- Authors: Zhengkun Zhang, Xiaojun Meng, Yasheng Wang, Xin Jiang, Qun Liu,
Zhenglu Yang
- Abstract summary: We propose UniMS, a Unified framework for Multimodal Summarization grounded on BART.
We adopt knowledge distillation from a vision-language pretrained model to improve image selection.
Our best model achieves a new state-of-the-art result on a large-scale benchmark dataset.
- Score: 43.15662489492694
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid increase of multimedia data, a large body of literature has
emerged on multimodal summarization, most of which targets refining salient
information from the textual and visual modalities to output a pictorial summary
with the most relevant images. Existing methods mostly focus on either extractive
or abstractive summarization and rely on qualified image captions to build image
references. We are the first to propose UniMS, a Unified framework for Multimodal
Summarization grounded on BART, that integrates extractive and abstractive
objectives as well as image output selection. Specifically, we adopt knowledge
distillation from a vision-language pretrained model to improve image selection,
which avoids any requirement on the existence and quality of image captions. In
addition, we introduce a visual guided decoder to better integrate the textual and
visual modalities when guiding abstractive text generation. Results show that our
best model achieves a new state-of-the-art result on a large-scale benchmark
dataset. The newly introduced extractive objective and the knowledge distillation
technique are shown to bring a noticeable improvement to the multimodal
summarization task.
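The abstract describes three training signals: an extractive objective, an abstractive generation objective, and knowledge distillation from a vision-language pretrained teacher for image selection. Below is a minimal sketch, not the authors' released code, of how such a multi-task loss could be wired up on top of a BART-style encoder-decoder; all module names, the teacher scoring model (e.g. a CLIP-style relevance scorer), and the equal loss weighting are illustrative assumptions.

```python
# A minimal sketch of a UniMS-style multi-task objective (assumed, not the
# authors' implementation). Shapes and heads are illustrative only.
import torch
import torch.nn.functional as F
from torch import nn


class UniMSStyleLosses(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        # Scores each input sentence for the extractive objective.
        self.extractive_head = nn.Linear(hidden_size, 1)
        # Scores each candidate image for the image-selection objective.
        self.image_head = nn.Linear(hidden_size, 1)

    def forward(
        self,
        sentence_states,       # (batch, num_sents, hidden) encoder states, one per sentence
        image_states,          # (batch, num_images, hidden) encoded image features
        decoder_logits,        # (batch, tgt_len, vocab) abstractive decoder output
        ext_labels,            # (batch, num_sents) 0/1 sentence salience labels
        summary_ids,           # (batch, tgt_len) reference summary token ids
        teacher_image_scores,  # (batch, num_images) relevance scores from a frozen
                               # vision-language teacher, used as the KD target
        temperature: float = 2.0,
    ):
        # 1) Extractive objective: binary salience prediction per sentence.
        ext_logits = self.extractive_head(sentence_states).squeeze(-1)
        loss_ext = F.binary_cross_entropy_with_logits(ext_logits, ext_labels.float())

        # 2) Abstractive objective: token-level cross-entropy against the reference summary.
        loss_abs = F.cross_entropy(
            decoder_logits.view(-1, decoder_logits.size(-1)),
            summary_ids.view(-1),
            ignore_index=-100,
        )

        # 3) Image selection via knowledge distillation: match the student's image
        #    distribution to the teacher's soft relevance scores, so no image
        #    captions are needed as references.
        student_log_probs = F.log_softmax(
            self.image_head(image_states).squeeze(-1) / temperature, dim=-1
        )
        teacher_probs = F.softmax(teacher_image_scores / temperature, dim=-1)
        loss_kd = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

        # Equal weighting is an assumption; in practice these terms may be tuned.
        return loss_ext + loss_abs + loss_kd
```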
Related papers
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Learning Summary-Worthy Visual Representation for Abstractive
Summarization in Video [34.202514532882]
We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary.
arXiv Detail & Related papers (2023-05-08T16:24:46Z) - Summary-Oriented Vision Modeling for Multimodal Abstractive
Summarization [63.320005222549646]
Multimodal abstractive summarization (MAS) aims to produce a concise summary given multimodal data (text and vision).
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-high, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-12-15T09:05:26Z) - Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval-based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image from the knowledge corpus, respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore the semantics available in captions and leverage them to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - Multi-Image Summarization: Textual Summary from a Set of Cohesive Images [17.688344968462275]
This paper proposes the new task of multi-image summarization.
It aims to generate a concise and descriptive textual summary given a coherent set of input images.
A dense average image feature aggregation network allows the model to focus on a coherent subset of attributes.
arXiv Detail & Related papers (2020-06-15T18:45:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.