Summary-Oriented Vision Modeling for Multimodal Abstractive
Summarization
- URL: http://arxiv.org/abs/2212.07672v2
- Date: Thu, 4 May 2023 10:16:30 GMT
- Title: Summary-Oriented Vision Modeling for Multimodal Abstractive
Summarization
- Authors: Yunlong Liang, Fandong Meng, Jinan Xu, Jiaan Wang, Yufeng Chen, Jie
Zhou
- Abstract summary: Multimodal abstractive summarization (MAS) aims to produce a concise summary given the multimodal data (text and vision).
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
- Score: 63.320005222549646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal abstractive summarization (MAS) aims to produce a concise summary
given the multimodal data (text and vision). Existing studies mainly focus on
how to effectively use the visual features from the perspective of an article,
having achieved impressive success on the high-resource English dataset.
However, less attention has been paid to the visual features from the
perspective of the summary, which may limit the model performance, especially
in the low- and zero-resource scenarios. In this paper, we propose to improve
the summary quality through summary-oriented visual features. To this end, we
devise two auxiliary tasks: a vision-to-summary task and a masked image
modeling task. Together with the main summarization task, we optimize the MAS
model via the training objectives of all these tasks. By these means, the MAS
model can be enhanced by capturing the summary-oriented visual features,
thereby yielding more accurate summaries. Experiments on 44 languages, covering
mid-high-, low-, and zero-resource scenarios, verify the effectiveness and
superiority of the proposed approach, which achieves state-of-the-art
performance under all scenarios. Additionally, we will contribute a large-scale
multilingual multimodal abstractive summarization (MM-Sum) dataset.
Related papers
- Leveraging the Power of LLMs: A Fine-Tuning Approach for High-Quality Aspect-Based Summarization [25.052557735932535]
Large language models (LLMs) have demonstrated the potential to revolutionize diverse tasks within natural language processing.
This paper explores the potential of fine-tuning LLMs for the aspect-based summarization task.
We evaluate the impact of fine-tuning open-source foundation LLMs, including Llama2, Mistral, Gemma, and Aya, on a publicly available domain-specific aspect-based summary dataset.
arXiv Detail & Related papers (2024-08-05T16:00:21Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps visual features to probability distributions over a Large Multi-modal Model's (LMM) vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information (a minimal sketch of this mapping appears after this list).
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - Veagle: Advancements in Multimodal Representation Learning [0.0]
This paper introduces a novel approach to enhance the multimodal capabilities of existing models.
Our proposed model, Veagle, incorporates a unique mechanism inspired by the successes and insights of previous works.
Our results indicate an improvement of 5-6% in performance, with Veagle outperforming existing models by a notable margin.
arXiv Detail & Related papers (2024-01-18T12:45:25Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling
for Many-to-Many Multimodal Summarization [113.72253589338472]
The many-to-many multimodal summarization (M$^3$S) task aims to generate summaries in any language given document inputs in any language and the corresponding image sequence.
We propose a dual knowledge distillation and target-oriented vision modeling framework for the M$^3$S task.
arXiv Detail & Related papers (2023-05-22T06:47:35Z) - Learning Summary-Worthy Visual Representation for Abstractive
Summarization in Video [34.202514532882]
We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits the summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary.
arXiv Detail & Related papers (2023-05-08T16:24:46Z) - UniMS: A Unified Framework for Multimodal Summarization with Knowledge
Distillation [43.15662489492694]
We propose UniMS, a Unified framework for Multimodal Summarization grounded on BART.
We adopt knowledge distillation from a vision-language pretrained model to improve image selection.
Our best model achieves a new state-of-the-art result on a large-scale benchmark dataset.
arXiv Detail & Related papers (2021-09-13T09:36:04Z)