D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling
for Many-to-Many Multimodal Summarization
- URL: http://arxiv.org/abs/2305.12767v2
- Date: Mon, 16 Oct 2023 10:04:37 GMT
- Title: D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling
for Many-to-Many Multimodal Summarization
- Authors: Yunlong Liang, Fandong Meng, Jiaan Wang, Jinan Xu, Yufeng Chen, Jie
Zhou
- Abstract summary: The many-to-many multimodal summarization (M$^3$S) task aims to generate summaries in any language given document inputs in any language and the corresponding image sequence.
We propose a dual knowledge distillation and target-oriented vision modeling framework for the M$^3$S task.
- Score: 113.72253589338472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The many-to-many multimodal summarization (M$^3$S) task aims to generate
summaries in any language given document inputs in any language and the
corresponding image sequence; it essentially comprises the multimodal
monolingual summarization (MMS) and multimodal cross-lingual summarization
(MXLS) tasks. Although much work has been devoted to either MMS or MXLS, and
both have received increasing attention in recent years, little research
addresses the M$^3$S task. Moreover, existing studies mainly focus on 1)
utilizing MMS to enhance MXLS via knowledge distillation without considering
the performance of MMS, or 2) improving MMS models by filtering
summary-unrelated visual features with implicit learning or explicit yet
complex training objectives. In this paper, we first introduce a general and
practical task, i.e., M$^3$S. We then propose a dual knowledge distillation and
target-oriented vision modeling framework for the M$^3$S task. Specifically,
the dual knowledge distillation method ensures that the knowledge of MMS and
MXLS is transferred to each other, so that the two tasks mutually promote each
other. To offer target-oriented visual features, a simple yet effective
target-oriented contrastive objective is designed to discard needless visual
information. Extensive experiments in the many-to-many setting demonstrate the
effectiveness of the proposed approach. Additionally, we contribute a
many-to-many multimodal summarization (M$^3$Sum) dataset.
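The abstract describes the two auxiliary training signals only at a high level. Below is a minimal PyTorch-style sketch of how they might look, assuming token-level output distributions from the MMS and MXLS decoders and pooled summary/visual representations; the function names, tensor shapes, temperatures, and loss weights are illustrative assumptions, not taken from the paper or its released code.

# Hypothetical sketch, not the authors' implementation.
import torch
import torch.nn.functional as F

def dual_kd_loss(mms_logits, mxls_logits, temperature=1.0):
    """Bidirectional (dual) knowledge distillation between the MMS and MXLS
    branches: each branch is distilled toward the other's detached output
    distribution, so knowledge flows in both directions rather than only
    MMS -> MXLS. Expected shapes: (batch, seq_len, vocab_size)."""
    t = temperature
    log_p_mms = F.log_softmax(mms_logits / t, dim=-1)
    log_p_mxls = F.log_softmax(mxls_logits / t, dim=-1)
    # KL(MMS || MXLS): the MMS branch acts as teacher for the MXLS branch.
    kd_to_mxls = F.kl_div(log_p_mxls, log_p_mms.detach().exp(), reduction="batchmean")
    # KL(MXLS || MMS): the MXLS branch acts as teacher for the MMS branch.
    kd_to_mms = F.kl_div(log_p_mms, log_p_mxls.detach().exp(), reduction="batchmean")
    return (kd_to_mxls + kd_to_mms) * t * t

def target_oriented_contrastive_loss(summary_repr, visual_repr, temperature=0.07):
    """InfoNCE-style objective that pulls each sample's visual representation
    toward its own target-summary representation and pushes it away from the
    other summaries in the batch, discouraging summary-unrelated visual
    information. Expected shapes: (batch, hidden_dim)."""
    summary_repr = F.normalize(summary_repr, dim=-1)
    visual_repr = F.normalize(visual_repr, dim=-1)
    logits = visual_repr @ summary_repr.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Illustrative total objective (weights assumed): summarization cross-entropy
# for both branches plus the two auxiliary terms.
# loss = ce_mms + ce_mxls + 0.5 * dual_kd_loss(mms_logits, mxls_logits) \
#        + 0.1 * target_oriented_contrastive_loss(summary_repr, visual_repr)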
Related papers
- UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims to link ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z)
- MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI [71.53579367538725]
MMT-Bench is a benchmark designed to assess Large Vision-Language Models (LVLMs) across massive multimodal tasks.
MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios.
arXiv Detail & Related papers (2024-04-24T17:37:05Z)
- M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment [0.0]
This paper introduces the M&M model, a novel multimodal-multitask learning framework, applied to the AVCAffe dataset for cognitive load assessment.
M&M uniquely integrates audiovisual cues through a dual-pathway architecture, featuring specialized streams for audio and video inputs.
A key innovation lies in its cross-modality multihead attention mechanism, fusing the different modalities for synchronized multitasking.
arXiv Detail & Related papers (2024-03-14T14:49:40Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even solving tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images.
We hope our work may cast a light on the exploration of joint mixing in future MLLM research.
arXiv Detail & Related papers (2023-11-13T18:59:47Z)
- MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [42.68425777473114]
Vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity.
We introduce vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach to allow the VLM to deal with multi-modal inputs efficiently.
Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks.
arXiv Detail & Related papers (2023-09-14T17:59:17Z)
- Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization [63.320005222549646]
Multimodal abstractive summarization (MAS) aims to produce a concise summary given multimodal data (text and vision).
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-12-15T09:05:26Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.