Align and Attend: Multimodal Summarization with Dual Contrastive Losses
- URL: http://arxiv.org/abs/2303.07284v3
- Date: Mon, 12 Jun 2023 18:13:44 GMT
- Title: Align and Attend: Multimodal Summarization with Dual Contrastive Losses
- Authors: Bo He, Jun Wang, Jielin Qiu, Trung Bui, Abhinav Shrivastava, Zhaowen
Wang
- Abstract summary: The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
- Score: 57.83012574678091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of multimodal summarization is to extract the most important
information from different modalities to form output summaries. Unlike the
unimodal summarization, the multimodal summarization task explicitly leverages
cross-modal information to help generate more reliable and high-quality
summaries. However, existing methods fail to leverage the temporal
correspondence between different modalities and ignore the intrinsic
correlation between different samples. To address this issue, we introduce
Align and Attend Multimodal Summarization (A2Summ), a unified multimodal
transformer-based model which can effectively align and attend the multimodal
input. In addition, we propose two novel contrastive losses to model both
inter-sample and intra-sample correlations. Extensive experiments on two
standard video summarization datasets (TVSum and SumMe) and two multimodal
summarization datasets (Daily Mail and CNN) demonstrate the superiority of
A2Summ, achieving state-of-the-art performances on all datasets. Moreover, we
collected a large-scale multimodal summarization dataset BLiSS, which contains
livestream videos and transcribed texts with annotated summaries. Our code and
dataset are publicly available at https://boheumd.github.io/A2Summ/.
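The abstract names two contrastive objectives (inter-sample and intra-sample) but does not give their formulation. The sketch below is a minimal, hypothetical PyTorch rendering of how such dual losses are commonly built: a symmetric InfoNCE loss across a batch for the inter-sample term, and a frame-to-sentence InfoNCE loss within one sample for the intra-sample term. The function names, tensor shapes, temperature, and alignment indices are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of dual contrastive losses (NOT the A2Summ code).
import torch
import torch.nn.functional as F


def inter_sample_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE across the batch: video/text embeddings of the same
    sample are positives, all other pairings are negatives.
    video_emb, text_emb: (B, D) pooled per-sample features (assumed shapes)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))


def intra_sample_loss(frame_emb: torch.Tensor, sent_emb: torch.Tensor,
                      align_idx: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """Within one sample, contrast each frame against its temporally aligned
    sentence (positive) versus all other sentences (negatives).
    frame_emb: (N, D), sent_emb: (M, D), align_idx: (N,) sentence index per frame."""
    f = F.normalize(frame_emb, dim=-1)
    s = F.normalize(sent_emb, dim=-1)
    logits = f @ s.T / temperature                      # (N, M)
    return F.cross_entropy(logits, align_idx)


if __name__ == "__main__":
    # Toy usage with random features; real features would come from the encoder.
    B, D, N, M = 4, 256, 10, 6
    total = (inter_sample_loss(torch.randn(B, D), torch.randn(B, D)) +
             intra_sample_loss(torch.randn(N, D), torch.randn(M, D),
                               torch.randint(0, M, (N,))))
    print(f"dual contrastive loss: {total.item():.4f}")
```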
Related papers
- SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization [19.190627262112486]
Extreme Multimodal Summarization with Multimodal Output (XMSMO) has become an attractive summarization approach.
Existing methods overlook the issue that multimodal data often contains topic-irrelevant information.
We propose SITransformer, a Shared Information-guided Transformer for extreme multimodal summarization.
arXiv Detail & Related papers (2024-08-28T14:44:42Z)
- I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction [10.684005956288347]
We present the Intra- and Inter-Sample Relationship Modeling (I2SRM) method for this task.
Our proposed method achieves competitive results, 77.12% F1-score on Twitter-2015, 88.40% F1-score on Twitter-2017, and 84.12% F1-score on MNRE.
arXiv Detail & Related papers (2023-10-10T05:50:25Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos [106.06278332186106]
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction.
Numerous limitations exist within existing public MSMO datasets.
We have meticulously curated the MMSum dataset.
arXiv Detail & Related papers (2023-06-07T07:43:11Z)
- Few-shot Multimodal Sentiment Analysis based on Multimodal Probabilistic Fusion Prompts [30.15646658460899]
Multimodal sentiment analysis has gained significant attention due to the proliferation of multimodal content on social media.
Existing studies in this area rely heavily on large-scale supervised data, which is time-consuming and labor-intensive to collect.
We propose a novel method called Multimodal Probabilistic Fusion Prompts (MultiPoint) that leverages diverse cues from different modalities for multimodal sentiment detection in the few-shot scenario.
arXiv Detail & Related papers (2022-11-12T08:10:35Z)
- TLDW: Extreme Multimodal Summarisation of News Videos [76.50305095899958]
We introduce eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR.
XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary.
Our method is trained without reference summaries by optimising visual and textual coverage, measured as the distance between semantic distributions under optimal transport plans.
arXiv Detail & Related papers (2022-10-16T08:19:59Z)
- Hierarchical3D Adapters for Long Video-to-text Summarization [79.01926022762093]
Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
arXiv Detail & Related papers (2022-10-10T16:44:36Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)