MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
- URL: http://arxiv.org/abs/2306.04216v2
- Date: Sun, 19 Nov 2023 05:09:42 GMT
- Title: MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
- Authors: Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal,
Claire Jin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Ding Zhao, Bo Li,
Lijuan Wang
- Abstract summary: Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction.
Numerous limitations exist within existing public MSMO datasets.
We have meticulously curated the MMSum dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal summarization with multimodal output (MSMO) has emerged as a
promising research direction. Nonetheless, numerous limitations exist within
existing public MSMO datasets, including insufficient maintenance, data
inaccessibility, limited size, and the absence of proper categorization, which
pose significant challenges. To address these challenges and provide a
comprehensive dataset for this new direction, we have meticulously curated the
MMSum dataset. Our new dataset features (1) human-validated summaries for
both video and textual content, providing superior human instruction and
labels for multimodal learning; (2) comprehensively and meticulously arranged
categorization, spanning 17 principal categories and 170 subcategories to
encapsulate a diverse array of real-world scenarios; and (3) benchmark tests
performed on the proposed dataset to assess various tasks and methods,
including video summarization, text summarization, and multimodal
summarization. To champion accessibility and collaboration, we will release
the MMSum dataset and the data collection tool as fully open-source
resources, fostering transparency and accelerating future developments. Our
project website can be found at https://mmsum-dataset.github.io/
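To make the dataset's structure concrete, the sketch below walks a few toy records in a hypothetical MMSum-style layout: each video carries a category/subcategory tag plus human-validated textual and visual summaries. The field names and example values are assumptions for illustration only; the actual schema is defined by the open-source release on the project website.
```python
from collections import Counter

# Toy records in a hypothetical MMSum-style layout. Field names are
# assumptions for illustration, not the actual MMSum schema.
videos = [
    {"video_id": "v001", "category": "Cooking", "subcategory": "Baking",
     "text_summary": "How to bake sourdough bread at home.",
     "keyframe_paths": ["v001/kf_01.jpg", "v001/kf_02.jpg"]},
    {"video_id": "v002", "category": "Sports", "subcategory": "Climbing",
     "text_summary": "Basic rope techniques for beginner climbers.",
     "keyframe_paths": ["v002/kf_01.jpg"]},
]

# The full dataset spans 17 principal categories and 170 subcategories;
# counting per (category, subcategory) pair recovers that hierarchy.
hierarchy = Counter((v["category"], v["subcategory"]) for v in videos)
for (cat, sub), n in hierarchy.items():
    print(f"{cat} / {sub}: {n} video(s)")

# Each record pairs a textual summary with keyframes (thumbnail candidates),
# i.e., the multimodal output that MSMO models are trained to produce.
for v in videos:
    print(v["video_id"], "->", v["text_summary"],
          "|", len(v["keyframe_paths"]), "keyframe(s)")
```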
Related papers
- UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos
We focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV).
We propose UBiSS, a unified framework for the BiSSV task, which models the saliency information in the video and generates a TM-summary and a VM-summary simultaneously.
Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines.
arXiv Detail & Related papers (2024-06-24T03:55:25Z)
- 3AM: An Ambiguity-Aware Multi-Modal Machine Translation Dataset
We introduce 3AM, an ambiguity-aware MMT dataset comprising 26,000 parallel sentence pairs in English and Chinese.
Our dataset is specifically designed to include more ambiguity and a greater variety of both captions and images than other MMT datasets.
Experimental results show that MMT models trained on our dataset exhibit a greater ability to exploit visual information than those trained on other MMT datasets.
arXiv Detail & Related papers (2024-04-29T04:01:30Z)
- Factorized Contrastive Learning: Going Beyond Multi-view Redundancy
This paper proposes FactorCL, a new multimodal representation learning method to go beyond multi-view redundancy.
On large-scale real-world datasets, FactorCL captures both shared and unique information and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-06-08T15:17:04Z)
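As background on the contrastive objectives underlying FactorCL, and the dual contrastive losses of A2Summ in the next entry, the sketch below implements a generic InfoNCE loss over paired multimodal embeddings. It is a standard textbook formulation for illustration, not the authors' FactorCL factorization; the embedding shapes and temperature are arbitrary.
```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Generic InfoNCE loss over paired embeddings (illustration only).

    z_a, z_b: (n, d) arrays of L2-normalized embeddings from two
    modalities, where row i of z_a is paired with row i of z_b.
    """
    logits = z_a @ z_b.T / temperature           # (n, n) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: each sample should match its own pair,
    # while the other rows in the batch act as negatives.
    return -np.mean(np.diag(log_softmax))

# Toy usage: 4 paired video/text embeddings of dimension 8.
rng = np.random.default_rng(0)
z_v = rng.normal(size=(4, 8))
z_v /= np.linalg.norm(z_v, axis=1, keepdims=True)
z_t = rng.normal(size=(4, 8))
z_t /= np.linalg.norm(z_t, axis=1, keepdims=True)
print(info_nce(z_v, z_t))
```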
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization
Text Summarization is a popular task and an active area of research for the Natural Language Processing community.
However, publicly available summarization datasets provide plain text content only.
We present LoRaLay, a collection of datasets for long-range summarization with accompanying visual/layout information.
arXiv Detail & Related papers (2023-01-26T18:50:54Z)
- MACSA: A Multimodal Aspect-Category Sentiment Analysis Dataset with Multimodal Fine-grained Aligned Annotations
We propose a new dataset, the Multimodal Aspect-Category Sentiment Analysis (MACSA) dataset, which contains more than 21K text-image pairs.
Based on our dataset, we propose the Multimodal ACSA task and a multimodal graph-based aligned model (MGAM), which adopts a fine-grained cross-modal fusion method.
arXiv Detail & Related papers (2022-06-28T12:49:16Z)
- See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations at well-known academic conferences such as NDSS, ICML, and NeurIPS.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)
- The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements
We present MuSe-CaR, a first-of-its-kind multimodal dataset.
The data is publicly available as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge.
arXiv Detail & Related papers (2021-01-15T10:40:37Z)
- SupMMD: A Sentence Importance Model for Extractive Summarization using Maximum Mean Discrepancy
SupMMD is a novel technique for generic and update summarization based on the maximum mean discrepancy (MMD) from kernel two-sample testing.
We show the efficacy of SupMMD in both generic and update summarization tasks by meeting or exceeding the current state-of-the-art on the DUC-2004 and TAC-2009 datasets.
arXiv Detail & Related papers (2020-10-06T09:26:55Z)
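The statistic at the core of SupMMD is easy to state: the squared maximum mean discrepancy between two samples under a kernel. The sketch below computes a (biased) RBF-kernel estimate of MMD^2; it illustrates the two-sample quantity itself, not the authors' sentence-importance model, and the gamma value is arbitrary.
```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2), computed pairwise.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    """
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2 * rbf_kernel(x, y, gamma).mean())

# Toy usage: samples from two Gaussians with different means should give
# a noticeably positive MMD^2; identical distributions give values near 0.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(100, 2))
y = rng.normal(1.0, 1.0, size=(100, 2))
print(mmd2(x, y))
```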