See, Hear, Read: Leveraging Multimodality with Guided Attention for
Abstractive Text Summarization
- URL: http://arxiv.org/abs/2105.09601v1
- Date: Thu, 20 May 2021 08:56:33 GMT
- Title: See, Hear, Read: Leveraging Multimodality with Guided Attention for
Abstractive Text Summarization
- Authors: Yash Kumar Atri, Shraman Pramanick, Vikram Goyal, Tanmoy Chakraborty
- Abstract summary: We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within the various input modalities for the text summarization task.
- Score: 14.881597737762316
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, abstractive text summarization with multimodal inputs has
started drawing attention due to its ability to accumulate information from
different source modalities and generate a fluent textual summary. However,
existing methods use short videos as the visual modality and short summaries as
the ground truth, and therefore perform poorly on lengthy videos and long
ground-truth summaries. Additionally, there exists no benchmark dataset to
generalize this task to videos of varying lengths. In this paper, we introduce
AVIATE, the first large-scale dataset for abstractive text summarization with
videos of diverse duration, compiled from presentations in well-known academic
conferences like NDSS, ICML, NeurIPS, etc. We use the abstract of corresponding
research papers as the reference summaries, which ensures adequate quality and
uniformity of the ground truth. We then propose a factorized multi-modal
Transformer-based decoder-only language model, which inherently captures the
intra-modal and inter-modal dynamics within the various input modalities for
the text summarization task. The proposed model utilizes an increasing number
of self-attention layers to capture multimodality and performs significantly
better than traditional encoder-decoder-based networks. Extensive experiments
illustrate that the model achieves significant improvement over the baselines
in both qualitative and quantitative evaluations on the existing How2 dataset
for short videos and the newly introduced AVIATE dataset for videos of diverse
duration, beating the best baseline on the two datasets by $1.39$ and $2.74$
ROUGE-L points, respectively.
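The abstract describes a decoder-only model that factorizes attention into intra-modal and inter-modal components. As a rough illustration only, the PyTorch sketch below shows one way such a factorized block could be wired; the modality names, dimensions, residual layout, and text-centric fusion are assumptions made for the sketch, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class FactorizedMultimodalBlock(nn.Module):
    """Intra-modal self-attention per modality, then inter-modal attention
    that lets the text stream attend to the audio and visual streams.
    (Illustrative sketch, not the paper's exact model.)"""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        # One self-attention module per modality captures intra-modal dynamics.
        self.intra = nn.ModuleDict({
            m: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for m in ("text", "audio", "video")
        })
        # Cross-attention from text to the other modalities captures
        # inter-modal dynamics.
        self.inter = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, feats):
        # feats: dict of (batch, seq_len_m, d_model) tensors, one per modality.
        intra_out = {}
        for m, x in feats.items():
            h, _ = self.intra[m](x, x, x)              # intra-modal attention
            intra_out[m] = self.norm1(x + h)
        text = intra_out["text"]
        context = torch.cat([intra_out["audio"], intra_out["video"]], dim=1)
        h, _ = self.inter(text, context, context)      # inter-modal attention
        text = self.norm2(text + h)
        return text + self.ffn(text)

# Random features stand in for extracted modality embeddings.
feats = {"text": torch.randn(2, 40, 512),
         "audio": torch.randn(2, 30, 512),
         "video": torch.randn(2, 25, 512)}
fused = FactorizedMultimodalBlock()(feats)             # shape: (2, 40, 512)
```

Stacking several such blocks, with deeper blocks adding further self-attention over the fused sequence, is one plausible reading of "an increasing number of self-attention layers"; the paper itself should be consulted for the actual design.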
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos [52.161513027831646]
We focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV).
We propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously.
Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines.
arXiv Detail & Related papers (2024-06-24T03:55:25Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video [34.202514532882]
We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits the summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary.
arXiv Detail & Related papers (2023-05-08T16:24:46Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input (a generic sketch of such an alignment loss follows this entry).
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
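The A2Summ entry above mentions aligning temporally corresponding video and text features with dual contrastive losses. As a hedged, generic illustration only (not the paper's actual objective), a symmetric InfoNCE-style loss over aligned segment pairs could look like this in PyTorch; `alignment_contrastive_loss` and its arguments are hypothetical names used for the sketch.

```python
import torch
import torch.nn.functional as F

def alignment_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """video_feats, text_feats: (n, d) features for n temporally aligned pairs."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature           # (n, n) cosine similarities
    targets = torch.arange(v.size(0))          # matched pairs lie on the diagonal
    # Symmetric cross-entropy over video-to-text and text-to-video directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = alignment_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```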
- TLDW: Extreme Multimodal Summarisation of News Videos [76.50305095899958]
We introduce eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR.
XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary.
Our method is trained without reference summaries by optimising visual and textual coverage in terms of the distance between semantic distributions under optimal transport plans (a generic optimal-transport sketch follows this entry).
arXiv Detail & Related papers (2022-10-16T08:19:59Z)
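The XMSMO entry above describes optimising coverage via distances between semantic distributions under optimal transport plans. The sketch below shows a minimal entropic (Sinkhorn) optimal-transport distance between two feature sets, offered purely as background for that idea; it is an assumption-laden simplification, not the paper's training objective.

```python
import torch

def sinkhorn_distance(x, y, eps=0.1, n_iters=50):
    """x: (n, d), y: (m, d) L2-normalized features; returns a scalar OT cost."""
    cost = 1.0 - x @ y.t()                          # cosine cost matrix, shape (n, m)
    a = torch.full((x.size(0),), 1.0 / x.size(0))   # uniform source weights
    b = torch.full((y.size(0),), 1.0 / y.size(0))   # uniform target weights
    K = torch.exp(-cost / eps)                      # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):                        # Sinkhorn scaling iterations
        u = a / (K @ (b / (K.t() @ u)))
    v = b / (K.t() @ u)
    plan = torch.diag(u) @ K @ torch.diag(v)        # approximate transport plan
    return (plan * cost).sum()

# Example: distance between normalized summary features and video features.
x = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
y = torch.nn.functional.normalize(torch.randn(12, 128), dim=-1)
print(sinkhorn_distance(x, y))
```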
- Hierarchical3D Adapters for Long Video-to-text Summarization [79.01926022762093]
Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
arXiv Detail & Related papers (2022-10-10T16:44:36Z)
- MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention [5.584060970507506]
This paper presents MAST, a new model for Multimodal Abstractive Text Summarization.
We examine the usefulness and challenges of deriving information from the audio modality.
We present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges.
arXiv Detail & Related papers (2020-10-15T21:08:20Z)
- Multi-modal Summarization for Video-containing Documents [23.750585762568665]
We propose a novel multi-modal summarization task to summarize from a document and its associated video.
Comprehensive experiments show that the proposed model is beneficial for multi-modal summarization and superior to existing methods.
arXiv Detail & Related papers (2020-09-17T02:13:14Z)