MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical
Attention
- URL: http://arxiv.org/abs/2010.08021v1
- Date: Thu, 15 Oct 2020 21:08:20 GMT
- Title: MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical
Attention
- Authors: Aman Khullar, Udit Arora
- Abstract summary: This paper presents MAST, a new model for Multimodal Abstractive Text Summarization.
We examine the usefulness and challenges of deriving information from the audio modality.
We present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges.
- Score: 5.584060970507506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents MAST, a new model for Multimodal Abstractive Text
Summarization that utilizes information from all three modalities -- text,
audio and video -- in a multimodal video. Prior work on multimodal abstractive
text summarization only utilized information from the text and video
modalities. We examine the usefulness and challenges of deriving information
from the audio modality and present a sequence-to-sequence trimodal
hierarchical attention-based model that overcomes these challenges by letting
the model pay more attention to the text modality. MAST outperforms the current
state-of-the-art (video-text) model by 2.51 points in terms of Content F1 score
and 1.00 points in terms of Rouge-L score on the How2 dataset for multimodal
language understanding.
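Below is a minimal sketch of how such trimodal hierarchical attention can be wired up, assuming a two-level additive-attention scheme: attention within each modality's encoder states first, then attention over the three resulting modality contexts, with a hypothetical bias term standing in for the mechanism that lets the decoder favour the text modality. Module names, dimensions, and the bias are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrimodalHierarchicalAttention(nn.Module):
    """Two-level attention over text, audio and video encoder states (sketch)."""

    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        # Level 1: one additive-attention scorer per modality (text, audio, video).
        self.modality_scorers = nn.ModuleList([
            nn.Sequential(nn.Linear(enc_dim + dec_dim, attn_dim), nn.Tanh(),
                          nn.Linear(attn_dim, 1))
            for _ in range(3)
        ])
        # Level 2: attention over the three per-modality context vectors.
        self.level2_scorer = nn.Sequential(
            nn.Linear(enc_dim + dec_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))

    def forward(self, dec_state, text_enc, audio_enc, video_enc, text_bias=1.0):
        # dec_state: (B, dec_dim); each *_enc: (B, T_m, enc_dim)
        contexts = []
        for scorer, enc in zip(self.modality_scorers, (text_enc, audio_enc, video_enc)):
            q = dec_state.unsqueeze(1).expand(-1, enc.size(1), -1)          # (B, T_m, dec_dim)
            scores = scorer(torch.cat([enc, q], dim=-1)).squeeze(-1)        # (B, T_m)
            alpha = F.softmax(scores, dim=-1)
            contexts.append(torch.bmm(alpha.unsqueeze(1), enc).squeeze(1))  # (B, enc_dim)

        ctx = torch.stack(contexts, dim=1)                                  # (B, 3, enc_dim)
        q2 = dec_state.unsqueeze(1).expand(-1, 3, -1)
        scores2 = self.level2_scorer(torch.cat([ctx, q2], dim=-1)).squeeze(-1)  # (B, 3)
        # Hypothetical additive bias nudging attention toward the text modality
        # (index 0); the paper's actual mechanism for favouring text may differ.
        bias = torch.zeros_like(scores2)
        bias[:, 0] = text_bias
        beta = F.softmax(scores2 + bias, dim=-1)
        return torch.bmm(beta.unsqueeze(1), ctx).squeeze(1)                 # (B, enc_dim)
```

In a sequence-to-sequence setup, the returned context vector would be concatenated with the decoder state at each step to predict the next summary token.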
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video [34.202514532882]
We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits the summary-worthy information from both the cross-modal transcript data and the knowledge that distills from the pseudo summary.
arXiv Detail & Related papers (2023-05-08T16:24:46Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend to the multimodal input (a generic sketch of such cross-modal contrastive alignment follows this list).
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- TLDW: Extreme Multimodal Summarisation of News Videos [76.50305095899958]
We introduce eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR.
XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary.
Our method is trained without reference summaries by optimising visual and textual coverage, measured as the distance between semantic distributions under optimal transport plans.
arXiv Detail & Related papers (2022-10-16T08:19:59Z)
- Hierarchical3D Adapters for Long Video-to-text Summarization [79.01926022762093]
Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
arXiv Detail & Related papers (2022-10-10T16:44:36Z)
- M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval [34.343617836027725]
We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and the content of each modality in videos.
Our framework provides two kinds of training strategies, including an ensemble manner and an end-to-end manner.
arXiv Detail & Related papers (2022-08-16T10:51:37Z)
- See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized multi-modal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within the various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied to video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
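For the alignment-based entries above (e.g. A2Summ, as noted in that item), a common building block is a symmetric cross-modal contrastive loss over time-aligned segment embeddings. The sketch below is a generic InfoNCE-style objective, not the paper's actual dual (inter- and intra-sample) losses; the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """video_feats, text_feats: (N, D) embeddings of N temporally aligned segments."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.t() / temperature                      # (N, N) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)    # matching pairs lie on the diagonal
    loss_v2t = F.cross_entropy(logits, targets)           # video -> text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)       # text -> video direction
    return 0.5 * (loss_v2t + loss_t2v)
```

The symmetric form pulls matching video-text segment pairs together while pushing apart mismatched pairs within the same batch.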
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.