Learning Summary-Worthy Visual Representation for Abstractive
Summarization in Video
- URL: http://arxiv.org/abs/2305.04824v1
- Date: Mon, 8 May 2023 16:24:46 GMT
- Title: Learning Summary-Worthy Visual Representation for Abstractive
Summarization in Video
- Authors: Zenan Xu, Xiaojun Meng, Yasheng Wang, Qinliang Su, Zexuan Qiu, Xin
Jiang, Qun Liu
- Abstract summary: We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits the summary-worthy information from both the cross-modal transcript data and the knowledge that distills from the pseudo summary.
- Score: 34.202514532882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal abstractive summarization for videos (MAS) requires generating a
concise textual summary to describe the highlights of a video according to
multimodal resources, in our case, the video content and its transcript.
Inspired by the success of the large-scale generative pre-trained language
model (GPLM) in generating high-quality textual content (e.g., summary), recent
MAS methods have proposed to adapt the GPLM to this task by equipping it with
the visual information, which is often obtained through a general-purpose
visual feature extractor. However, the generally extracted visual features may
overlook some summary-worthy visual information, which impedes model
performance. In this work, we propose a novel approach to learning the
summary-worthy visual representation that facilitates abstractive
summarization. Our method exploits the summary-worthy information from both the
cross-modal transcript data and the knowledge that distills from the pseudo
summary. Extensive experiments on three public multimodal datasets show that
our method outperforms all competing baselines. Furthermore, with the
advantages of summary-worthy visual information, our model can have a
significant improvement on small datasets or even datasets with limited
training data.
Related papers
- Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling.
Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z) - UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos [52.161513027831646]
We focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV)
We propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously.
Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines.
arXiv Detail & Related papers (2024-06-24T03:55:25Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - Incorporating Visual Experts to Resolve the Information Loss in
Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z) - Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z) - Summary-Oriented Vision Modeling for Multimodal Abstractive
Summarization [63.320005222549646]
Multimodal abstractive summarization (MAS) aims to produce a concise summary given the multimodal data (text and vision)
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-high, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-12-15T09:05:26Z) - Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z) - StreamHover: Livestream Transcript Summarization and Annotation [54.41877742041611]
We present StreamHover, a framework for annotating and summarizing livestream transcripts.
With a total of over 500 hours of videos annotated with both extractive and abstractive summaries, our benchmark dataset is significantly larger than currently existing annotated corpora.
We show that our model generalizes better and improves performance over strong baselines.
arXiv Detail & Related papers (2021-09-11T02:19:37Z) - See, Hear, Read: Leveraging Multimodality with Guided Attention for
Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose name, a factorized multi-modal Transformer based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
arXiv Detail & Related papers (2021-05-20T08:56:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.