A Modular Approach for Multimodal Summarization of TV Shows
- URL: http://arxiv.org/abs/2403.03823v9
- Date: Thu, 22 Aug 2024 10:00:53 GMT
- Title: A Modular Approach for Multimodal Summarization of TV Shows
- Authors: Louis Mahon, Mirella Lapata
- Abstract summary: We present a modular approach where separate components perform specialized sub-tasks.
Our modules involve detecting scene boundaries, reordering scenes so as to minimize the number of cuts between different events, converting visual information to text, summarizing the dialogue in each scene, and fusing the scene summaries into a final summary for the entire episode.
We also present a new metric, PRISMA, to measure both precision and recall of generated summaries, which we decompose into atomic facts.
- Score: 55.20132267309382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we address the task of summarizing television shows, which touches key areas in AI research: complex reasoning, multiple modalities, and long narratives. We present a modular approach where separate components perform specialized sub-tasks which we argue affords greater flexibility compared to end-to-end methods. Our modules involve detecting scene boundaries, reordering scenes so as to minimize the number of cuts between different events, converting visual information to text, summarizing the dialogue in each scene, and fusing the scene summaries into a final summary for the entire episode. We also present a new metric, PRISMA (Precision and Recall EvaluatIon of Summary FActs), to measure both precision and recall of generated summaries, which we decompose into atomic facts. Tested on the recently released SummScreen3D dataset, our method produces higher quality summaries than comparison models, as measured with ROUGE and our new fact-based metric, and as assessed by human evaluators.
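The fact-based evaluation idea in the abstract (decompose summaries into atomic facts, then measure precision and recall over those facts) can be illustrated with a minimal sketch. This is not the PRISMA implementation: the abstract does not specify how facts are extracted or checked, so `split_into_facts` (one fact per sentence) and `is_supported` (word overlap) are hypothetical toy stand-ins, and `fact_precision_recall` is an illustrative name. In practice both steps would likely need a much stronger model.

```python
from typing import Callable, List, Tuple


def split_into_facts(summary: str) -> List[str]:
    """Toy stand-in (assumption): treat each sentence as one atomic fact."""
    return [s.strip() for s in summary.split(".") if s.strip()]


def is_supported(fact: str, text: str) -> bool:
    """Toy stand-in (assumption): a fact counts as supported when every
    word in it also occurs somewhere in the other text."""
    text_words = set(text.lower().replace(".", " ").split())
    return all(word in text_words for word in fact.lower().split())


def fact_precision_recall(
    generated: str,
    reference: str,
    extract: Callable[[str], List[str]] = split_into_facts,
    supported: Callable[[str, str], bool] = is_supported,
) -> Tuple[float, float]:
    """Precision: share of generated facts supported by the reference.
    Recall: share of reference facts supported by the generated summary."""
    gen_facts = extract(generated)
    ref_facts = extract(reference)
    precision = (
        sum(supported(f, reference) for f in gen_facts) / len(gen_facts)
        if gen_facts else 0.0
    )
    recall = (
        sum(supported(f, generated) for f in ref_facts) / len(ref_facts)
        if ref_facts else 0.0
    )
    return precision, recall


# Hypothetical example summaries, invented for illustration only.
generated = "Anna leaves town. Ben buys a car."
reference = "Anna leaves town. Ben sells the house. Carl returns."
precision, recall = fact_precision_recall(generated, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Passing `extract` and `supported` as parameters keeps the scoring logic separate from the fact-extraction and fact-checking steps, so either can be swapped for a stronger component without touching the arithmetic.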
Related papers
- Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling.
Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z)
- Hierarchical3D Adapters for Long Video-to-text Summarization [79.01926022762093]
Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
arXiv Detail & Related papers (2022-10-10T16:44:36Z)
- ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining [61.82562838486632]
We crowdsource four new datasets on diverse online conversation forms of news comments, discussion forums, community question answering forums, and email threads.
We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data.
arXiv Detail & Related papers (2021-06-01T22:17:13Z)
- Controllable Abstractive Dialogue Summarization with Sketch Supervision [56.59357883827276]
Our model achieves state-of-the-art performance on the largest dialogue summarization corpus SAMSum, with as high as 50.79 in ROUGE-L score.
arXiv Detail & Related papers (2021-05-28T19:05:36Z)
- How Good is a Video Summary? A New Benchmarking Dataset and Evaluation Framework Towards Realistic Video Summarization [11.320914099324492]
We introduce a new benchmarking video dataset called VISIOCITY, which comprises longer videos across six different categories.
We show strategies to automatically generate multiple reference summaries from indirect ground truth present in VISIOCITY.
We propose an evaluation framework for better quantitative assessment of summary quality which is closer to human judgment.
arXiv Detail & Related papers (2021-01-26T01:42:55Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
- Realistic Video Summarization through VISIOCITY: A New Benchmark and Evaluation Framework [15.656965429236235]
We take steps towards making automatic video summarization more realistic by addressing several challenges.
Firstly, the currently available datasets either have very short videos or have few long videos of only a particular type.
We introduce a new benchmarking dataset, VISIOCITY, which comprises longer videos across six different categories.
arXiv Detail & Related papers (2020-07-29T02:44:35Z)
- Screenplay Summarization Using Latent Narrative Structure [78.45316339164133]
We propose to explicitly incorporate the underlying structure of narratives into general unsupervised and supervised extractive summarization models.
We formalize narrative structure in terms of key narrative events (turning points) and treat it as latent in order to summarize screenplays.
Experimental results on the CSI corpus of TV screenplays, which we augment with scene-level summarization labels, show that latent turning points correlate with important aspects of a CSI episode.
arXiv Detail & Related papers (2020-04-27T11:54:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.