What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
- URL: http://arxiv.org/abs/2502.08279v2
- Date: Mon, 17 Feb 2025 12:01:02 GMT
- Title: What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations
- Authors: Dongqi Liu, Chenxi Whitehouse, Xi Yu, Louis Mahon, Rohit Saxena, Zheng Zhao, Yifu Qiu, Mirella Lapata, Vera Demberg
- Abstract summary: This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains.
We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts.
- Score: 47.79536652721794
- License:
- Abstract: Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of scientific video summarization.
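The plan-based framework mentioned in the abstract follows the general plan-then-generate idea: draft an explicit outline of the abstract first, then generate the summary conditioned on that outline. The sketch below is a minimal illustration of that control flow only, not the authors' implementation; `call_model`, both prompts, and the section ordering are hypothetical stand-ins for whatever multimodal or text backend is actually benchmarked.

```python
# Minimal plan-then-generate sketch. NOT the VISTA authors' code:
# `call_model`, the prompts, and the plan structure are illustrative assumptions.

PLAN_PROMPT = (
    "From the talk transcript and slide text below, list the key points to "
    "cover, one per line, ordered as: background, problem, method, results, "
    "conclusion."
)

ABSTRACT_PROMPT = (
    "Write a single-paragraph paper abstract for the talk below, covering "
    "every point in the plan, in order."
)


def call_model(prompt: str, context: str) -> str:
    """Hypothetical stand-in for a call to a video-language or text LLM."""
    return f"<model output for: {prompt[:40]}...>"


def plan_based_summary(talk_context: str) -> str:
    # Stage 1: draft an explicit plan that mirrors the structure of abstracts.
    plan = call_model(PLAN_PROMPT, talk_context)
    # Stage 2: generate the abstract conditioned on both the plan and the talk.
    return call_model(ABSTRACT_PROMPT, f"PLAN:\n{plan}\n\nTALK:\n{talk_context}")


if __name__ == "__main__":
    print(plan_based_summary("ASR transcript and slide text of a talk ..."))
```

The point of the explicit intermediate plan is that the generation stage can be conditioned on, and checked against, a structured outline; this is the kind of explicit planning the paper credits with improving summary quality and factual consistency.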
Related papers
- Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling.
Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z) - Enhancing Video Summarization with Context Awareness [9.861215740353247]
Video summarization automatically generates concise summaries by selecting the shots or segments that capture a video's essence.
Despite the importance of video summarization, there is a lack of diverse and representative datasets.
We propose an unsupervised approach that leverages video data structure and information for generating informative summaries.
arXiv Detail & Related papers (2024-04-06T09:08:34Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z) - Learning Summary-Worthy Visual Representation for Abstractive
Summarization in Video [34.202514532882]
We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary.
arXiv Detail & Related papers (2023-05-08T16:24:46Z) - Discourse Analysis for Evaluating Coherence in Video Paragraph Captions [99.37090317971312]
We explore a novel discourse-based framework for evaluating the coherence of video paragraphs.
Central to our approach is a discourse representation of videos, which helps model the coherence of paragraphs conditioned on the coherence of the corresponding videos.
Experimental results show that the proposed framework evaluates the coherence of video paragraphs significantly better than all baseline methods.
arXiv Detail & Related papers (2022-01-17T04:23:08Z) - ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive
Summarization with Argument Mining [61.82562838486632]
We crowdsource four new datasets covering diverse online conversation forms: news comments, discussion forums, community question-answering forums, and email threads.
We benchmark state-of-the-art models on our datasets and analyze characteristics associated with the data.
arXiv Detail & Related papers (2021-06-01T22:17:13Z) - How Good is a Video Summary? A New Benchmarking Dataset and Evaluation
Framework Towards Realistic Video Summarization [11.320914099324492]
We introduce a new benchmarking video dataset called VISIOCITY, which comprises longer videos across six different categories.
We show strategies to automatically generate multiple reference summaries from indirect ground truth present in VISIOCITY.
We propose an evaluation framework for quantitative assessment of summary quality that aligns more closely with human judgment.
arXiv Detail & Related papers (2021-01-26T01:42:55Z) - Realistic Video Summarization through VISIOCITY: A New Benchmark and
Evaluation Framework [15.656965429236235]
We take steps towards making automatic video summarization more realistic by addressing several challenges.
Firstly, currently available datasets either contain very short videos or only a few long videos of a particular type.
We introduce a new benchmarking dataset, VISIOCITY, which comprises longer videos across six different categories.
arXiv Detail & Related papers (2020-07-29T02:44:35Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)