How Good is a Video Summary? A New Benchmarking Dataset and Evaluation
Framework Towards Realistic Video Summarization
- URL: http://arxiv.org/abs/2101.10514v1
- Date: Tue, 26 Jan 2021 01:42:55 GMT
- Title: How Good is a Video Summary? A New Benchmarking Dataset and Evaluation
Framework Towards Realistic Video Summarization
- Authors: Vishal Kaushal, Suraj Kothawade, Anshul Tomar, Rishabh Iyer, Ganesh
Ramakrishnan
- Abstract summary: We introduce a new benchmarking video dataset called VISIOCITY, which comprises longer videos across six different categories.
We present strategies to automatically generate multiple reference summaries from the indirect ground truth present in VISIOCITY.
We propose an evaluation framework for better quantitative assessment of summary quality which is closer to human judgment.
- Score: 11.320914099324492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic video summarization is still an unsolved problem due to several
challenges. The currently available datasets either have very short videos or
have few long videos of only a particular type. We introduce a new benchmarking
video dataset called VISIOCITY (VIdeo SummarIzatiOn based on Continuity, Intent
and DiversiTY) which comprises longer videos across six different categories
with dense concept annotations capable of supporting different flavors of video
summarization and other vision problems. For long videos, human reference
summaries necessary for supervised video summarization techniques are difficult
to obtain. We explore strategies to automatically generate multiple reference
summaries from indirect ground truth present in VISIOCITY. We show that these
summaries are on par with human summaries. We also present a study of different
desired characteristics of a good summary and demonstrate how it is normal to
have two good summaries with different characteristics. Thus we argue that
evaluating a summary against one or more human summaries and using a single
measure has its shortcomings. We propose an evaluation framework for better
quantitative assessment of summary quality which is closer to human judgment.
Lastly, we present insights into how a model can be enhanced to yield better
summaries. Specifically, when multiple diverse ground truth summaries can
exist, learning from them individually and using a combination of loss
functions measuring different characteristics is better than learning from a
single combined (oracle) ground truth summary using a single loss function. We
demonstrate the effectiveness of doing so as compared to some of the
representative state-of-the-art techniques tested on VISIOCITY. We release
VISIOCITY as a benchmarking dataset and invite researchers to test the
effectiveness of their video summarization algorithms on VISIOCITY.
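The abstract's evaluation argument can be made concrete with a short sketch: score a candidate summary against several reference summaries using several per-characteristic measures, then aggregate. This is a minimal illustration under stated assumptions, not the code released with VISIOCITY; the max-over-references step, the equal-weight overall score, and the shot-level F1 measure are all illustrative.

```python
# Hedged sketch of multi-reference, multi-measure summary evaluation.
# Measure names and the aggregation scheme are illustrative assumptions,
# not the exact VISIOCITY formulation.

def evaluate_summary(candidate, references, measures):
    """candidate:  set of selected shot ids
    references: list of reference summaries, each a set of shot ids
    measures:   dict mapping measure name -> fn(candidate, ref) -> [0, 1]
    """
    scores = {}
    for name, fn in measures.items():
        # Credit the candidate for its *closest* reference per measure:
        # two good summaries can legitimately differ, so comparing against
        # a single merged reference would unfairly penalize either of them.
        scores[name] = max(fn(candidate, ref) for ref in references)
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores

def f1_overlap(candidate, reference):
    """Example measure: shot-level F1 between two summaries."""
    tp = len(candidate & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(candidate)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# Usage: evaluate_summary({1, 4, 7}, [{1, 4}, {4, 7, 9}], {"f1": f1_overlap})
```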
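The training insight (learning from diverse ground truth summaries individually, with a combination of losses measuring different characteristics) can be sketched similarly. The specific loss terms, the min-over-references reduction, and the weight `w_div` below are placeholders and may differ from the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

# Hedged sketch: combine per-characteristic losses and learn from each
# ground truth summary individually, rather than from a single merged
# "oracle" summary with one loss. All terms here are illustrative.

def importance_loss(pred_scores, ref_mask):
    # pred_scores: sigmoid outputs in [0, 1], shape (num_shots,).
    # How well predicted shot scores reproduce one reference's selection.
    return F.binary_cross_entropy(pred_scores, ref_mask)

def diversity_loss(pred_scores, feats):
    # Penalize putting high scores on visually similar shots:
    # pairwise cosine similarity weighted by joint selection scores.
    sim = F.cosine_similarity(feats.unsqueeze(0), feats.unsqueeze(1), dim=-1)
    joint = pred_scores.unsqueeze(0) * pred_scores.unsqueeze(1)
    return (joint * sim).mean()

def multi_reference_loss(pred_scores, feats, ref_masks, w_div=0.1):
    # Reward reproducing *any one* valid reference (min over references)
    # rather than an averaged oracle, plus a diversity term.
    per_ref = torch.stack([importance_loss(pred_scores, m) for m in ref_masks])
    return per_ref.min() + w_div * diversity_loss(pred_scores, feats)
```

The min over per-reference losses lets the model commit to one coherent summary instead of regressing toward a blend of all references, which is one plausible reading of "learning from them individually".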
Related papers
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos, each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- A Modular Approach for Multimodal Summarization of TV Shows [55.20132267309382]
We present a modular approach where separate components perform specialized sub-tasks.
Our modules involve detecting scene boundaries, reordering scenes so as to minimize the number of cuts between different events, converting visual information to text, summarizing the dialogue in each scene, and fusing the scene summaries into a final summary for the entire episode.
We also present a new metric, PRISMA, to measure both precision and recall of generated summaries, which we decompose into atomic facts.
arXiv Detail & Related papers (2024-03-06T16:10:01Z)
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video [34.202514532882]
We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary.
arXiv Detail & Related papers (2023-05-08T16:24:46Z)
- VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Mitigating Representation Bias in Action Recognition: Algorithms and Benchmarks [76.35271072704384]
Deep learning models perform poorly when applied to videos with rare scenes or objects.
We tackle this problem from two different angles: algorithm and dataset.
We show that the debiased representation can generalize better when transferred to other datasets and tasks.
arXiv Detail & Related papers (2022-09-20T00:30:35Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- Unsupervised Video Summarization via Multi-source Features [4.387757291346397]
Video summarization aims at generating a compact yet representative visual summary that conveys the essence of the original video.
We propose the incorporation of multiple feature sources with chunk and stride fusion to provide more information about the visual content.
For a comprehensive evaluation on the two benchmarks TVSum and SumMe, we compare our method with four state-of-the-art approaches.
arXiv Detail & Related papers (2021-05-26T13:12:46Z)
- Realistic Video Summarization through VISIOCITY: A New Benchmark and Evaluation Framework [15.656965429236235]
We take steps towards making automatic video summarization more realistic by addressing several challenges.
Firstly, the currently available datasets either have very short videos or have few long videos of only a particular type.
We introduce a new benchmarking dataset, VISIOCITY, which comprises longer videos across six different categories.
arXiv Detail & Related papers (2020-07-29T02:44:35Z)