Video-CSR: Complex Video Digest Creation for Visual-Language Models
- URL: http://arxiv.org/abs/2310.05060v1
- Date: Sun, 8 Oct 2023 08:02:43 GMT
- Title: Video-CSR: Complex Video Digest Creation for Visual-Language Models
- Authors: Tingkai Liu, Yunzhe Tao, Haogeng Liu, Qihang Fan, Ding Zhou, Huaibo
Huang, Ran He, Hongxia Yang
- Abstract summary: We present a novel task and human-annotated dataset for evaluating the ability of visual-language models to generate captions and summaries for real-world video clips.
The dataset contains 4.8K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
- Score: 71.66614561702131
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a novel task and human-annotated dataset for evaluating the
ability of visual-language models to generate captions and summaries for
real-world video clips, which we call Video-CSR (Captioning, Summarization and
Retrieval). The dataset contains 4.8K YouTube video clips of 20-60 seconds in
duration and covers a wide range of topics and interests. Each video clip
corresponds to 5 independently annotated captions (1 sentence) and summaries
(3-10 sentences). Given any video selected from the dataset and its
corresponding ASR information, we evaluate visual-language models on either
caption or summary generation that is grounded in both the visual and auditory
content of the video. Models are also evaluated on caption- and
summary-based retrieval tasks, where the summary-based retrieval task requires
the identification of a target video given excerpts of a corresponding summary.
Given the novel nature of the paragraph-length video summarization task, we
perform extensive comparative analyses of different existing evaluation metrics
and their alignment with human preferences. Finally, we propose a foundation
model with competitive generation and retrieval capabilities that serves as a
baseline for the Video-CSR task. We aim for Video-CSR to serve as a useful
evaluation set in the age of large language models and complex multi-modal
tasks.
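To make the evaluation protocol concrete, below is a minimal Python sketch of how Video-CSR-style scoring might be wired up: a generated caption is scored against the five human references (taking the max ROUGE-L F1 over references, a common multi-reference convention), and retrieval is measured as recall@k over a model-produced text-video similarity matrix. This is an illustrative sketch, not the authors' released code; the `rouge-score` package and every function and variable name here are assumptions.

```python
# Minimal sketch of Video-CSR-style scoring (illustrative, not the
# authors' code). Assumes: pip install rouge-score numpy
import numpy as np
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def caption_score(prediction: str, references: list[str]) -> float:
    """Score one generated caption against the 5 human references,
    taking the max ROUGE-L F1 over references (a common convention)."""
    return max(
        scorer.score(ref, prediction)["rougeL"].fmeasure for ref in references
    )

def recall_at_k(similarity: np.ndarray, k: int = 5) -> float:
    """Text-to-video recall@k. similarity[i, j] is the model's score
    between text query i and video j; query i's target is video i."""
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    hits = (top_k == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())
```

With five references per clip, taking the max rewards a caption that matches any one annotator's phrasing; for paragraph-length summaries, the abstract's point stands that such n-gram metrics should themselves be checked against human preferences.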
Related papers
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- VideoXum: Cross-modal Visual and Textual Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z)
- Video Summarization Based on Video-text Modelling [0.0]
We propose a multimodal self-supervised learning framework to obtain semantic representations of videos.
We also introduce a progressive video summarization method, where the important content in a video is pinpointed progressively to generate better summaries.
An objective evaluation framework is proposed to measure the quality of video summaries based on video classification.
arXiv Detail & Related papers (2022-01-07T15:21:46Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another (a simplified frame-scoring sketch follows this list).
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates that pass more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
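As a loose illustration of the language-guided frame-scoring idea in the CLIP-It entry above, the sketch below ranks video frames by their CLIP image-text similarity to a query. This is a greatly simplified stand-in for the paper's learned multimodal transformer scorer; the Hugging Face `transformers` checkpoint and all names here are assumptions, not the paper's implementation.

```python
# Illustrative language-guided frame scoring (a simplified stand-in for
# CLIP-It's learned transformer scorer, NOT the paper's actual model).
# Assumes: pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_frames(frames: list[Image.Image], query: str) -> torch.Tensor:
    """Return a softmax importance distribution over frames for the query."""
    inputs = processor(text=[query], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (num_frames, 1)
    return logits.squeeze(-1).softmax(dim=0)

# Usage: pick the top-k frames as a query-focused summary, e.g.
# scores = score_frames(frames, "a person assembling furniture")
# summary_idx = scores.topk(k=5).indices
```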