Video Summarization: Towards Entity-Aware Captions
- URL: http://arxiv.org/abs/2312.02188v1
- Date: Fri, 1 Dec 2023 23:56:00 GMT
- Title: Video Summarization: Towards Entity-Aware Captions
- Authors: Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang,
Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang
- Abstract summary: We propose the task of summarizing news video directly to entity-aware captions.
We show that our approach generalizes to an existing news image captioning dataset.
- Score: 75.71891605682931
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing popular video captioning benchmarks and models deal with generic
captions devoid of specific person, place or organization named entities. In
contrast, news videos present a challenging setting where the caption requires
such named entities for meaningful summarization. As such, we propose the task
of summarizing news video directly to entity-aware captions. We also release a
large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task.
Further, we propose a method that augments visual information from videos with
context retrieved from external world knowledge to generate entity-aware
captions. We demonstrate the effectiveness of our approach on three video
captioning models. We also show that our approach generalizes to an existing
news image captioning dataset. Through these extensive experiments and
insights, we believe we establish a solid basis for future research on this
challenging task.
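
As a concrete illustration of the pipeline sketched in the abstract, the snippet below shows one minimal way to augment visual information from a video with context retrieved from external world knowledge before generating a caption. It is a hedged sketch, not the authors' released code: every function name, the prompt format, and the returned strings are hypothetical placeholders standing in for a real video captioner, retrieval index, and captioning backbone.

```python
# A minimal sketch of the retrieval-augmented, entity-aware captioning
# pipeline described in the abstract -- NOT the authors' released code.
# Every function below is a hypothetical placeholder.

from typing import List


def extract_visual_description(video_path: str) -> str:
    """Placeholder: a generic video captioner that describes what is
    visible but, by itself, produces no named entities."""
    return "a politician speaking at a podium in front of flags"


def retrieve_world_context(query: str, top_k: int = 3) -> List[str]:
    """Placeholder: query an external knowledge source (e.g. a news or
    web index) with the visual description to fetch text snippets that
    may contain the missing people, places, and organizations."""
    snippets = [
        "snippet naming the specific person and organization involved ...",
        "snippet with the event location and date ...",
    ]
    return snippets[:top_k]


def generate_entity_aware_caption(visual_desc: str, context: List[str]) -> str:
    """Placeholder: a text-generation model that fuses the visual
    description with the retrieved context so the caption can name
    entities rather than staying generic."""
    prompt = (
        "Video content: " + visual_desc + "\n"
        "Retrieved context: " + " ".join(context) + "\n"
        "One-sentence news caption naming the entities involved:"
    )
    return run_captioning_backbone(prompt)


def run_captioning_backbone(prompt: str) -> str:
    """Placeholder for whichever captioning model is plugged in."""
    return "<entity-aware caption>"


if __name__ == "__main__":
    desc = extract_visual_description("news_clip.mp4")
    ctx = retrieve_world_context(desc)
    print(generate_entity_aware_caption(desc, ctx))
```

In the paper's setting, the backbone placeholder would be replaced by one of the three video captioning models the authors evaluate, and the retriever by their external world-knowledge source; the sketch only fixes the overall data flow, not those components.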
Related papers
- Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video [22.60291297308379]
This paper proposes a novel self-supervised framework for video summarization guided by Large Language Models (LLMs).
Our model achieves competitive results against other state-of-the-art methods and paves a new pathway in video summarization.
arXiv Detail & Related papers (2024-05-14T18:07:04Z)
- Subject-Oriented Video Captioning [64.08594243670296]
We propose a new video captioning task, subject-oriented video captioning, which allows users to specify the describing target via a bounding box.
We construct two subject-oriented video captioning datasets based on two widely used video captioning datasets: MSVD and MSRVTT.
As a first attempt, we evaluate four state-of-the-art general video captioning models and observe a large performance drop.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions, and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and query sentences.
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Enriching Video Captions With Contextual Text [9.994985014558383]
We propose an end-to-end sequence-to-sequence model that generates video captions based on visual input and accompanying contextual text.
We do not preprocess the contextual text further, and let the model directly learn to attend over it.
arXiv Detail & Related papers (2020-07-29T08:58:52Z)