Video Summarization: Towards Entity-Aware Captions
- URL: http://arxiv.org/abs/2312.02188v1
- Date: Fri, 1 Dec 2023 23:56:00 GMT
- Title: Video Summarization: Towards Entity-Aware Captions
- Authors: Hammad A. Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang,
Anurag Arnab, Feng Han, Yukun Zhu, Jialu Liu, Shih-Fu Chang
- Abstract summary: We propose the task of summarizing news video directly to entity-aware captions.
We show that our approach generalizes to an existing news image captioning dataset.
- Score: 75.71891605682931
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing popular video captioning benchmarks and models deal with generic
captions devoid of specific person, place or organization named entities. In
contrast, news videos present a challenging setting where the caption requires
such named entities for meaningful summarization. As such, we propose the task
of summarizing news video directly to entity-aware captions. We also release a
large-scale dataset, VIEWS (VIdeo NEWS), to support research on this task.
Further, we propose a method that augments visual information from videos with
context retrieved from external world knowledge to generate entity-aware
captions. We demonstrate the effectiveness of our approach on three video
captioning models. We also show that our approach generalizes to an existing
news image captioning dataset. Through these extensive experiments and
insights, we believe we establish a solid basis for future research on this
challenging task.
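
As a concrete illustration of the pipeline sketched in the abstract, the snippet below shows one minimal way to augment visual information from a video with context retrieved from external world knowledge before generating a caption. It is a hedged sketch, not the authors' released code: every function name, the prompt format, and the returned strings are hypothetical placeholders standing in for a real video captioner, retrieval index, and captioning backbone.

```python
# A minimal sketch of the retrieval-augmented, entity-aware captioning
# pipeline described in the abstract -- NOT the authors' released code.
# Every function below is a hypothetical placeholder.

from typing import List


def extract_visual_description(video_path: str) -> str:
    """Placeholder: a generic video captioner that describes what is
    visible but, by itself, produces no named entities."""
    return "a politician speaking at a podium in front of flags"


def retrieve_world_context(query: str, top_k: int = 3) -> List[str]:
    """Placeholder: query an external knowledge source (e.g. a news or
    web index) with the visual description to fetch text snippets that
    may contain the missing people, places, and organizations."""
    snippets = [
        "snippet naming the specific person and organization involved ...",
        "snippet with the event location and date ...",
    ]
    return snippets[:top_k]


def generate_entity_aware_caption(visual_desc: str, context: List[str]) -> str:
    """Placeholder: a text-generation model that fuses the visual
    description with the retrieved context so the caption can name
    entities rather than staying generic."""
    prompt = (
        "Video content: " + visual_desc + "\n"
        "Retrieved context: " + " ".join(context) + "\n"
        "One-sentence news caption naming the entities involved:"
    )
    return run_captioning_backbone(prompt)


def run_captioning_backbone(prompt: str) -> str:
    """Placeholder for whichever captioning model is plugged in."""
    return "<entity-aware caption>"


if __name__ == "__main__":
    desc = extract_visual_description("news_clip.mp4")
    ctx = retrieve_world_context(desc)
    print(generate_entity_aware_caption(desc, ctx))
```

In the paper's setting, the backbone placeholder would be replaced by one of the three video captioning models the authors evaluate, and the retriever by their external world-knowledge source; the sketch only fixes the overall data flow, not those components.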
Related papers
- Language-Guided Self-Supervised Video Summarization Using Text Semantic Matching Considering the Diversity of the Video [22.60291297308379]
This paper proposes a novel self-supervised framework for video summarization guided by Large Language Models (LLMs).
Our model achieves competitive results against other state-of-the-art methods and paves a new pathway in video summarization.
arXiv Detail & Related papers (2024-05-14T18:07:04Z)
- Subject-Oriented Video Captioning [64.08594243670296]
We propose a new video captioning task, subject-oriented video captioning, which allows users to specify the describing target via a bounding box.
We construct two subject-oriented video captioning datasets based on two widely used video captioning datasets: MSVD and MSRVTT.
As a first attempt, we evaluate four state-of-the-art general video captioning models and observe a large performance drop.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which fails to provide the context information between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions, and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and query sentences.
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
- CLIP-It! Language-Guided Video Summarization [96.69415453447166]
This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization.
We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another.
Our model can be extended to the unsupervised setting by training without ground-truth supervision.
arXiv Detail & Related papers (2021-07-01T17:59:27Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Enriching Video Captions With Contextual Text [9.994985014558383]
We propose an end-to-end sequence-to-sequence model that generates video captions based on visual input and accompanying contextual text.
We do not preprocess the contextual text further, and let the model directly learn to attend over it.
arXiv Detail & Related papers (2020-07-29T08:58:52Z)