Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols
- URL: http://arxiv.org/abs/2311.02538v1
- Date: Sun, 5 Nov 2023 01:45:31 GMT
- Title: Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols
- Authors: Iqra Qasim, Alexander Horsch, Dilip K. Prasad
- Abstract summary: Untrimmed videos have interrelated events, dependencies, context, overlapping events, object-object interactions, domain specificity, and other semantics worth describing.
Dense Video Captioning (DVC) aims at detecting and describing different events in a given video.
- Score: 53.706461356853445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Untrimmed videos have interrelated events, dependencies, context, overlapping
events, object-object interactions, domain specificity, and other semantics
that are worth highlighting while describing a video in natural language. Owing
to such a vast diversity, a single sentence can only correctly describe a
portion of the video. Dense Video Captioning (DVC) aims at detecting and
describing different events in a given video. The term DVC originated in the
2017 ActivityNet challenge, after which considerable effort has been made to
address the challenge. Dense Video Captioning is divided into three sub-tasks:
(1) Video Feature Extraction (VFE), (2) Temporal Event Localization (TEL), and
(3) Dense Caption Generation (DCG). This review aims to discuss all the studies
that claim to perform DVC along with its sub-tasks and summarize their results.
We also discuss all the datasets that have been used for DVC. Lastly, we
highlight some emerging challenges and future trends in the field.
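
To make the three-sub-task decomposition above concrete, here is a minimal, self-contained sketch of how VFE, TEL, and DCG could be chained. All names, the thresholding heuristic for localization, and the template-based captioner are illustrative assumptions, not the pipeline of any surveyed system.

```python
# Toy sketch of the DVC decomposition: (1) VFE -> (2) TEL -> (3) DCG.
# Every component below is a stand-in for illustration only.
from dataclasses import dataclass

@dataclass
class Event:
    start: float          # event start (frame index, stands in for seconds)
    end: float            # event end
    caption: str = ""     # filled in by the caption generator

def extract_features(video_frames):
    """(1) Video Feature Extraction: map raw frames to per-frame features.
    Here a stand-in that averages pixel values per frame."""
    return [sum(frame) / len(frame) for frame in video_frames]

def localize_events(features, threshold=0.5):
    """(2) Temporal Event Localization: group consecutive 'active' frames
    (feature above a threshold) into candidate event segments."""
    events, start = [], None
    for t, f in enumerate(features):
        if f > threshold and start is None:
            start = t
        elif f <= threshold and start is not None:
            events.append(Event(start=float(start), end=float(t)))
            start = None
    if start is not None:
        events.append(Event(start=float(start), end=float(len(features))))
    return events

def generate_captions(events, features):
    """(3) Dense Caption Generation: describe each localized segment.
    A real system would condition a language decoder on segment features."""
    for ev in events:
        seg = features[int(ev.start):int(ev.end)]
        ev.caption = (f"An event from frame {ev.start:.0f} to {ev.end:.0f} "
                      f"(mean activity {sum(seg) / len(seg):.2f}).")
    return events

if __name__ == "__main__":
    # Toy "video": 8 frames of 4 pixels each, with one high-activity burst.
    video = [[0.1] * 4, [0.2] * 4, [0.9] * 4, [0.8] * 4,
             [0.7] * 4, [0.1] * 4, [0.2] * 4, [0.1] * 4]
    feats = extract_features(video)
    for event in generate_captions(localize_events(feats), feats):
        print(event.start, event.end, event.caption)
```

In the surveyed systems, the feature extractor is typically a pretrained 2D/3D CNN or video Transformer, the localizer a proposal- or query-based module, and the captioner an attention-based language decoder conditioned on the localized segment features.
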
Related papers
- ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding aims to ground specific segments within an untrimmed video corresponding to a given natural language query.
Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases.
We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z)
- A Survey of Video Datasets for Grounded Event Understanding [34.11140286628736]
Multimodal AI systems must be capable of well-rounded common-sense reasoning akin to human visual understanding.
We survey 105 video datasets that require event understanding capability.
arXiv Detail & Related papers (2024-06-14T00:36:55Z)
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, over 10x longer than the captions in most video-text datasets.
The accompanying model trained on Vript is capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z) - A Review of Deep Learning for Video Captioning [111.1557921247882]
Video captioning (VC) is a fast-moving, cross-disciplinary area of research.
This survey covers deep learning-based VC, including, but not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, and dense video captioning (DVC).
arXiv Detail & Related papers (2023-04-22T15:30:54Z) - Grounded Video Situation Recognition [37.279915290069326]
We present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions.
Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly.
arXiv Detail & Related papers (2022-10-19T18:38:10Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural
Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video
Captioning and Video Question Answering [0.0]
We propose iPerceive, a framework capable of understanding the "why" between events in a video.
We demonstrate the effectiveness of iPerceive on dense video captioning and video question answering, formulating both tasks as machine translation problems.
Our approach furthers the state-of-the-art in visual understanding.
arXiv Detail & Related papers (2020-11-16T05:44:45Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.