Video Captioning: a comparative review of where we are and which could
be the route
- URL: http://arxiv.org/abs/2204.05976v2
- Date: Wed, 13 Apr 2022 16:13:43 GMT
- Title: Video Captioning: a comparative review of where we are and which could
be the route
- Authors: Daniela Moctezuma, Tania Ramírez-delReal, Guillermo Ruiz, Othón
González-Chávez
- Abstract summary: Video captioning is the process of describing the content of a sequence of images, capturing its semantic relationships and meanings.
This manuscript presents an extensive review of more than 105 papers from the period 2016 to 2021.
- Score: 0.21301560294088315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video captioning is the process of describing the content of a sequence of
images, capturing its semantic relationships and meanings. This task is arduous
for a single image, and more difficult still for a video (or image sequence).
The applications of video captioning are numerous and relevant, for example
handling the large volume of recordings produced by video surveillance or
assisting visually impaired people. To analyze where our community's efforts on
the video captioning task stand, and which route could be better to follow,
this manuscript presents an extensive review of more than 105 papers from the
period 2016 to 2021. As a result, the most-used datasets and metrics are
identified, along with the main approaches used and the best-performing ones.
We compute a set of rankings based on several performance metrics to identify
the method with the best results on the video captioning task. Finally, we
offer some insights into possible next steps or opportunity areas for
improving on this complex task.
Related papers
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [73.62572976072578]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement the token merging strategy, reducing the number of input visual tokens.
AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments reveal challenges in generating a long and comprehensive video summary.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- Video Summarization: Towards Entity-Aware Captions [73.28063602552741]
We propose the task of summarizing news video directly to entity-aware captions.
We show that our approach generalizes to an existing news image caption dataset.
arXiv Detail & Related papers (2023-12-01T23:56:00Z)
- A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval [43.58794386905177]
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime.
This neglects the richness and variety of possible valid descriptions of a video.
We propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
arXiv Detail & Related papers (2023-11-30T18:59:45Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)