Video Captioning: a comparative review of where we are and which could be the route
- URL: http://arxiv.org/abs/2204.05976v2
- Date: Wed, 13 Apr 2022 16:13:43 GMT
- Title: Video Captioning: a comparative review of where we are and which could be the route
- Authors: Daniela Moctezuma, Tania Ramírez-delReal, Guillermo Ruiz, Othón González-Chávez
- Abstract summary: Video captioning is the task of describing the content of a sequence of images while capturing its semantic relationships and meanings.
This manuscript presents an extensive review of more than 105 papers published between 2016 and 2021.
- Score: 0.21301560294088315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video captioning is the task of describing the content of a
sequence of images while capturing its semantic relationships and meanings.
Performing this task on a single image is already arduous, let alone on a
video (or image sequence). Video captioning has many highly relevant
applications, from handling the large volume of recordings produced by video
surveillance to assisting visually impaired people, to mention a few. To
analyze where the community's efforts to solve the video captioning task
stand, and which route may be best to follow, this manuscript presents an
extensive review of more than 105 papers published between 2016 and 2021. As
a result, the most-used datasets and metrics are identified, along with the
main approaches and the best-performing ones. We compute a set of rankings
based on several performance metrics to determine which methods achieve the
best results on the video captioning task. Finally, we draw some insights
about possible next steps and opportunity areas for improving on this
complex task.
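The abstract describes aggregating several per-metric rankings into an overall ranking of methods. Below is a minimal sketch of one plausible aggregation rule (average rank across metrics); the method names, the metric choices (BLEU-4, METEOR, CIDEr are common video captioning metrics), and the scores are hypothetical, and the paper's exact ranking procedure may differ.

```python
# Hypothetical per-metric scores: method -> {metric: value}.
# Higher is better for all three metrics used here.
scores = {
    "MethodA": {"BLEU-4": 0.42, "METEOR": 0.29, "CIDEr": 0.51},
    "MethodB": {"BLEU-4": 0.45, "METEOR": 0.27, "CIDEr": 0.48},
    "MethodC": {"BLEU-4": 0.40, "METEOR": 0.31, "CIDEr": 0.55},
}
metrics = ["BLEU-4", "METEOR", "CIDEr"]

def average_rank(scores, metrics):
    """Rank methods per metric (1 = best), then average ranks across metrics."""
    total = {name: 0 for name in scores}
    for metric in metrics:
        # Sort methods by this metric, best first.
        ordered = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
        for position, name in enumerate(ordered, start=1):
            total[name] += position
    return {name: total[name] / len(metrics) for name in total}

overall = average_rank(scores, metrics)
for name, mean_rank in sorted(overall.items(), key=lambda item: item[1]):
    print(f"{name}: mean rank {mean_rank:.2f}")
```

With these illustrative scores, MethodC wins on two of the three metrics and therefore obtains the best (lowest) mean rank, even though MethodB has the single highest BLEU-4 score.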
Related papers
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement a token merging strategy to reduce the number of input visual tokens (a simplified sketch of token merging appears after this list).
AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.53311308617818]
We present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions, comprehensive video summaries, and question-answering pairs.
Preliminary experiments show the challenges of generating long and comprehensive summaries for multi-shot videos.
Even these imperfect generated summaries already achieve competitive performance on existing video understanding tasks.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- Video Summarization: Towards Entity-Aware Captions [73.28063602552741]
We propose the task of summarizing news videos directly into entity-aware captions.
We show that our approach generalizes to existing news image captioning datasets.
arXiv Detail & Related papers (2023-12-01T23:56:00Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm [0.0]
This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind English captioning.
arXiv Detail & Related papers (2022-02-11T06:29:25Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been applied to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
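The AuroraCap entry above mentions token merging as its way of shrinking the visual token count, likely in the spirit of ToMe (Bolya et al., 2022). Below is a minimal greedy sketch of the general idea, iteratively averaging the most similar token pair under cosine similarity; it illustrates the technique only and is not AuroraCap's actual implementation, and the token shapes in the usage example are hypothetical.

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Reduce an (n, d) token matrix to (n - r, d) by merging similar pairs."""
    tokens = tokens.astype(float).copy()
    for _ in range(r):
        # Cosine similarity between all pairs of current tokens.
        normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # a token cannot merge with itself
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        tokens[i] = (tokens[i] + tokens[j]) / 2.0  # merge the pair into one token
        tokens = np.delete(tokens, j, axis=0)      # drop the absorbed token
    return tokens

# Hypothetical usage: 196 ViT patch tokens for one frame, reduced by half.
frame_tokens = np.random.randn(196, 768)
reduced = merge_tokens(frame_tokens, r=98)
print(reduced.shape)  # (98, 768)
```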