Video Captioning: a comparative review of where we are and which could be the route
- URL: http://arxiv.org/abs/2204.05976v2
- Date: Wed, 13 Apr 2022 16:13:43 GMT
- Title: Video Captioning: a comparative review of where we are and which could be the route
- Authors: Daniela Moctezuma, Tania Ramírez-delReal, Guillermo Ruiz, Othón González-Chávez
- Abstract summary: Video captioning is the task of describing the content of a sequence of images while capturing its semantic relationships and meanings.
This manuscript presents an extensive review of more than 105 papers published between 2016 and 2021.
- Score: 0.21301560294088315
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video captioning is the task of describing the content of a
sequence of images while capturing its semantic relationships and meanings.
Performing this task on a single image is already arduous, let alone on a
video (or image sequence). Video captioning has many highly relevant
applications, from handling the large volume of recordings produced by video
surveillance to assisting visually impaired people, to mention a few. To
analyze where the community's efforts to solve the video captioning task
stand, and which route may be best to follow, this manuscript presents an
extensive review of more than 105 papers published between 2016 and 2021. As
a result, the most-used datasets and metrics are identified, along with the
main approaches and the best-performing ones. We compute a set of rankings
based on several performance metrics to determine which methods achieve the
best results on the video captioning task. Finally, we draw some insights
about possible next steps and opportunity areas for improving on this
complex task.
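The abstract describes aggregating several per-metric rankings into an overall ranking of methods. Below is a minimal sketch of one plausible aggregation rule (average rank across metrics); the method names, the metric choices (BLEU-4, METEOR, CIDEr are common video captioning metrics), and the scores are hypothetical, and the paper's exact ranking procedure may differ.

```python
# Hypothetical per-metric scores: method -> {metric: value}.
# Higher is better for all three metrics used here.
scores = {
    "MethodA": {"BLEU-4": 0.42, "METEOR": 0.29, "CIDEr": 0.51},
    "MethodB": {"BLEU-4": 0.45, "METEOR": 0.27, "CIDEr": 0.48},
    "MethodC": {"BLEU-4": 0.40, "METEOR": 0.31, "CIDEr": 0.55},
}
metrics = ["BLEU-4", "METEOR", "CIDEr"]

def average_rank(scores, metrics):
    """Rank methods per metric (1 = best), then average ranks across metrics."""
    total = {name: 0 for name in scores}
    for metric in metrics:
        # Sort methods by this metric, best first.
        ordered = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
        for position, name in enumerate(ordered, start=1):
            total[name] += position
    return {name: total[name] / len(metrics) for name in total}

overall = average_rank(scores, metrics)
for name, mean_rank in sorted(overall.items(), key=lambda item: item[1]):
    print(f"{name}: mean rank {mean_rank:.2f}")
```

With these illustrative scores, MethodC wins on two of the three metrics and therefore obtains the best (lowest) mean rank, even though MethodB has the single highest BLEU-4 score.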
Related papers
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement a token merging strategy to reduce the number of input visual tokens (a simplified sketch of token merging appears after this list).
AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.53311308617818]
We present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions, comprehensive video summaries, and question-answering pairs.
Preliminary experiments show the challenges of generating long and comprehensive summaries for multi-shot videos.
Even these imperfect generated summaries already achieve competitive performance on existing video understanding tasks.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- Video Summarization: Towards Entity-Aware Captions [73.28063602552741]
We propose the task of summarizing news videos directly into entity-aware captions.
We show that our approach generalizes to existing news image captioning datasets.
arXiv Detail & Related papers (2023-12-01T23:56:00Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Bench-Marking And Improving Arabic Automatic Image Captioning Through The Use Of Multi-Task Learning Paradigm [0.0]
This paper explores methods and techniques that could enhance the performance of Arabic image captioning.
The use of multi-task learning and pre-trained word embeddings noticeably enhanced the quality of image captioning.
However, the presented results show that Arabic captioning still lags behind English captioning.
arXiv Detail & Related papers (2022-02-11T06:29:25Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been applied to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
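The AuroraCap entry above mentions token merging as its way of shrinking the visual token count, likely in the spirit of ToMe (Bolya et al., 2022). Below is a minimal greedy sketch of the general idea, iteratively averaging the most similar token pair under cosine similarity; it illustrates the technique only and is not AuroraCap's actual implementation, and the token shapes in the usage example are hypothetical.

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Reduce an (n, d) token matrix to (n - r, d) by merging similar pairs."""
    tokens = tokens.astype(float).copy()
    for _ in range(r):
        # Cosine similarity between all pairs of current tokens.
        normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # a token cannot merge with itself
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        tokens[i] = (tokens[i] + tokens[j]) / 2.0  # merge the pair into one token
        tokens = np.delete(tokens, j, axis=0)      # drop the absorbed token
    return tokens

# Hypothetical usage: 196 ViT patch tokens for one frame, reduced by half.
frame_tokens = np.random.randn(196, 768)
reduced = merge_tokens(frame_tokens, r=98)
print(reduced.shape)  # (98, 768)
```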