A Comprehensive Review on Recent Methods and Challenges of Video Description
- URL: http://arxiv.org/abs/2011.14752v1
- Date: Mon, 30 Nov 2020 13:08:45 GMT
- Title: A Comprehensive Review on Recent Methods and Challenges of Video Description
- Authors: Alok Singh, Thoudam Doren Singh, Sivaji Bandyopadhyay
- Abstract summary: Video description involves the generation of the natural language description of actions, events, and objects in the video.
Video description has various applications, such as bridging the gap between language and vision for visually impaired people.
In the past decade, several works have been done in this field in terms of approaches/methods for video description, evaluation metrics, and datasets.
- Score: 11.69687792533269
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video description involves the generation of natural language descriptions of the actions, events, and objects in a video. Video description has various applications, such as bridging the gap between language and vision for visually impaired people, generating automatic title suggestions based on content, content-based video browsing, and video-guided machine translation [86]. In the past decade, several works have been done in this field in terms of approaches/methods for video description, evaluation metrics, and datasets. To analyze the progress in the video description task, a comprehensive survey is needed that covers all the phases of video description approaches, with a special focus on recent deep learning approaches. In this work, we report a comprehensive survey on the phases of video description approaches, the datasets for video description, evaluation metrics, open competitions for motivating research on video description, open challenges in this field, and future research directions. In this survey, we cover the state-of-the-art approaches proposed for every dataset along with their pros and cons. For the growth of this research domain, the availability of numerous benchmark datasets is a basic need. Further, we categorize all the datasets into two classes: open-domain datasets and domain-specific datasets. From our survey, we observe that work in this field is progressing at a fast pace, since the task of video description lies at the intersection of computer vision and natural language processing. Still, work on video description is far from the saturation stage due to various challenges, such as redundancy from similar frames, which degrades the quality of visual features; the limited availability of datasets with more diverse content; and the lack of an effective evaluation metric.
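The abstract singles out redundancy from near-identical frames as one factor that degrades visual features. As a rough illustration only (not a method from the paper), the following minimal Python sketch drops frames whose mean absolute pixel difference from the last kept frame falls below a threshold; the function name `filter_redundant_frames` and the threshold value are assumptions made here for illustration.

```python
import numpy as np

def filter_redundant_frames(frames, threshold=8.0):
    """Keep a frame only if it differs enough from the last kept frame.

    `frames` is an iterable of H x W x C uint8 arrays; `threshold` is the
    mean absolute pixel difference (on a 0-255 scale) below which a frame
    is treated as a near-duplicate and dropped. (Illustrative values only.)
    """
    kept = []
    last = None
    for frame in frames:
        frame = frame.astype(np.float32)
        if last is None or np.mean(np.abs(frame - last)) > threshold:
            kept.append(frame.astype(np.uint8))
            last = frame
    return kept

# Toy usage with synthetic frames: 30 nearly identical frames followed by
# 10 distinct random ones; the filter keeps roughly one frame per "scene"
# plus the distinct frames, reducing redundant input to a captioning model.
rng = np.random.default_rng(0)
static = [np.full((64, 64, 3), 128, dtype=np.uint8)] * 30
moving = [rng.integers(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]
print(len(filter_redundant_frames(static + moving)))
```

Real systems typically use keyframe selection or learned temporal sampling rather than raw pixel differencing, but the sketch shows the basic idea of pruning near-duplicate frames before feature extraction.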
Related papers
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks.
arXiv Detail & Related papers (2023-03-29T02:33:54Z) - Deep Learning for Video-Text Retrieval: a Review [13.341694455581363]
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence.
In this survey, we review and summarize over 100 research papers related to VTR.
arXiv Detail & Related papers (2023-02-24T10:14:35Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z) - Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review [1.0520692160489133]
This review categorizes and describes the state-of-the-art techniques for the video-to-text problem.
It covers the main video-to-text methods and the ways to evaluate their performance.
State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions.
arXiv Detail & Related papers (2021-03-27T02:12:28Z) - QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z) - A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus [31.387948069111893]
We show how to identify a short segment in a long video that semantically matches a text query.
To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER) that encodes a video at both the coarse-grained clip level and the fine-grained frame level.
We conduct extensive experiments to evaluate our model on moment localization in video corpus on ActivityNet Captions and TVR datasets.
arXiv Detail & Related papers (2020-11-18T02:42:36Z) - Text Synopsis Generation for Egocentric Videos [72.52130695707008]
We propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video.
Users can read the short text to gain insight about the video, and more importantly, efficiently search through the content of a large video database.
arXiv Detail & Related papers (2020-05-08T00:28:00Z)