Bridging Vision and Language from the Video-to-Text Perspective: A
Comprehensive Review
- URL: http://arxiv.org/abs/2103.14785v1
- Date: Sat, 27 Mar 2021 02:12:28 GMT
- Authors: Jesus Perez-Martin and Benjamin Bustos and Silvio Jamil F. Guimarães
and Ivan Sipiran and Jorge Pérez and Grethel Coello Said
- Abstract summary: This review categorizes and describes the state-of-the-art techniques for the video-to-text problem.
It covers the main video-to-text methods and the ways to evaluate their performance.
State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Research in the area of Vision and Language encompasses challenging topics
that seek to connect visual and textual information. The video-to-text problem
is one of these topics, in which the goal is to connect an input video with its
textual description. This connection can be made mainly by retrieving the most
relevant descriptions from a corpus or by generating a new one for a given
video. These two approaches represent essential tasks for the Computer Vision
and Natural Language Processing communities: the text retrieval from video
task and the video captioning/description task. These two tasks are substantially more
complex than predicting or retrieving a single sentence from an image. The
spatiotemporal information present in videos introduces diversity and
complexity regarding the visual content and the structure of associated
language descriptions. This review categorizes and describes the
state-of-the-art techniques for the video-to-text problem. It covers the main
video-to-text methods and the ways to evaluate their performance. We analyze
how the most frequently reported benchmark datasets have been created, showing
their strengths and drawbacks with respect to the problem requirements. We
also show the impressive progress that researchers have made on each dataset,
and we analyze why, despite this progress, the video-to-text task remains
unsolved.
State-of-the-art techniques are still a long way from achieving human-like
performance in generating or retrieving video descriptions. We cover several
significant challenges in the field and discuss future research directions.
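As a minimal sketch of the retrieval formulation described above (illustrative
only, not the method of any paper listed here): once videos and sentences are
mapped into a shared embedding space by some vision-language encoder (assumed
here, with random vectors as stand-ins), text-to-video retrieval reduces to
ranking videos by similarity to the query, and Recall@K is a common way to
score the ranking.

```python
# Illustrative sketch: text-to-video retrieval by cosine similarity in a
# shared embedding space. The embeddings below are random stand-ins for the
# output of a (hypothetical) joint vision-language encoder.
import numpy as np

def rank_videos(query_emb: np.ndarray, video_embs: np.ndarray) -> np.ndarray:
    """Return video indices sorted from most to least similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    return np.argsort(-(v @ q))

def recall_at_k(ranking: np.ndarray, gold: int, k: int) -> float:
    """Recall@K for one query: 1.0 if the gold video appears in the top k."""
    return float(gold in ranking[:k])

rng = np.random.default_rng(0)
video_embs = rng.normal(size=(100, 512))                 # 100 videos, 512-d
query_emb = video_embs[42] + 0.1 * rng.normal(size=512)  # query near video 42

ranking = rank_videos(query_emb, video_embs)
print(ranking[:5], recall_at_k(ranking, gold=42, k=5))
```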
Related papers
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for
Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, which during training uses only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- Understanding Video Scenes through Text: Insights from Text-based Video
Question Answering [40.01623654896573]
This paper explores two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content.
We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions.
arXiv Detail & Related papers (2023-09-04T06:11:00Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale and cross-modal Video Retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Deep Learning for Video-Text Retrieval: a Review [13.341694455581363]
Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence.
In this survey, we review and summarize over 100 research papers related to VTR.
arXiv Detail & Related papers (2023-02-24T10:14:35Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question
Answering [80.94367625007352]
We argue that while a video is presented as a frame sequence, its visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not yet reached a conclusive answer.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been applied to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- A Comprehensive Review on Recent Methods and Challenges of Video
Description [11.69687792533269]
Video description involves generating natural language descriptions of the actions, events, and objects in a video.
Video description has various applications, such as bridging the gap between language and vision for visually impaired people.
In the past decade, several works have been published in this field on approaches/methods for video description, evaluation metrics (see the sketch after this list), and datasets.
arXiv Detail & Related papers (2020-11-30T13:08:45Z)
- Text Synopsis Generation for Egocentric Videos [72.52130695707008]
We propose to generate a textual synopsis, consisting of a few sentences describing the most important events in a long egocentric video.
Users can read the short text to gain insight about the video, and more importantly, efficiently search through the content of a large video database.
arXiv Detail & Related papers (2020-05-08T00:28:00Z)
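Several of the surveys above discuss how generated video descriptions are
scored against human references. As a minimal illustration (toy captions, not
drawn from any dataset listed here), sentence-level BLEU can be computed with
NLTK; the papers also cover further metrics such as METEOR, ROUGE-L, and
CIDEr.

```python
# Illustrative sketch: scoring a generated caption against reference
# descriptions with sentence-level BLEU (one of several standard metrics).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is slicing a tomato on a cutting board".split(),
    "someone cuts a tomato in a kitchen".split(),
]
candidate = "a man is cutting a tomato".split()

# Smoothing avoids zero scores when higher-order n-grams have no overlap,
# which is common for short captions.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")
```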