A Review of Deep Learning for Video Captioning
- URL: http://arxiv.org/abs/2304.11431v1
- Date: Sat, 22 Apr 2023 15:30:54 GMT
- Title: A Review of Deep Learning for Video Captioning
- Authors: Moloud Abdar, Meenakshi Kollati, Swaraja Kuraparthi, Farhad Pourpanah,
Daniel McDuff, Mohammad Ghavamzadeh, Shuicheng Yan, Abduallah Mohamed, Abbas
Khosravi, Erik Cambria, Fatih Porikli
- Abstract summary: Video captioning (VC) is a fast-moving, cross-disciplinary area of research.
This survey covers deep learning-based VC, including, but not limited to, attention-based architectures, graph networks, reinforcement learning, adversarial networks, dense video captioning (DVC), and more.
- Score: 111.1557921247882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning (VC) is a fast-moving, cross-disciplinary area of research
that bridges work in the fields of computer vision, natural language processing
(NLP), linguistics, and human-computer interaction. In essence, VC involves
understanding a video and describing it with language. Captioning is used in a
host of applications from creating more accessible interfaces (e.g., low-vision
navigation) to video question answering (V-QA), video retrieval and content
generation. This survey covers deep learning-based VC, including, but not
limited to, attention-based architectures, graph networks, reinforcement
learning, adversarial networks, dense video captioning (DVC), and more. We
discuss the datasets and evaluation metrics used in the field, and limitations,
applications, challenges, and future directions for VC.
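Most deep learning-based VC systems of the kind this survey reviews follow an attention-based encoder-decoder pattern: pre-extracted frame features are encoded, and a language decoder attends over them while generating the caption token by token. The PyTorch sketch below illustrates only that general pattern; the feature dimension, vocabulary size, and module choices are illustrative assumptions, not a specific model from the survey.

```python
# Minimal sketch of an attention-based encoder-decoder video captioner.
# All sizes (feat_dim, hidden_dim, vocab_size) are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionVideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, hidden_dim)   # project CNN frame features
        self.embed = nn.Embedding(vocab_size, hidden_dim)   # caption token embeddings
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.decoder = nn.GRU(hidden_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)         # next-token logits

    def forward(self, frame_feats, captions):
        # frame_feats: (batch, num_frames, feat_dim) pre-extracted visual features
        # captions:    (batch, seq_len) token ids of the shifted ground-truth caption
        memory = self.frame_proj(frame_feats)                # (B, T, H)
        tokens = self.embed(captions)                        # (B, L, H)
        # each decoding step attends over all frame features
        context, _ = self.attn(tokens, memory, memory)       # (B, L, H)
        hidden, _ = self.decoder(torch.cat([tokens, context], dim=-1))
        return self.out(hidden)                              # (B, L, vocab_size)


# usage: 8 videos, 16 frames each, captions of length 12
feats = torch.randn(8, 16, 2048)
caps = torch.randint(0, 10000, (8, 12))
logits = AttentionVideoCaptioner()(feats, caps)
print(logits.shape)  # torch.Size([8, 12, 10000])
```

In practice the per-frame features would come from a pretrained visual backbone, and training would minimize cross-entropy between the predicted logits and the next caption tokens.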
Related papers
- ViLCo-Bench: VIdeo Language COntinual learning Benchmark [8.660555226687098]
We present ViLCo-Bench, designed to evaluate continual learning models across a range of video-text tasks.
The dataset comprises ten-minute-long videos and corresponding language queries collected from publicly available datasets.
We introduce a novel memory-efficient framework that incorporates self-supervised learning and mimics long-term and short-term memory effects.
arXiv Detail & Related papers (2024-06-19T00:38:19Z) - VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both the vision perception and answer generation processes.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation
Protocols [53.706461356853445]
Untrimmed videos have interrelated events, dependencies, context, overlapping events, object-object interactions, domain specificity, and other semantics worth describing.
Dense Video Captioning (DVC) aims at detecting and describing different events in a given video.
arXiv Detail & Related papers (2023-11-05T01:45:31Z) - Video Question Answering Using CLIP-Guided Visual-Text Attention [17.43377106246301]
Cross-modal learning of video and text plays a key role in Video Question Answering (VideoQA).
We propose a visual-text attention mechanism that leverages Contrastive Language-Image Pre-training (CLIP), trained on large-scale general-domain language-image pairs (see the sketch after this list).
The proposed method is evaluated on MSVD-QA and MSRVTT-QA datasets, and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-03-06T13:49:15Z) - Vision-Language Pre-training: Basics, Recent Advances, and Future Trends [158.34830433299268]
Vision-language pre-training methods for multimodal intelligence have been developed in the last few years.
For each category of methods, we present a comprehensive review of state-of-the-art approaches and discuss the progress made and the challenges that remain.
In addition, we discuss advanced topics being actively explored in the research community, such as big foundation models, unified modeling, in-context few-shot learning, knowledge, robustness, and computer vision in the wild, to name a few.
arXiv Detail & Related papers (2022-10-17T17:11:36Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been dedicated to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.