NITS-VC System for VATEX Video Captioning Challenge 2020
- URL: http://arxiv.org/abs/2006.04058v2
- Date: Fri, 25 Sep 2020 14:05:13 GMT
- Title: NITS-VC System for VATEX Video Captioning Challenge 2020
- Authors: Alok Singh, Thoudam Doren Singh and Sivaji Bandyopadhyay
- Abstract summary: We employ an encoder-decoder based approach in which the visual features of the video are encoded using a 3D convolutional neural network (C3D).
Our model achieves BLEU scores of 0.20 and 0.22 on the public and private test data sets, respectively.
- Score: 16.628598778804403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning is the process of summarising the content, events and
actions of a video into a short textual form, which can be helpful in many research
areas such as video-guided machine translation, video sentiment analysis and
providing aid to individuals in need. In this paper, a system description of the
framework used for the VATEX-2020 video captioning challenge is presented. We
employ an encoder-decoder based approach in which the visual features of the
video are encoded using a 3D convolutional neural network (C3D), and in the
decoding phase two Long Short-Term Memory (LSTM) recurrent networks are used, in
which the visual features and the input captions are processed separately and the
final output is generated by taking the element-wise product of the outputs of
both LSTMs. Our model achieves BLEU scores of 0.20 and 0.22 on the public and
private test data sets, respectively.
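The two-stream decoder described above can be illustrated with a minimal sketch. The code below is not the authors' implementation: the feature dimensions, mean-pooling of segment-level C3D features, and all layer and class names (e.g. TwoStreamCaptioner) are assumptions made for illustration; the abstract only specifies C3D visual features, two LSTMs processing visual and textual inputs separately, and element-wise product fusion.

```python
# Minimal PyTorch sketch of a two-LSTM captioning decoder with element-wise
# product fusion, assuming mean-pooled C3D segment features (hypothetical details).
import torch
import torch.nn as nn

class TwoStreamCaptioner(nn.Module):
    def __init__(self, vocab_size, c3d_dim=4096, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(c3d_dim, embed_dim)        # project C3D features
        self.word_embed = nn.Embedding(vocab_size, embed_dim)   # caption token embeddings
        self.visual_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.text_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, c3d_feats, captions):
        # c3d_feats: (batch, n_segments, c3d_dim) segment-level C3D features
        # captions:  (batch, seq_len) token ids of the shifted ground-truth caption
        v = self.visual_proj(c3d_feats).mean(dim=1, keepdim=True)  # (batch, 1, embed_dim)
        v = v.expand(-1, captions.size(1), -1)                     # repeat per time step
        v_out, _ = self.visual_lstm(v)                             # visual stream
        t_out, _ = self.text_lstm(self.word_embed(captions))       # language stream
        fused = v_out * t_out                                      # element-wise product fusion
        return self.out(fused)                                     # (batch, seq_len, vocab_size)
```

A usage sketch, under the same assumptions: logits = model(features, tokens[:, :-1]), trained with cross-entropy against tokens[:, 1:], with greedy or beam-search decoding at inference time.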
Related papers
- EVC-MF: End-to-end Video Captioning Network with Multi-scale Features [13.85795110061781]
We propose an end-to-end encoder-decoder-based network (EVC-MF) for video captioning.
It efficiently utilizes multi-scale visual and textual features to generate video descriptions.
The results demonstrate that EVC-MF yields competitive performance compared with state-of-the-art methods.
arXiv Detail & Related papers (2024-10-22T02:16:02Z) - When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - Text-Conditioned Resampler For Long Form Video Understanding [94.81955667020867]
We present a text-conditioned video resampler (TCR) module that uses a pre-trained visual encoder and large language model (LLM)
TCR can process more than 100 frames at a time with plain attention and without optimised implementations.
arXiv Detail & Related papers (2023-12-19T06:42:47Z) - Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, indicating the effectiveness of our method.
arXiv Detail & Related papers (2022-11-17T11:27:15Z) - Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z) - Exploiting long-term temporal dynamics for video captioning [40.15826846670479]
We propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences.
Experimental results obtained in two public video captioning benchmarks indicate that our TS-LSTM outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-02-22T11:40:09Z) - CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and enforces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z) - An optimized Capsule-LSTM model for facial expression recognition with
video sequences [0.0]
The model is composed of three networks, including capsule encoders, capsule decoders and an LSTM network.
The experimental results from the MMI dataset show that the Capsule-LSTM model can effectively improve the accuracy of video expression recognition.
arXiv Detail & Related papers (2021-05-27T10:08:05Z) - Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)