Exploiting long-term temporal dynamics for video captioning
- URL: http://arxiv.org/abs/2202.10828v1
- Date: Tue, 22 Feb 2022 11:40:09 GMT
- Title: Exploiting long-term temporal dynamics for video captioning
- Authors: Yuyu Guo, Jingqiu Zhang, Lianli Gao
- Abstract summary: We propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences.
Experimental results obtained in two public video captioning benchmarks indicate that our TS-LSTM outperforms the state-of-the-art methods.
- Score: 40.15826846670479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically describing videos with natural language is a fundamental
challenge for computer vision and natural language processing. Recently,
progress in this problem has been achieved through two steps: 1) employing 2-D
and/or 3-D Convolutional Neural Networks (CNNs) (e.g. VGG, ResNet or C3D) to
extract spatial and/or temporal features to encode video contents; and 2)
applying Recurrent Neural Networks (RNNs) to generate sentences to describe
events in videos. Temporal attention-based models have made considerable progress by
weighing the importance of each video frame. However, for a long video,
especially for a video which consists of a set of sub-events, we should
discover and leverage the importance of each sub-shot instead of each frame. In
this paper, we propose a novel approach, namely temporal and spatial LSTM
(TS-LSTM), which systematically exploits spatial and temporal dynamics within
video sequences. In TS-LSTM, a temporal pooling LSTM (TP-LSTM) is designed to
incorporate both spatial and temporal information to extract long-term temporal
dynamics within video sub-shots; and a stacked LSTM is introduced to generate a
list of words to describe the video. Experimental results obtained in two
public video captioning benchmarks indicate that our TS-LSTM outperforms the
state-of-the-art methods.
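To make the two-stage design concrete, below is a minimal PyTorch sketch of how a temporal pooling LSTM over sub-shot frames can feed a stacked LSTM decoder. This is an illustration of the idea as described in the abstract, not the authors' implementation: the sub-shot segmentation, feature dimensions, mean pooling, and the simplified word decoder (no attention or word-embedding feedback) are all assumptions made for clarity.

```python
# Minimal sketch of a TP-LSTM over sub-shots plus a stacked LSTM decoder.
# All dimensions and design details are illustrative assumptions.
import torch
import torch.nn as nn


class TPLSTM(nn.Module):
    """Temporal-pooling LSTM: runs an LSTM over the frames of each sub-shot
    and mean-pools the hidden states into one sub-shot descriptor."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):            # (num_subshots, frames, feat_dim)
        hidden, _ = self.lstm(frame_feats)     # (num_subshots, frames, hidden_dim)
        return hidden.mean(dim=1)              # (num_subshots, hidden_dim)


class CaptionDecoder(nn.Module):
    """Stacked LSTM that maps the sequence of sub-shot descriptors to
    per-step word logits (simplified: no word-embedding feedback)."""

    def __init__(self, hidden_dim=512, vocab_size=10000, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, subshot_feats):          # (1, num_subshots, hidden_dim)
        out, _ = self.lstm(subshot_feats)
        return self.proj(out)                  # (1, num_subshots, vocab_size)


# Toy usage: 4 sub-shots of 16 frames, each frame a 2048-d CNN feature.
frames = torch.randn(4, 16, 2048)
subshots = TPLSTM()(frames).unsqueeze(0)       # (1, 4, 512)
logits = CaptionDecoder()(subshots)
print(logits.shape)                            # torch.Size([1, 4, 10000])
```

The sketch only shows the key shift the abstract argues for: the decoder consumes sub-shot level descriptors rather than raw per-frame features; a real captioning decoder would additionally condition each step on the previously generated word.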
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, which increases the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z) - Streaming Video Model [90.24390609039335]
We propose to unify video understanding tasks into one streaming video architecture, referred to as Streaming Vision Transformer (S-ViT).
S-ViT first produces frame-level features with a memory-enabled temporally-aware spatial encoder to serve frame-based video tasks.
The efficiency and efficacy of S-ViT are demonstrated by its state-of-the-art accuracy in sequence-based action recognition.
arXiv Detail & Related papers (2023-03-30T08:51:49Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Learning to Combine the Modalities of Language and Video for Temporal Moment Localization [4.203274985072923]
Temporal moment localization aims to retrieve the best video segment matching a moment specified by a query.
We introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), by mimicking the human cognitive process of localizing temporal moments.
We also devise a two-stream attention mechanism that covers both the video features attended and those unattended by the input query, so that necessary visual information is not neglected.
arXiv Detail & Related papers (2021-09-07T08:25:45Z) - BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z) - Noisy-LSTM: Improving Temporal Awareness for Video Semantic Segmentation [29.00635219317848]
This paper presents a new model named Noisy-LSTM, which is trainable in an end-to-end manner.
We also present a simple yet effective training strategy, which replaces a frame in the video sequence with noise.
arXiv Detail & Related papers (2020-10-19T13:08:15Z) - Comparison of Spatiotemporal Networks for Learning Video Related Tasks [0.0]
Many methods for learning from sequences involve temporally processing 2D CNN features from the individual frames or directly utilizing 3D convolutions within high-performing 2D CNN architectures.
This work constructs an MNIST-based video dataset with parameters controlling relevant facets of common video-related tasks: classification, ordering, and speed estimation.
Models trained on this dataset are shown to differ in key ways depending on the task and their use of 2D convolutions, 3D convolutions, or convolutional LSTMs.
arXiv Detail & Related papers (2020-09-15T19:57:50Z) - Spatio-Temporal Ranked-Attention Networks for Video Captioning [34.05025890230047]
We propose a model that combines spatial and temporal attention to videos in two different orders: spatio-temporal (ST) and temporo-spatial (TS).
We provide experiments on two benchmark datasets: MSVD and MSR-VTT.
Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T01:00:45Z)
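The last entry above contrasts two attention orders. As a rough illustration of what an ST (spatial-then-temporal) versus TS (temporal-then-spatial) pooling order can mean, the toy snippet below applies softmax attention in the two orders over a grid of frame features. The scores, shapes, and pooling choices are placeholders; the actual ranked-attention mechanism in that paper is more involved.

```python
# Toy contrast of two attention orders over video features; all scores and
# shapes are illustrative placeholders, not the paper's mechanism.
import torch
import torch.nn.functional as F


def attend(x, scores, dim):
    """Softmax the scores along `dim` and take the weighted average of x."""
    w = F.softmax(scores, dim=dim)
    return (w.unsqueeze(-1) * x).sum(dim=dim)


# Video features: (time, height*width, channels), with toy relevance scores.
feats = torch.randn(8, 49, 512)
spatial_scores = torch.randn(8, 49)
temporal_scores = torch.randn(8)

# ST order: pool spatial locations within each frame, then pool over time.
per_frame = attend(feats, spatial_scores, dim=1)           # (8, 512)
st_context = attend(per_frame, temporal_scores, dim=0)     # (512,)

# TS order: pool over time at each location, then pool spatial locations.
time_scores = temporal_scores.unsqueeze(1).expand(8, 49)   # (8, 49)
per_location = attend(feats, time_scores, dim=0)           # (49, 512)
ts_context = attend(per_location, spatial_scores.mean(dim=0), dim=0)  # (512,)

print(st_context.shape, ts_context.shape)
```

Either ordering yields a single context vector that a captioning decoder could consume; the two orders simply emphasize different structure (which regions matter per frame vs. which frames matter per region).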