Spatio-Temporal Ranked-Attention Networks for Video Captioning
- URL: http://arxiv.org/abs/2001.06127v1
- Date: Fri, 17 Jan 2020 01:00:45 GMT
- Title: Spatio-Temporal Ranked-Attention Networks for Video Captioning
- Authors: Anoop Cherian, Jue Wang, Chiori Hori, Tim K. Marks
- Abstract summary: We propose a model that combines spatial and temporal attention to videos in two different orders.
We provide experiments on two benchmark datasets: MSVD and MSR-VTT.
Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
- Score: 34.05025890230047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating video descriptions automatically is a challenging task that
involves a complex interplay between spatio-temporal visual features and
language models. Given that videos consist of spatial (frame-level) features
and their temporal evolutions, an effective captioning model should be able to
attend to these different cues selectively. To this end, we propose a
Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned
on the language state, hierarchically combines spatial and temporal attention
to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which
first attends to regions that have temporal evolution, then temporally pools
the features from these regions; and (ii) a temporo-spatial (TS) sub-model,
which first decides a single frame to attend to, then applies spatial attention
within that frame. We propose a novel LSTM-based temporal ranking function,
which we call ranked attention, for the ST model to capture action dynamics.
Our entire framework is trained end-to-end. We provide experiments on two
benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy
between the ST and TS modules, outperforming recent state-of-the-art methods.
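For a concrete picture of the two attention orders described above, the sketch below gives one plausible reading of the abstract in PyTorch. The module name, variable names, and the simple dot-product attention are illustrative assumptions rather than the authors' exact formulation; in particular, the paper's LSTM-based "ranked attention" temporal pooling is approximated here by a plain LSTM scorer over time, and the TS branch uses a soft frame selection instead of a hard single-frame choice.

```python
# Minimal sketch (assumptions noted above): spatial-then-temporal (ST) and
# temporal-then-spatial (TS) attention over region features, both conditioned
# on the caption decoder's language state.
import torch
import torch.nn as nn
import torch.nn.functional as F


class STaTSSketch(nn.Module):
    def __init__(self, feat_dim: int, lang_dim: int, hid_dim: int = 256):
        super().__init__()
        self.q_proj = nn.Linear(lang_dim, hid_dim)   # language state -> query
        self.k_proj = nn.Linear(feat_dim, hid_dim)   # video features -> keys
        # Stand-in for the paper's LSTM-based temporal ranking function:
        # an LSTM reads the spatially pooled sequence and scores each step.
        self.rank_lstm = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.rank_score = nn.Linear(hid_dim, 1)

    def _attend(self, query, feats):
        # feats: (B, N, D); query: (B, H) -> softmax weights over the N items
        scores = torch.einsum('bh,bnh->bn', query, self.k_proj(feats))
        return F.softmax(scores, dim=-1)

    def forward(self, feats, lang_state):
        # feats: (B, T, R, D) region features for T frames with R regions each
        # lang_state: (B, L) current language-model (decoder) hidden state
        B, T, R, D = feats.shape
        q = self.q_proj(lang_state)

        # --- ST branch: spatial attention per frame, then temporal pooling ---
        spatial_w = self._attend(
            q.unsqueeze(1).expand(B, T, -1).reshape(B * T, -1),
            feats.reshape(B * T, R, D),
        ).reshape(B, T, R)
        per_frame = torch.einsum('btr,btrd->btd', spatial_w, feats)  # (B, T, D)
        rank_h, _ = self.rank_lstm(per_frame)                        # (B, T, H)
        temporal_w = F.softmax(self.rank_score(rank_h).squeeze(-1), dim=-1)
        st_ctx = torch.einsum('bt,btd->bd', temporal_w, per_frame)

        # --- TS branch: select a frame first (softly), then attend spatially ---
        frame_w = self._attend(q, feats.mean(dim=2))                 # (B, T)
        soft_frame = torch.einsum('bt,btrd->brd', frame_w, feats)    # (B, R, D)
        region_w = self._attend(q, soft_frame)                       # (B, R)
        ts_ctx = torch.einsum('br,brd->bd', region_w, soft_frame)

        # The caption decoder would consume a fusion of the two contexts.
        return st_ctx, ts_ctx


# Example with made-up sizes: 8 frames, 16 regions, 512-d features, 1024-d decoder state.
model = STaTSSketch(feat_dim=512, lang_dim=1024)
st_ctx, ts_ctx = model(torch.randn(2, 8, 16, 512), torch.randn(2, 1024))
```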
Related papers
- Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves 13 existing strong-performing video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z)
- Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition [16.287968292213563]
We propose SloshNet, a new framework that revisits the spatial and temporal modeling for few-shot action recognition in a finer manner.
We extensively validate the proposed SloshNet on four few-shot action recognition datasets, including Something-Something V2, Kinetics, UCF101, and HMDB51.
arXiv Detail & Related papers (2023-01-19T08:34:04Z)
- TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z)
- Exploiting long-term temporal dynamics for video captioning [40.15826846670479]
We propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences.
Experimental results obtained on two public video captioning benchmarks indicate that our TS-LSTM outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-02-22T11:40:09Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to ineffective tube pre-generation and the lack of object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)