TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
- URL: http://arxiv.org/abs/2207.07852v1
- Date: Sat, 16 Jul 2022 06:50:27 GMT
- Title: TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval
- Authors: Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao and Qin Jin
- Abstract summary: We propose Token Shift and Selection Network (TS2-Net), a novel token shift and selection transformer architecture.
Based on thorough experiments, the proposed TS2-Net achieves state-of-the-art performance on major text-video retrieval benchmarks.
- Score: 42.0544426476143
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-Video retrieval is a task of great practical value and has received
increasing attention; learning spatial-temporal video representation is one of
its research hotspots. The video encoders in state-of-the-art video retrieval
models usually adopt pre-trained vision backbones directly, with the network
structure fixed, so they cannot be further improved to produce fine-grained
spatial-temporal video representations. In this paper, we propose Token Shift and Selection Network
(TS2-Net), a novel token shift and selection transformer architecture, which
dynamically adjusts the token sequence and selects informative tokens in both
temporal and spatial dimensions from input video samples. The token shift
module temporally shifts the whole token features back-and-forth across
adjacent frames, to preserve the complete token representation and capture
subtle movements. Then the token selection module selects tokens that
contribute most to local spatial semantics. Based on thorough experiments, the
proposed TS2-Net achieves state-of-the-art performance on major text-video
retrieval benchmarks, including new records on MSRVTT, VATEX, LSMDC,
ActivityNet, and DiDeMo.
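
As a rough illustration of the two modules described in the abstract, the PyTorch sketch below shifts whole token features back and forth across adjacent frames and then keeps the highest-scoring tokens per frame. The tensor shapes, the shift ratio, the MLP scorer, and the hard top-k are illustrative assumptions, not the paper's implementation; the actual TS2-Net integrates these operations inside its transformer-based video encoder and trains the selection module jointly with the rest of the network, which a hard top-k like this does not support directly.

```python
# Minimal sketch of token shift and token selection, assuming video tokens
# of shape (B, T, N, D): batch, frames, tokens per frame, feature dim.
import torch
import torch.nn as nn


def token_shift(x: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Shift whole token features back and forth across adjacent frames.

    A fraction of token positions is rolled to the next frame, an equal
    fraction to the previous frame, and the rest stay in place, so every
    shifted token keeps its complete feature vector (no channel splitting).
    """
    B, T, N, D = x.shape
    n_shift = max(1, int(N * shift_ratio))
    out = x.clone()
    # tokens [0, n_shift) take their features from the previous frame
    out[:, :, :n_shift] = torch.roll(x[:, :, :n_shift], shifts=1, dims=1)
    # tokens [n_shift, 2*n_shift) take their features from the next frame
    out[:, :, n_shift:2 * n_shift] = torch.roll(
        x[:, :, n_shift:2 * n_shift], shifts=-1, dims=1
    )
    return out


class TokenSelection(nn.Module):
    """Keep the k tokens per frame with the highest predicted informativeness.

    A small scorer rates each token; plain top-k is used here only for
    simplicity of the sketch.
    """

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -> (B, T, k, D)
        scores = self.scorer(x).squeeze(-1)            # (B, T, N)
        topk = scores.topk(self.k, dim=-1).indices     # (B, T, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, -1, x.size(-1))
        return torch.gather(x, dim=2, index=idx)


if __name__ == "__main__":
    video_tokens = torch.randn(2, 12, 50, 512)         # toy input
    shifted = token_shift(video_tokens)
    selected = TokenSelection(dim=512, k=10)(shifted)
    print(selected.shape)                              # torch.Size([2, 12, 10, 512])
```

The point of the sketch is only the data flow: shifting exchanges complete token representations between neighboring frames to expose subtle motion, and selection then prunes each frame down to the tokens judged most informative for local spatial semantics.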
Related papers
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452] (2023-04-17)
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
- MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval [43.2299969152561] (2022-04-26)
Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tune evaluation protocols.
- Video-Text Pre-training with Learned Regions [59.30893505895156] (2021-12-02)
Video-text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose RegionLearner, a module for video-text learning that takes the structure of objects into account during pre-training on large-scale video-text pairs.
- Efficient Video Transformers with Spatial-Temporal Token Selection [68.27784654734396] (2021-11-23)
We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
Our framework achieves similar results while requiring 20% less computation.
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065] (2021-07-26)
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG).
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375] (2021-05-13)
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305] (2021-03-01)
We introduce Coarse-Fine Networks, a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703] (2020-03-09)
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
This list is automatically generated from the titles and abstracts of the papers on this site.