T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
- URL: http://arxiv.org/abs/2104.10054v1
- Date: Tue, 20 Apr 2021 15:26:24 GMT
- Title: T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
- Authors: Xiaohan Wang, Linchao Zhu, Yi Yang
- Abstract summary: Text-video retrieval is a challenging task that aims to search relevant video contents based on natural language descriptions.
Most existing methods only consider the global cross-modal similarity and overlook the local details.
In this paper, we design an efficient global-local alignment method.
We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state-of-the-art by a clear margin.
- Score: 59.990432265734384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-video retrieval is a challenging task that aims to search relevant video
contents based on natural language descriptions. The key to this problem is to
measure text-video similarities in a joint embedding space. However, most
existing methods only consider the global cross-modal similarity and overlook
the local details. Some works incorporate local comparisons through
cross-modal local matching and reasoning, but these complex operations introduce
substantial computational overhead. In this paper, we design an efficient global-local
alignment method. The multi-modal video sequences and text features are
adaptively aggregated with a set of shared semantic centers. The local
cross-modal similarities are computed between the video feature and text
feature within the same center. This design enables meticulous local
comparison and reduces the computational cost of the interaction between each
text-video pair. Moreover, a global alignment method is proposed to provide a
global cross-modal measurement that is complementary to the local perspective.
The globally aggregated visual features also provide additional supervision,
which is indispensable for optimizing the learnable semantic centers.
We achieve consistent improvements on three standard text-video retrieval
benchmarks and outperform the state-of-the-art by a clear margin.
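The shared-center alignment described in the abstract can be illustrated with a short sketch. The PyTorch-style code below is a minimal, hypothetical example of NetVLAD-style soft assignment of video and text tokens to a set of learnable centers shared by both modalities, with local similarity computed only between features that fall in the same center; the class name, dimensions, and the exact aggregation and global-pooling scheme are assumptions for illustration, not the authors' released implementation.
```python
# Minimal sketch of shared-center (VLAD-style) aggregation and per-center
# local similarity. Names, dimensions, and the soft-assignment scheme are
# illustrative assumptions, not the exact T2VLAD implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedCenterAlignment(nn.Module):
    def __init__(self, feat_dim=512, num_centers=8):
        super().__init__()
        # Semantic centers shared by the video and text branches.
        self.centers = nn.Parameter(torch.randn(num_centers, feat_dim) * 0.02)

    def aggregate(self, tokens):
        # tokens: (batch, seq_len, feat_dim) video or text token features.
        # Softly assign each token to the shared centers, then pool residuals.
        assign = F.softmax(tokens @ self.centers.t(), dim=-1)        # (B, L, K)
        pooled = assign.transpose(1, 2) @ tokens                     # (B, K, D)
        residual = pooled - assign.sum(1).unsqueeze(-1) * self.centers
        return F.normalize(residual, dim=-1)                         # (B, K, D)

    def forward(self, video_tokens, text_tokens):
        v_local = self.aggregate(video_tokens)   # (B, K, D)
        t_local = self.aggregate(text_tokens)    # (B, K, D)
        # Local similarity: compare video and text features only within the
        # same center, then average over centers (no all-pairs token matching).
        local_sim = (v_local * t_local).sum(-1).mean(-1)             # (B,)
        # Global similarity: cosine between center-pooled global features.
        v_global = F.normalize(v_local.mean(1), dim=-1)
        t_global = F.normalize(t_local.mean(1), dim=-1)
        global_sim = (v_global * t_global).sum(-1)
        return global_sim + local_sim
```
This sketch scores only matched text-video pairs; for retrieval, the same per-center features would be compared across every text-video pairing in a batch to build the full similarity matrix used for ranking.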
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) which explicitly enhances the dense associations in both inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval [66.2075707179047]
We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels.
We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels.
Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
- HANet: Hierarchical Alignment Networks for Video-Text Retrieval [15.91922397215452]
Video-text retrieval is an important yet challenging task in vision-language understanding.
Most current works simply measure the video-text similarity based on video-level and text-level embeddings.
We propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching.
arXiv Detail & Related papers (2021-07-26T09:28:50Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.