HiT: Hierarchical Transformer with Momentum Contrast for Video-Text
Retrieval
- URL: http://arxiv.org/abs/2103.15049v1
- Date: Sun, 28 Mar 2021 04:52:25 GMT
- Title: HiT: Hierarchical Transformer with Momentum Contrast for Video-Text
Retrieval
- Authors: Song Liu and Haoqi Fan and Shengsheng Qian and Yiru Chen and Wenkui
Ding and Zhongyuan Wang
- Abstract summary: We propose a novel approach named Hierarchical Transformer (HiT) for video-text retrieval.
HiT performs hierarchical cross-modal contrastive matching at the feature level and the semantic level to achieve multi-view and comprehensive retrieval results.
Inspired by MoCo, we propose Momentum Cross-modal Contrast for cross-modal learning to enable large-scale negative interactions on-the-fly.
- Score: 40.646628490887075
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video-Text Retrieval has been a hot research topic with the explosion of
multimedia data on the Internet. Transformers for video-text learning have
attracted increasing attention due to their promising performance. However,
existing cross-modal transformer approaches typically suffer from two major
limitations: 1) Limited exploitation of the transformer architecture where
different layers have different feature characteristics. 2) End-to-end training
mechanism limits negative interactions among samples in a mini-batch. In this
paper, we propose a novel approach named Hierarchical Transformer (HiT) for
video-text retrieval. HiT performs hierarchical cross-modal contrastive
matching at the feature level and the semantic level to achieve multi-view and
comprehensive retrieval results. Moreover, inspired by MoCo, we propose
Momentum Cross-modal Contrast for cross-modal learning to enable large-scale
negative interactions on-the-fly, which contributes to the generation of more
precise and discriminative representations. Experimental results on three major
Video-Text Retrieval benchmark datasets demonstrate the advantages of our
methods.
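The following is a minimal PyTorch sketch of the MoCo-style Momentum Cross-modal Contrast idea described above, not the authors' implementation: the linear projections standing in for HiT's video and text transformers, the input dimensions (2048 for video, 768 for text), the queue size, the momentum coefficient, and the temperature are all illustrative assumptions. Because HiT matches at both the feature level and the semantic level, two such losses would be combined in practice; the sketch shows a single level.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MomentumCrossModalContrast(nn.Module):
    """MoCo-style cross-modal contrast with momentum key encoders and negative queues."""

    def __init__(self, dim=256, queue_size=4096, momentum=0.999, temperature=0.07):
        super().__init__()
        self.m = momentum
        self.t = temperature
        # Query encoders (updated by back-propagation) and momentum key encoders
        # (updated by exponential moving average). The linear layers are stand-ins
        # for HiT's video and text transformers; the 2048/768 input dims are assumptions.
        self.video_q = nn.Linear(2048, dim)
        self.text_q = nn.Linear(768, dim)
        self.video_k = nn.Linear(2048, dim)
        self.text_k = nn.Linear(768, dim)
        for q, k in ((self.video_q, self.video_k), (self.text_q, self.text_k)):
            k.load_state_dict(q.state_dict())
            for p in k.parameters():
                p.requires_grad = False
        # Queues of momentum-encoded keys from past batches act as extra negatives,
        # decoupling the number of negatives from the mini-batch size.
        self.register_buffer("video_queue", F.normalize(torch.randn(dim, queue_size), dim=0))
        self.register_buffer("text_queue", F.normalize(torch.randn(dim, queue_size), dim=0))

    @torch.no_grad()
    def _momentum_update(self):
        for q, k in ((self.video_q, self.video_k), (self.text_q, self.text_k)):
            for pq, pk in zip(q.parameters(), k.parameters()):
                pk.data.mul_(self.m).add_(pq.data, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, queue, keys):
        # Simplified FIFO update: drop the oldest columns, append the newest keys.
        return torch.cat([queue[:, keys.shape[0]:], keys.t()], dim=1)

    def forward(self, video_feats, text_feats):
        v_q = F.normalize(self.video_q(video_feats), dim=1)
        t_q = F.normalize(self.text_q(text_feats), dim=1)
        with torch.no_grad():
            self._momentum_update()
            v_k = F.normalize(self.video_k(video_feats), dim=1)
            t_k = F.normalize(self.text_k(text_feats), dim=1)
        # Cross-modal InfoNCE: each video query should match its paired text key
        # against the queued text negatives, and vice versa.
        loss = 0.0
        for q, k, queue in ((v_q, t_k, self.text_queue), (t_q, v_k, self.video_queue)):
            l_pos = (q * k).sum(dim=1, keepdim=True)      # (B, 1) positive similarities
            l_neg = q @ queue.clone().detach()            # (B, K) negative similarities
            logits = torch.cat([l_pos, l_neg], dim=1) / self.t
            labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
            loss = loss + F.cross_entropy(logits, labels)
        # Refresh the queues with the newest momentum-encoded keys.
        self.video_queue = self._enqueue(self.video_queue, v_k)
        self.text_queue = self._enqueue(self.text_queue, t_k)
        return loss / 2
```

As a usage sketch, `loss = MomentumCrossModalContrast()(torch.randn(8, 2048), torch.randn(8, 768))` returns a scalar that can be back-propagated; only the query encoders receive gradients, while the key encoders and queues are updated without gradient tracking.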
Related papers
- InTraGen: Trajectory-controlled Video Generation for Object Interactions [100.79494904451246]
InTraGen is a pipeline for improved trajectory-based generation of object interaction scenarios.
Our results demonstrate improvements in both visual fidelity and quantitative performance.
arXiv Detail & Related papers (2024-11-25T14:27:50Z)
- Collaborative Three-Stream Transformers for Video Captioning [23.889653636822207]
We design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation.
COST is formed by three branches of transformers to exploit the visual-linguistic interactions of different granularities in spatial-temporal domain between videos and text.
We propose a cross-granularity attention module to align the interactions modeled by the three branches of transformers, so that the branches can support each other in exploiting the most discriminative semantic information at different granularities for accurate caption prediction.
arXiv Detail & Related papers (2023-09-18T09:33:25Z)
- Multilevel Transformer For Multimodal Emotion Recognition [6.0149102420697025]
We introduce a novel multi-granularity framework, which combines fine-grained representation with pre-trained utterance-level representation.
Inspired by Transformer TTS, we propose a multilevel transformer model to perform fine-grained multimodal emotion recognition.
arXiv Detail & Related papers (2022-10-26T10:31:24Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance with a 10,800X faster inference speed.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of transformers and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization.
To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer.
Practically, extensive experiments show that HMT surpasses most of the traditional, RNN-based and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
- TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z)
- Hierarchical Transformer Network for Utterance-level Emotion Recognition [0.0]
We address some challenges in utterance-level emotion recognition (ULER).
Unlike the traditional text classification problem, this task is supported by a limited number of datasets.
We use a pretrained language model, bidirectional encoder representations from transformers (BERT), as the lower-level transformer.
In addition, we add speaker embeddings to the model for the first time, which enables our model to capture the interaction between speakers.
arXiv Detail & Related papers (2020-02-18T13:44:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.