HANet: Hierarchical Alignment Networks for Video-Text Retrieval
- URL: http://arxiv.org/abs/2107.12059v1
- Date: Mon, 26 Jul 2021 09:28:50 GMT
- Title: HANet: Hierarchical Alignment Networks for Video-Text Retrieval
- Authors: Peng Wu, Xiangteng He, Mingqian Tang, Yiliang Lv, Jing Liu
- Abstract summary: Video-text retrieval is an important yet challenging task in vision-language understanding.
Most current works simply measure the video-text similarity based on video-level and text-level embeddings.
We propose a Hierarchical Alignment Network (HANet) to align different level representations for video-text matching.
- Score: 15.91922397215452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-text retrieval is an important yet challenging task in vision-language
understanding, which aims to learn a joint embedding space where related video
and text instances are close to each other. Most current works simply measure
the video-text similarity based on video-level and text-level embeddings.
However, neglecting more fine-grained or local information leads to insufficient
representations. Some works exploit local details by disentangling sentences but
overlook the corresponding videos, causing an asymmetry between video and text
representations. To address the above limitations, we
propose a Hierarchical Alignment Network (HANet) to align different level
representations for video-text matching. Specifically, we first decompose video
and text into three semantic levels, namely event (video and text), action
(motion and verb), and entity (appearance and noun). Based on these, we
naturally construct hierarchical representations in an individual-local-global
manner, where the individual level focuses on the alignment between frame and
word, the local level on the alignment between video clip and textual context,
and the global level on the alignment between the whole video and text.
Alignments at different levels capture fine-to-coarse correlations between video
and text and take advantage of the complementary information among the three
semantic levels. Moreover, HANet is richly interpretable by
explicitly learning key semantic concepts. Extensive experiments on two public
datasets, namely MSR-VTT and VATEX, show that the proposed HANet outperforms other
state-of-the-art methods, demonstrating the effectiveness of hierarchical
representation and alignment. Our code is publicly available.
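To make the individual-local-global idea concrete, the sketch below fuses alignment scores at the three levels into a single video-text similarity. It is a minimal PyTorch illustration assuming pre-computed frame, clip, video, word, phrase, and sentence embeddings; the pooling choices, fusion weights, and all names (e.g. `hierarchical_similarity`, `cosine_sim`) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of hierarchical video-text alignment in the spirit of HANet.
# All module names, dimensions, and the fusion of the three similarities are
# illustrative assumptions, not the authors' exact implementation.
import torch
import torch.nn.functional as F


def cosine_sim(a, b):
    """Pairwise cosine similarity between two sets of row embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    return a @ b.t()


def hierarchical_similarity(frame_emb, word_emb, clip_emb, phrase_emb,
                            video_emb, text_emb, weights=(1.0, 1.0, 1.0)):
    """Combine individual (frame-word), local (clip-phrase) and global
    (video-text) alignment scores into one video-text similarity."""
    # Individual level: best-matching word for every frame, averaged over frames.
    s_ind = cosine_sim(frame_emb, word_emb).max(dim=1).values.mean()
    # Local level: align video clips with textual context (verb/noun phrases).
    s_loc = cosine_sim(clip_emb, phrase_emb).max(dim=1).values.mean()
    # Global level: similarity between whole-video and whole-sentence embeddings.
    s_glo = F.cosine_similarity(video_emb, text_emb, dim=-1)
    w_ind, w_loc, w_glo = weights
    return w_ind * s_ind + w_loc * s_loc + w_glo * s_glo


if __name__ == "__main__":
    d = 256
    score = hierarchical_similarity(
        frame_emb=torch.randn(30, d),   # 30 frames
        word_emb=torch.randn(12, d),    # 12 words
        clip_emb=torch.randn(5, d),     # 5 clips (action level)
        phrase_emb=torch.randn(4, d),   # 4 verb/noun phrases
        video_emb=torch.randn(d),       # whole-video embedding
        text_emb=torch.randn(d),        # whole-sentence embedding
    )
    print(score.item())
```

In practice the three weights would be tuned (or learned) on a validation set; the max-then-mean pooling above is one common way to aggregate fine-grained matches and may differ from the paper's attention-based alignment.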
Related papers
- Weakly Supervised Video Representation Learning with Unaligned Text for
Sequential Videos [39.42509966219001]
This paper studies weakly supervised sequential video understanding where the accurate time-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
arXiv Detail & Related papers (2023-03-22T08:13:25Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework extends easily to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process.
Specifically, we first achieve local-to-local alignment at the fragment level, followed by performing global-to-local and global-to-global alignment at the context level sequentially.
arXiv Detail & Related papers (2021-06-11T17:05:56Z) - T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval [59.990432265734384]
Text-video retrieval is a challenging task that aims to search relevant video contents based on natural language descriptions.
Most existing methods only consider the global cross-modal similarity and overlook the local details.
In this paper, we design an efficient global-local alignment method.
We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state-of-the-art by a clear margin.
arXiv Detail & Related papers (2021-04-20T15:26:24Z) - Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web.
To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)