HunYuan_tvr for Text-Video Retrieval
- URL: http://arxiv.org/abs/2204.03382v1
- Date: Thu, 7 Apr 2022 11:59:36 GMT
- Title: HunYuan_tvr for Text-Video Retrieval
- Authors: Shaobo Min, Weijie Kong, Rong-Cheng Tu, Dihong Gong, Chengfei Cai,
Wenzhe Zhao, Chenyang Liu, Sixiao Zheng, Hongfa Wang, Zhifeng Li, Wei Liu
- Abstract summary: HunYuan_tvr explores hierarchical cross-modal interactions by simultaneously exploring video-sentence, clip-phrase, and frame-word relationships.
HunYuan_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet respectively.
- Score: 23.650824732136158
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-Video Retrieval plays an important role in multi-modal understanding and
has attracted increasing attention in recent years. Most existing methods focus
on constructing contrastive pairs between whole videos and complete caption
sentences, while ignoring fine-grained cross-modal relationships, e.g., short
clips and phrases or single frame and word. In this paper, we propose a novel
method, named HunYuan_tvr, to explore hierarchical cross-modal interactions by
simultaneously exploring video-sentence, clip-phrase, and frame-word
relationships. Considering intrinsic semantic relations between frames,
HunYuan_tvr first performs self-attention to explore frame-wise correlations
and adaptively clusters correlated frames into clip-level representations.
Then, the clip-wise correlation is explored to aggregate clip representations
into a compact one to describe the video globally. In this way, we can
construct hierarchical video representations for frame-clip-video
granularities, and also explore word-wise correlations to form
word-phrase-sentence embeddings for the text modality. Finally, hierarchical
contrastive learning is designed to explore cross-modal
relationships, i.e., frame-word, clip-phrase, and video-sentence, which
enables HunYuan_tvr to achieve a comprehensive multi-modal understanding.
Further boosted by adaptive label denoising and marginal sample enhancement,
HunYuan_tvr obtains new state-of-the-art results on various benchmarks, e.g.,
Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC,
DiDemo, and ActivityNet respectively.
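To make the hierarchical contrastive objective concrete, below is a minimal sketch of combining contrastive terms at the frame-word, clip-phrase, and video-sentence granularities. The function names, the mean pooling of the fine-grained features, the symmetric InfoNCE form, the temperature, and the loss weights are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch of hierarchical contrastive learning over
# video-sentence, clip-phrase, and frame-word similarities.
# Names, shapes, and the mean-pooled fine-grained matching below are
# assumptions for illustration, not HunYuan_tvr's actual implementation.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.05):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_loss(frame_emb, clip_emb, video_emb,
                      word_emb, phrase_emb, sent_emb,
                      weights=(1.0, 1.0, 1.0)):
    """Combine contrastive terms at frame-word, clip-phrase, and
    video-sentence granularities. Fine-grained levels are mean-pooled
    here purely to keep the sketch short."""
    frame_level = info_nce(frame_emb.mean(dim=1), word_emb.mean(dim=1))
    clip_level = info_nce(clip_emb.mean(dim=1), phrase_emb.mean(dim=1))
    video_level = info_nce(video_emb, sent_emb)
    w1, w2, w3 = weights
    return w1 * frame_level + w2 * clip_level + w3 * video_level

# Example with random features: batch of 8 video-caption pairs, 12 frames
# and words, 4 clips, 6 phrases, 512-d embeddings.
if __name__ == "__main__":
    B, T, C, P, D = 8, 12, 4, 6, 512
    loss = hierarchical_loss(torch.randn(B, T, D), torch.randn(B, C, D),
                             torch.randn(B, D),
                             torch.randn(B, T, D), torch.randn(B, P, D),
                             torch.randn(B, D))
    print(loss.item())
```

Weighting the three terms separately allows coarse video-sentence matching and fine-grained frame- and clip-level alignment to be balanced during training.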
Related papers
- SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net)
First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.
Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z) - VideoXum: Cross-modal Visual and Textural Summarization of Videos [54.0985975755278]
We propose a new joint video and text summarization task.
The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video.
The generated shortened video clip and text narratives should be semantically well aligned.
arXiv Detail & Related papers (2023-03-21T17:51:23Z) - Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework extends easily to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z) - Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene
Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z) - Visual Spatio-temporal Relation-enhanced Network for Cross-modal
Text-Video Retrieval [17.443195531553474]
Cross-modal retrieval of texts and videos aims to understand the correspondence between vision and language.
We propose a Visual Spatio-temporal Relation-enhanced semantic network (CNN-SRNet), a cross-modal retrieval framework.
Experiments are conducted on both MSR-VTT and MSVD datasets.
arXiv Detail & Related papers (2021-10-29T08:23:40Z) - Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)