TVPR: Text-to-Video Person Retrieval and a New Benchmark
- URL: http://arxiv.org/abs/2307.07184v2
- Date: Fri, 2 Feb 2024 08:05:10 GMT
- Title: TVPR: Text-to-Video Person Retrieval and a New Benchmark
- Authors: Fan Ni, Xu Zhang, Jianhui Wu, Guan-Nan Dong, Aichun Zhu, Hui Liu, Yue
Zhang
- Abstract summary: We propose a new task called Text-to-Video Person Retrieval (TVPR).
TVPRN acquires video representations by fusing visual and motion representations of person videos.
TVPRN has achieved state-of-the-art performance on the TVPReid dataset.
- Score: 19.554989977778312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing methods for text-based person retrieval focus on text-to-image person retrieval. However, because isolated frames provide no dynamic information, performance suffers when the person is occluded in isolated frames or when the textual description contains variable motion details. In this paper, we propose a new task called Text-to-Video Person Retrieval (TVPR), which aims to overcome the limitations of isolated frames. Since no existing dataset or benchmark describes person videos with natural language, we construct a large-scale cross-modal person video dataset with detailed natural-language annotations covering a person's appearance, actions, and interactions with the environment, termed the Text-to-Video Person Re-identification (TVPReid) dataset. To address this task, we propose a Text-to-Video Person Retrieval Network (TVPRN). Specifically, TVPRN acquires video representations by fusing the visual and motion representations of person videos, which allows it to handle temporal occlusion and the absence of variable motion details in isolated frames. Meanwhile, we employ pre-trained BERT to obtain caption representations and model the relationship between caption and video representations to identify the most relevant person videos. To evaluate the effectiveness of the proposed TVPRN, extensive experiments have been conducted on the TVPReid dataset. To the best of our knowledge, TVPRN is the first successful attempt to use video for the text-based person retrieval task, and it achieves state-of-the-art performance on the TVPReid dataset. The TVPReid dataset will be publicly available to benefit future research.
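The abstract describes the TVPRN pipeline only at a high level. As a rough illustration of how such a pipeline might fit together, the PyTorch sketch below fuses per-video visual and motion features, encodes the caption with pre-trained BERT, and ranks a gallery of videos by cosine similarity. The encoders, fusion layer, dimensions, and pooling choices are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch of a TVPRN-style text-to-video person retrieval pipeline.
# The fusion design, dimensions, and module names are illustrative guesses;
# the paper does not publish this exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

class VideoEncoder(nn.Module):
    """Fuses visual and motion representations of a person video (assumed design)."""
    def __init__(self, feat_dim: int = 512, embed_dim: int = 256):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, embed_dim)
        self.motion_proj = nn.Linear(feat_dim, embed_dim)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, frame_feats, motion_feats):
        # frame_feats:  (B, T, feat_dim), e.g. per-frame CNN features
        # motion_feats: (B, T, feat_dim), e.g. optical-flow or 3D-CNN features
        v = self.visual_proj(frame_feats).mean(dim=1)   # temporal mean pooling
        m = self.motion_proj(motion_feats).mean(dim=1)
        return self.fuse(torch.cat([v, m], dim=-1))     # (B, embed_dim)

class CaptionEncoder(nn.Module):
    """Projects the pre-trained BERT [CLS] embedding into the joint space."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.proj(out.last_hidden_state[:, 0])   # (B, embed_dim)

@torch.no_grad()
def rank_videos(caption_emb, video_embs):
    """Ranks candidate person videos by cosine similarity to one caption."""
    sims = F.normalize(video_embs, dim=-1) @ F.normalize(caption_emb, dim=-1).squeeze(0)
    return sims.argsort(descending=True)                # indices, best match first

# Example query: encode one caption and rank a gallery of video embeddings.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["a woman in a red coat walks left, then starts running"],
                  return_tensors="pt")
caption_emb = CaptionEncoder()(batch["input_ids"], batch["attention_mask"])  # (1, 256)
gallery = torch.randn(100, 256)  # placeholder for pre-computed fused video embeddings
print(rank_videos(caption_emb, gallery)[:10])
```

In practice, a model along these lines would be trained with a ranking or contrastive loss so that matched caption-video pairs score above mismatched ones.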
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The plain, simple text descriptions and the visual-only focus of current language-video tasks limit performance on real-world natural-language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated and unpaired data, which uses only text queries together with uncurated web videos during training.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension [49.74647080936875]
We introduce a large-scale, cross-modal video retrieval dataset with text reading comprehension, TextVR.
The proposed TextVR requires one unified cross-modal model to recognize and comprehend texts, relate them to the visual context, and decide what text semantic information is vital for the video retrieval task.
arXiv Detail & Related papers (2023-05-05T08:00:14Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pretext task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present the CLIP2Video network, which transfers an image-language pre-training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)