Weakly Supervised Video Representation Learning with Unaligned Text for
Sequential Videos
- URL: http://arxiv.org/abs/2303.12370v2
- Date: Tue, 28 Mar 2023 04:43:12 GMT
- Title: Weakly Supervised Video Representation Learning with Unaligned Text for
Sequential Videos
- Authors: Sixun Dong, Huazhang Hu, Dongze Lian, Weixin Luo, Yicheng Qian,
Shenghua Gao
- Abstract summary: This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided.
We use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video.
Experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin.
- Score: 39.42509966219001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential video understanding, as an emerging video understanding task,
has attracted considerable attention from researchers because of its goal-oriented nature. This
paper studies weakly supervised sequential video understanding where the
accurate timestamp-level text-video alignment is not provided. We solve this
task by borrowing ideas from CLIP. Specifically, we use a transformer to
aggregate frame-level features for video representation and use a pre-trained
text encoder to encode the texts corresponding to each action and the whole
video, respectively. To model the correspondence between text and video, we
propose a multiple granularity loss, where the video-paragraph contrastive loss
enforces matching between the whole video and the complete script, and a
fine-grained frame-sentence contrastive loss enforces the matching between each
action and its description. As the frame-sentence correspondence is not
available, we propose to use the fact that video actions happen sequentially in
the temporal domain to generate pseudo frame-sentence correspondence and
supervise the network training with the pseudo labels. Extensive experiments on
video sequence verification and text-to-video matching show that our method
outperforms baselines by a large margin, which validates the effectiveness of
our proposed approach. Code is available at https://github.com/svip-lab/WeakSVR
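
A minimal sketch of a multiple-granularity contrastive objective of the kind described in the abstract is given below. It is not the authors' released implementation (see the repository linked above for that); the symmetric InfoNCE form, the uniform temporal chunking used to build pseudo frame-sentence pairs, and all names and weights are illustrative assumptions.

```python
# Minimal sketch, assuming a PyTorch setup. This is NOT the authors' released
# code (see https://github.com/svip-lab/WeakSVR for that). The symmetric
# InfoNCE form, the uniform temporal chunking used to build pseudo
# frame-sentence pairs, and all names/weights here are illustrative assumptions.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two equally sized batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def pseudo_frame_sentence_pairs(frame_feats: torch.Tensor, num_sentences: int) -> torch.Tensor:
    """Assumed pseudo-alignment: because the actions happen in temporal order, split
    the T frames into num_sentences contiguous chunks and mean-pool each chunk, so
    that chunk i is treated as matching sentence i (assumes T >= num_sentences)."""
    chunks = torch.chunk(frame_feats, num_sentences, dim=0)  # list of (T_i, D)
    return torch.stack([c.mean(dim=0) for c in chunks])      # (num_sentences, D)


def multi_granularity_loss(video_embs, paragraph_embs, frame_feats, sentence_embs, alpha=1.0):
    """video_embs, paragraph_embs: (B, D) pooled video / full-script embeddings.
    frame_feats: list of B tensors of shape (T_b, D) from the frame aggregator.
    sentence_embs: list of B tensors of shape (K_b, D) from the text encoder."""
    # Coarse granularity: whole video <-> whole script, contrasted across the batch.
    loss_vp = info_nce(video_embs, paragraph_embs)
    # Fine granularity: pseudo-aligned frame segments <-> action sentences, per video.
    loss_fs = 0.0
    for frames, sents in zip(frame_feats, sentence_embs):
        segments = pseudo_frame_sentence_pairs(frames, sents.size(0))
        loss_fs = loss_fs + info_nce(segments, sents)
    loss_fs = loss_fs / len(frame_feats)
    return loss_vp + alpha * loss_fs
```

In this reading, the coarse term needs only video-level pairing, while the fine term leans entirely on the sequential ordering of actions to stand in for the missing frame-sentence labels.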
Related papers
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using language as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Text-Video Retrieval via Variational Multi-Modal Hypergraph Networks [25.96897989272303]
The main obstacle in text-video retrieval is the semantic gap between the textual nature of queries and the visual richness of video content.
We propose chunk-level text-video matching, where the query chunks are extracted to describe a specific retrieval unit.
We formulate the chunk-level matching as n-ary correlations modeling between words of the query and frames of the video.
arXiv Detail & Related papers (2024-01-06T09:38:55Z)
- Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performance on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- TempCLR: Temporal Alignment Representation with Contrastive Learning [35.12182087403215]
We propose a contrastive learning framework TempCLR to compare the full video and the paragraph explicitly.
In addition to pre-training on videos and paragraphs, the approach also generalizes to matching between video instances.
arXiv Detail & Related papers (2022-12-28T08:10:31Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence; a minimal sketch of this style of frame selection appears after this list.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Video and Text Matching with Conditioned Embeddings [81.81028089100727]
We present a method for matching a text sentence from a given corpus to a given video clip and vice versa.
In this work, we encode the dataset items in a way that takes the query's relevant information into account.
We show that our conditioned representation can be transferred to video-guided machine translation, where we improved the current results on VATEX.
arXiv Detail & Related papers (2021-10-21T17:31:50Z)
- VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding [13.640902299569008]
We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding.
VideoCLIP trains a transformer for video and text by contrasting temporally overlapping positive video-text pairs with hard negatives from nearest neighbor retrieval.
arXiv Detail & Related papers (2021-09-28T23:01:51Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
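
As a companion illustration, here is a minimal sketch of the similarity-based frame selection mentioned in the FineCo entry above. It is not FineCo's implementation; the top-k heuristic, function name, and tensor shapes are assumptions for illustration only.

```python
# Minimal sketch, assuming pre-computed embeddings; not FineCo's actual code.
import torch
import torch.nn.functional as F


def select_frames_by_text(frame_embs: torch.Tensor, text_emb: torch.Tensor, k: int = 8) -> torch.Tensor:
    """frame_embs: (T, D) per-frame embeddings; text_emb: (D,) sentence embedding.
    Returns indices of the k frames most similar to the text, in temporal order."""
    sims = F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1)  # (T,)
    k = min(k, frame_embs.size(0))
    top = sims.topk(k).indices
    return top.sort().values  # restore temporal order among the selected frames


# A fine-grained contrastive objective would then operate only on the selected
# frames, e.g. frame_embs[select_frames_by_text(frame_embs, text_emb)].
```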
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.