CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
- URL: http://arxiv.org/abs/2205.00823v1
- Date: Mon, 2 May 2022 12:02:09 GMT
- Title: CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
- Authors: Shuai Zhao and Linchao Zhu and Xiaohan Wang and Yi Yang
- Abstract summary: In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
- Score: 67.21528544724546
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, large-scale pre-training methods like CLIP have made great progress
in multi-modal research such as text-video retrieval. In CLIP, transformers are
vital for modeling complex multi-modal relations. However, in the vision
transformer of CLIP, the essential visual tokenization process, which produces
discrete visual token sequences, generates many homogeneous tokens due to the
redundant nature of consecutive and similar frames in videos. This
significantly increases computation costs and hinders the deployment of video
retrieval models in web applications. In this paper, to reduce the number of
redundant video tokens, we design a multi-segment token clustering algorithm to
find the most representative tokens and drop the non-essential ones. As the
frame redundancy occurs mostly in consecutive frames, we divide videos into
multiple segments and conduct segment-level clustering. Center tokens from each
segment are later concatenated into a new sequence, while their original
spatial-temporal relations are well maintained. We instantiate two clustering
algorithms to efficiently find deterministic medoids and iteratively partition
groups in high dimensional space. Through this token clustering and center
selection procedure, we successfully reduce computation costs by removing
redundant visual tokens. This method further enhances segment-level semantic
alignment between video and text representations, enforcing the spatio-temporal
interactions of tokens from within-segment frames. Our method, coined as
CenterCLIP, surpasses the existing state-of-the-art by a large margin on typical
text-video benchmarks, while reducing the training memory cost by 35% and
accelerating the inference speed by 14% in the best case. The code is
available at https://github.com/mzhaoshuai/CenterCLIP.
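As a rough illustration of the segment-level clustering described in the abstract, below is a minimal PyTorch sketch. It is not the official CenterCLIP implementation: the plain alternating k-medoids loop, the segment length, and the number of kept centers per segment are illustrative assumptions (the paper instantiates its own deterministic and iterative clustering variants).

```python
# Minimal sketch (not the official CenterCLIP code): segment-level k-medoids
# clustering of a video's patch tokens, keeping only the center tokens.
import torch


def kmedoids(x: torch.Tensor, k: int, iters: int = 10) -> torch.Tensor:
    """Return indices of k medoid tokens for x of shape (n, d)."""
    dist = torch.cdist(x, x)                               # (n, n) pairwise L2 distances
    medoids = torch.linspace(0, x.size(0) - 1, k).long()   # deterministic, evenly spaced init
    for _ in range(iters):
        assign = dist[:, medoids].argmin(dim=1)            # nearest medoid for every token
        new_medoids = medoids.clone()
        for c in range(k):
            members = (assign == c).nonzero(as_tuple=True)[0]
            if members.numel() == 0:
                continue
            # the new medoid is the member closest to all other members of its cluster
            within = dist[members][:, members].sum(dim=1)
            new_medoids[c] = members[within.argmin()]
        if torch.equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids


def cluster_video_tokens(tokens: torch.Tensor, segment_len: int = 4,
                         centers_per_segment: int = 49) -> torch.Tensor:
    """tokens: (num_frames, patches_per_frame, dim) patch embeddings of one video.

    Splits the frames into consecutive segments, keeps only the medoid (center)
    tokens inside each segment, and concatenates them in temporal order so the
    coarse spatio-temporal layout is preserved.
    """
    num_frames, _, dim = tokens.shape
    kept = []
    for start in range(0, num_frames, segment_len):
        seg = tokens[start:start + segment_len].reshape(-1, dim)   # (seg_len * patches, dim)
        idx = kmedoids(seg, k=min(centers_per_segment, seg.size(0)))
        kept.append(seg[idx.sort().values])                        # keep within-segment order
    return torch.cat(kept, dim=0)                                  # (num_segments * k, dim)


# Example: 12 frames x 50 patch tokens of width 512 -> 3 segments x 49 centers each.
centers = cluster_video_tokens(torch.randn(12, 50, 512))
```

In the full model the surviving center tokens would be fed through the remaining transformer blocks in place of the original token sequence; the sketch only covers the selection step.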
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences.
Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames.
Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z)
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets a new state of the art on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Dense Video Captioning Using Unsupervised Semantic Information [2.022555840231001]
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events.
We split a long video into short frame sequences to extract their latent representation with three-dimensional convolutional neural networks.
We demonstrate how this representation can leverage the performance of the dense video captioning task in a scenario with only visual features.
arXiv Detail & Related papers (2021-12-15T20:03:42Z)
- Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
arXiv Detail & Related papers (2021-06-09T16:50:57Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
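The last related paper above describes grouping semantically consistent frames with temporally weighted hierarchical clustering. As a rough sketch of what such a grouping could look like, and only as an assumption rather than that paper's actual method, per-frame features can be clustered agglomeratively under a distance that mixes feature dissimilarity with temporal separation; the mixing weight `alpha` and the average-linkage choice below are placeholders.

```python
# Illustrative sketch only (not the paper's exact algorithm): agglomerative
# clustering of per-frame features under a temporally weighted distance.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform


def segment_frames(features: np.ndarray, num_segments: int, alpha: float = 0.1) -> np.ndarray:
    """features: (num_frames, dim) per-frame embeddings; returns one segment label per frame."""
    num_frames = features.shape[0]
    feat_dist = squareform(pdist(features, metric="euclidean"))           # (T, T) feature distances
    t = np.arange(num_frames, dtype=np.float64)
    time_dist = np.abs(t[:, None] - t[None, :]) / max(num_frames - 1, 1)  # normalized |i - j|
    # temporally weighted distance: visually similar *and* nearby frames merge first
    dist = feat_dist + alpha * feat_dist.max() * time_dist
    z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(z, t=num_segments, criterion="maxclust")


# Example: 120 random frame features grouped into 5 temporally coherent segments.
labels = segment_frames(np.random.rand(120, 256), num_segments=5)
```

Because the temporal term penalizes merging distant frames, the resulting clusters tend to form contiguous spans, which is what an action segmentation needs.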
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.