CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
- URL: http://arxiv.org/abs/2205.00823v1
- Date: Mon, 2 May 2022 12:02:09 GMT
- Title: CenterCLIP: Token Clustering for Efficient Text-Video Retrieval
- Authors: Shuai Zhao and Linchao Zhu and Xiaohan Wang and Yi Yang
- Abstract summary: In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
- Score: 67.21528544724546
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, large-scale pre-training methods like CLIP have made great progress
in multi-modal research such as text-video retrieval. In CLIP, transformers are
vital for modeling complex multi-modal relations. However, in the vision
transformer of CLIP, the essential visual tokenization process, which produces
discrete visual token sequences, generates many homogeneous tokens due to the
redundant nature of consecutive and similar frames in videos. This
significantly increases computation costs and hinders the deployment of video
retrieval models in web applications. In this paper, to reduce the number of
redundant video tokens, we design a multi-segment token clustering algorithm to
find the most representative tokens and drop the non-essential ones. As the
frame redundancy occurs mostly in consecutive frames, we divide videos into
multiple segments and conduct segment-level clustering. Center tokens from each
segment are later concatenated into a new sequence, while their original
spatial-temporal relations are well maintained. We instantiate two clustering
algorithms to efficiently find deterministic medoids and iteratively partition
groups in high-dimensional space. Through this token clustering and center
selection procedure, we successfully reduce computation costs by removing
redundant visual tokens. This method further enhances segment-level semantic
alignment between video and text representations, enforcing the spatio-temporal
interactions of tokens from within-segment frames. Our method, coined
CenterCLIP, surpasses the existing state of the art by a large margin on typical
text-video benchmarks, while reducing the training memory cost by 35% and
accelerating the inference speed by 14% in the best case. The code is
available at https://github.com/mzhaoshuai/CenterCLIP.
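For intuition, below is a minimal sketch of the segment-level token clustering idea described in the abstract, not the authors' released implementation: patch tokens are split into temporal segments, a simple k-medoids-style clustering is run inside each segment, and only the medoid (center) tokens are kept and concatenated into a shorter sequence. PyTorch is assumed, and all function and argument names here are hypothetical.

# Sketch of multi-segment token clustering with medoid selection (hypothetical names).
import torch

def segment_token_clustering(tokens, num_segments, num_centers, iters=10):
    """tokens: (T, N, D) patch tokens for T frames, N tokens per frame, dim D.
    Returns (num_segments * num_centers, D) medoid tokens."""
    T, N, D = tokens.shape
    frames_per_seg = T // num_segments          # leftover frames are dropped in this sketch
    centers = []
    for s in range(num_segments):
        seg = tokens[s * frames_per_seg:(s + 1) * frames_per_seg].reshape(-1, D)
        # k-medoids-style clustering: initialize medoids with the first k tokens,
        # then alternate assignment and medoid update.
        medoid_idx = torch.arange(num_centers, device=tokens.device)
        for _ in range(iters):
            dist = torch.cdist(seg, seg[medoid_idx])   # (M, k) distances to current medoids
            assign = dist.argmin(dim=1)                # nearest medoid for each token
            for k in range(num_centers):
                members = (assign == k).nonzero(as_tuple=True)[0]
                if members.numel() == 0:
                    continue
                # new medoid = the member minimizing total distance to the other members
                intra = torch.cdist(seg[members], seg[members]).sum(dim=1)
                medoid_idx[k] = members[intra.argmin()]
        centers.append(seg[medoid_idx])
    # concatenate segment-level centers into one shortened token sequence
    return torch.cat(centers, dim=0)

In a pipeline along these lines, the shortened sequence returned here would replace the full per-frame token sequence fed into the later transformer blocks, which is where the memory and speed savings would come from.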
Related papers
- The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Existing methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video.
We propose VRS-HQ, an end-to-end video reasoning segmentation approach.
Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z)
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose Masked Video Consistency, a training strategy that enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences.
Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames.
Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z)
- SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z)
- Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation [76.40565872257709]
We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning.
It is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos.
Our algorithm sets the state of the art on two standard benchmarks (DAVIS17 and YouTube-VOS).
arXiv Detail & Related papers (2023-03-17T16:23:36Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
arXiv Detail & Related papers (2021-06-09T16:50:57Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)