Bi-Calibration Networks for Weakly-Supervised Video Representation
Learning
- URL: http://arxiv.org/abs/2206.10491v1
- Date: Tue, 21 Jun 2022 16:02:12 GMT
- Title: Bi-Calibration Networks for Weakly-Supervised Video Representation
Learning
- Authors: Fuchen Long and Ting Yao and Zhaofan Qiu and Xinmei Tian and Jiebo Luo
and Tao Mei
- Abstract summary: We introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning.
We present Bi-Calibration Networks (BCN), coupling two calibrations to learn the amendment from text to query and vice versa.
BCN learnt on 3M web videos obtains superior results under the linear model protocol on downstream tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The leverage of large volumes of web videos paired with the searched queries
or surrounding texts (e.g., title) offers an economic and extensible
alternative to supervised video representation learning. Nevertheless, modeling
such a weak visual-textual connection is not trivial due to query polysemy
(i.e., many possible meanings for a query) and text isomorphism (i.e., the same
syntactic structure shared by different texts). In this paper, we introduce a new
design of mutual calibration between query and text to boost weakly-supervised
video representation learning. Specifically, we present Bi-Calibration Networks
(BCN), which couples two calibrations to learn the amendment from text to
query and vice versa. Technically, BCN executes clustering on all the titles of
the videos searched by an identical query and takes the centroid of each
cluster as a text prototype. The query vocabulary is built directly on query
words. The video-to-text and video-to-query projections over the text prototypes
and query vocabulary then trigger the text-to-query and query-to-text calibrations
to estimate the amendment to the query or text. We also devise a selection scheme to
balance the two corrections. Two large-scale web video datasets paired with
query and title for each video are newly collected for weakly-supervised video
representation learning, named YOVO-3M and YOVO-10M, respectively.
The video features of BCN learnt on 3M web videos obtain superior results under
the linear model protocol on downstream tasks. More remarkably, BCN trained on the
larger set of 10M web videos with further fine-tuning leads to 1.6% and 1.8%
gains in top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets
over the state-of-the-art TDN and ACTION-Net methods with ImageNet
pre-training. Source code and datasets are available at
\url{https://github.com/FuchenUSTC/BCN}.
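To make the pipeline above concrete, the following is a minimal, hypothetical sketch of two of its steps: building text prototypes by clustering the title embeddings of videos retrieved by the same query, and computing the video-to-text/video-to-query projections as softmax-normalized similarities. The helper `embed_text`, the feature dimensions, and the cluster count are illustrative assumptions and are not taken from the released BCN code.
```python
# Hypothetical sketch of text-prototype construction and projections.
# `embed_text` is an assumed text encoder (returns a 1-D numpy feature);
# shapes and cluster count are illustrative, not the paper's settings.
import numpy as np
from collections import defaultdict
from sklearn.cluster import KMeans


def build_text_prototypes(video_titles, video_queries, embed_text, n_clusters=5):
    """Cluster title embeddings per query; centroids serve as text prototypes."""
    titles_by_query = defaultdict(list)
    for title, query in zip(video_titles, video_queries):
        titles_by_query[query].append(title)

    prototypes = {}
    for query, titles in titles_by_query.items():
        feats = np.stack([embed_text(t) for t in titles])   # (N, D)
        k = min(n_clusters, len(titles))
        km = KMeans(n_clusters=k, n_init=10).fit(feats)
        prototypes[query] = km.cluster_centers_              # (k, D)
    return prototypes


def project_video(video_feat, prototypes, query_vocab_feats):
    """Video-to-text and video-to-query projections as softmax over cosine similarities."""
    def softmax_sim(v, mat):
        sims = mat @ v / (np.linalg.norm(mat, axis=1) * np.linalg.norm(v) + 1e-8)
        e = np.exp(sims - sims.max())
        return e / e.sum()

    text_assign = {q: softmax_sim(video_feat, c) for q, c in prototypes.items()}
    query_assign = softmax_sim(video_feat, query_vocab_feats)
    return text_assign, query_assign
```
In the full BCN training these projections would feed the text-to-query and query-to-text calibrations and the selection scheme that balances the two corrections; those parts are omitted from this sketch.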
Related papers
- Videoprompter: an ensemble of foundational models for zero-shot video
understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations (a minimal sketch of this scoring step is given after the related-papers list).
We propose a framework that combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for
Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- Are All Combinations Equal? Combining Textual and Visual Features with
Multiple Space Learning for Text-Based Video Retrieval [9.537322316673617]
We investigate how to optimally combine multiple diverse textual and visual features into feature pairs.
To learn these representations our proposed network architecture is trained by following a multiple space learning procedure.
arXiv Detail & Related papers (2022-11-21T11:08:13Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Text-Adaptive Multiple Visual Prototype Matching for Video-Text
Retrieval [125.55386778388818]
Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web.
We propose a Text-Adaptive Multiple Visual Prototype Matching model, which automatically captures multiple prototypes to describe a video.
Our method outperforms state-of-the-art methods on four public video retrieval datasets.
arXiv Detail & Related papers (2022-09-27T11:13:48Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice
Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task in five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language, enforcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
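As referenced in the Videoprompter entry above, the similarity-based zero-shot classification performed by vision-language models can be sketched as follows. The callables `encode_video` and `encode_text`, the prompt template, and the cosine scoring are generic assumptions about CLIP-style models, not details taken from that paper.
```python
# Hypothetical sketch of zero-shot video classification with a VLM:
# the class whose text embedding is most similar to the video embedding wins.
# `encode_video` / `encode_text` are assumed encoder callables, not a real API.
import numpy as np


def zero_shot_classify(video, class_names, encode_video, encode_text,
                       template="a video of {}"):
    v = encode_video(video)                              # (D,) video feature
    v = v / (np.linalg.norm(v) + 1e-8)

    scores = []
    for name in class_names:
        t = encode_text(template.format(name))           # (D,) class-label feature
        t = t / (np.linalg.norm(t) + 1e-8)
        scores.append(float(v @ t))                      # cosine similarity

    best = int(np.argmax(scores))
    return class_names[best], scores
```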