CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
- URL: http://arxiv.org/abs/2211.03314v1
- Date: Mon, 7 Nov 2022 05:32:12 GMT
- Title: CLOP: Video-and-Language Pre-Training with Knowledge Regularizations
- Authors: Guohao Li, Hu Yang, Feng He, Zhifan Feng, Yajuan Lyu, Hua Wu, Haifeng
Wang
- Abstract summary: Video-and-language pre-training has shown promising results for learning generalizable representations.
We denote this form of representation as structural knowledge, which expresses rich semantics of multiple granularities.
We propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations.
- Score: 43.09248976105326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-and-language pre-training has shown promising results for learning
generalizable representations. Most existing approaches usually model video and
text in an implicit manner, without considering explicit structural
representations of the multi-modal content. We denote this form of
representation as structural knowledge, which expresses rich semantics of
multiple granularities. There are related works that propose object-aware
approaches to inject similar knowledge as inputs. However, the existing methods
usually fail to effectively utilize such knowledge as regularizations to shape
a superior cross-modal representation space. To this end, we propose a
Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge
Regularizations. Our method has two key designs: 1) a simple yet effective
Structural Knowledge Prediction (SKP) task to pull together the latent
representations of similar videos; and 2) a novel Knowledge-guided sampling
approach for Contrastive Learning (KCL) to push apart cross-modal hard negative
samples. We evaluate our method on four text-video retrieval tasks and one
multi-choice QA task. The experiments show clear improvements, outperforming
prior works by a substantial margin. Besides, we provide ablations and insights
of how our methods affect the latent representation space, demonstrating the
value of incorporating knowledge regularizations into video-and-language
pre-training.
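To make the two regularizers concrete, below is a minimal PyTorch sketch (not the authors' released code) of how SKP and KCL could be implemented on top of pooled video and text embeddings. The names `tag_head`, `tag_targets`, and `knowledge_sim` are illustrative assumptions: a multi-label head over structural-knowledge tags, its multi-hot targets, and a batch-level matrix of knowledge similarity between samples.

```python
# Illustrative sketch of SKP and KCL; names and shapes are assumptions,
# not taken from the paper's implementation.
import torch
import torch.nn.functional as F


def skp_loss(video_emb, tag_head, tag_targets):
    """Structural Knowledge Prediction (SKP): predict multi-hot structural
    tags from video embeddings, so videos that share knowledge are pulled
    toward nearby latent representations."""
    # tag_head: nn.Module mapping (B, D) -> (B, num_tags); tag_targets: (B, num_tags)
    return F.binary_cross_entropy_with_logits(tag_head(video_emb), tag_targets)


def kcl_loss(video_emb, text_emb, knowledge_sim, temperature=0.07, hard_weight=2.0):
    """Knowledge-guided Contrastive Learning (KCL): InfoNCE over in-batch
    pairs, where negatives that share structural knowledge with the anchor
    (high `knowledge_sim`, a (B, B) matrix in [0, 1]) are up-weighted as
    hard negatives."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    weights = 1.0 + hard_weight * knowledge_sim    # emphasize hard negatives
    weights.fill_diagonal_(1.0)                    # positives keep weight 1
    log_prob = logits.diagonal() - torch.log((logits.exp() * weights).sum(-1))
    return -log_prob.mean()
```

In this reading, SKP pulls videos with overlapping structural knowledge together through a shared prediction target, while KCL keeps the standard positives intact but makes knowledge-similar in-batch negatives harder to ignore.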
Related papers
- Conditional Prototype Rectification Prompt Learning [32.533844163120875]
We propose a Conditional Prototype Rectification Prompt Learning (CPR) method to correct the bias of base examples and effectively augment limited data.
CPR achieves state-of-the-art performance on both few-shot classification and base-to-new generalization tasks.
arXiv Detail & Related papers (2024-04-15T15:43:52Z)
- Generating Action-conditioned Prompts for Open-vocabulary Video Action Recognition [63.95111791861103]
Existing methods typically adapt pretrained image-text models to the video domain.
We argue that augmenting text embeddings with human prior knowledge is pivotal for open-vocabulary video action recognition.
Our method not only sets new SOTA performance but also possesses excellent interpretability.
arXiv Detail & Related papers (2023-12-04T02:31:38Z)
- DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem.
Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
- Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations [54.62547989034184]
We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space.
Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
arXiv Detail & Related papers (2022-11-21T13:12:44Z)
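As a rough illustration of the EM idea summarized in the EMCL entry above (a sketch under assumptions, not EMCL's actual algorithm): alternate soft assignment of features to a small set of bases with re-estimation of those bases, then re-express each feature in the compact span of the bases.

```python
# Hypothetical EM-style basis estimation for a batch of latent features.
import torch
import torch.nn.functional as F


def em_bases(features, num_bases=32, iters=3, temperature=1.0):
    """features: (N, D) with N >= num_bases. Returns (num_bases, D) bases."""
    n, _ = features.shape
    bases = features[torch.randperm(n)[:num_bases]].clone()  # random init
    for _ in range(iters):
        # E-step: soft-assign each feature to the current bases.
        resp = torch.softmax(features @ bases.T / temperature, dim=-1)  # (N, K)
        # M-step: re-estimate bases as responsibility-weighted feature means.
        bases = (resp.T @ features) / (resp.sum(dim=0).unsqueeze(1) + 1e-6)
        bases = F.normalize(bases, dim=-1)
    return bases

# Compact representation: project features onto the learned subspace.
# compact = torch.softmax(features @ bases.T, dim=-1) @ bases
```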
- Contrastive Language-Image Pre-Training with Knowledge Graphs [33.211811772961234]
We propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model.
Our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities.
arXiv Detail & Related papers (2022-10-17T09:49:22Z)
- Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts transcripts by attending to learned video representations.
This enables the model to contextualize what is happening, much as humans do, and to apply seamlessly to large-scale uncurated real-world video data.
arXiv Detail & Related papers (2022-09-30T07:39:48Z)
- Contrastive Cross-Modal Knowledge Sharing Pre-training for Vision-Language Representation Learning and Retrieval [12.30468719055037]
A Contrastive Cross-Modal Knowledge Sharing Pre-training (COOKIE) is developed to grasp the joint text-image representations.
The first module is a weight-sharing transformer built on top of the visual and textual encoders.
The second is a set of three specially designed contrastive learning objectives, aiming to share knowledge between different models.
arXiv Detail & Related papers (2022-07-02T04:08:44Z)
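A minimal sketch of the weight-sharing idea from the COOKIE entry above (illustrative, not the paper's architecture): a single transformer layer whose parameters are applied identically to both modalities' token sequences, so vision and text are projected through the same weights into a shared space.

```python
# Hypothetical weight-sharing head over two modality encoders.
import torch
import torch.nn as nn


class SharedModalityHead(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # One layer, one set of parameters, used for both modalities.
        self.shared = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, dim); text_tokens: (B, Nt, dim)
        return self.shared(visual_tokens), self.shared(text_tokens)
```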
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)