Expectation-Maximization Contrastive Learning for Compact
Video-and-Language Representations
- URL: http://arxiv.org/abs/2211.11427v1
- Date: Mon, 21 Nov 2022 13:12:44 GMT
- Title: Expectation-Maximization Contrastive Learning for Compact
Video-and-Language Representations
- Authors: Peng Jin, Jinfa Huang, Fenglin Liu, Xian Wu, Shen Ge, Guoli Song,
David A. Clifton, Jie Chen
- Abstract summary: We propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space.
Experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations.
- Score: 54.62547989034184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most video-and-language representation learning approaches employ contrastive
learning, e.g., CLIP, to project the video and text features into a common
latent space according to the semantic similarities of text-video pairs.
However, such learned shared latent spaces are often not optimal, and the
modality gap between visual and textual representations cannot be fully
eliminated. In this paper, we propose Expectation-Maximization Contrastive
Learning (EMCL) to learn compact video-and-language representations.
Specifically, we use the Expectation-Maximization algorithm to find a compact
set of bases for the latent space, where the features could be concisely
represented as the linear combinations of these bases. Such feature
decomposition of video-and-language representations reduces the rank of the
latent space, resulting in increased representational power for the semantics.
Extensive experiments on three benchmark text-video retrieval datasets prove
that our EMCL can learn more discriminative video-and-language representations
than previous methods, and significantly outperform previous state-of-the-art
methods across all metrics. More encouragingly, the proposed method can be
applied to boost the performance of existing approaches, either as a jointly
trained layer or as an out-of-the-box inference module with no extra training,
making it easy to incorporate into existing methods.
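The abstract describes the mechanism only at a high level. The following is a minimal, hedged sketch (not the authors' code) of that idea: fit a small set of bases to a batch of video and text features with a few soft EM iterations, then re-express every feature as a linear combination of those bases, so the reconstructed representations lie in a low-rank subspace. The function name, number of bases, inverse temperature, and iteration count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def em_compact_reconstruction(features, num_bases=32, num_iters=5,
                              inv_temperature=10.0, seed=0):
    """features: (N, D) L2-normalized video+text embeddings.
    Returns (N, D) features reconstructed from `num_bases` shared bases."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    bases = rng.standard_normal((num_bases, d))
    bases /= np.linalg.norm(bases, axis=1, keepdims=True)

    for _ in range(num_iters):
        # E-step: soft assignment of each feature to each basis.
        resp = softmax(inv_temperature * features @ bases.T, axis=1)  # (N, K)
        # M-step: update bases as responsibility-weighted feature means.
        bases = resp.T @ features / (resp.sum(axis=0, keepdims=True).T + 1e-6)
        bases /= np.linalg.norm(bases, axis=1, keepdims=True) + 1e-6

    # Reconstruction: each feature becomes a linear combination of the bases,
    # so the reconstructed feature matrix has rank at most num_bases.
    resp = softmax(inv_temperature * features @ bases.T, axis=1)
    return resp @ bases

# Usage: apply to the concatenated video and text embeddings before a
# contrastive (e.g. InfoNCE) loss, or, as the abstract notes, as an
# inference-time module with no extra training.
video = np.random.randn(64, 512)
text = np.random.randn(64, 512)
video /= np.linalg.norm(video, axis=1, keepdims=True)
text /= np.linalg.norm(text, axis=1, keepdims=True)
compact = em_compact_reconstruction(np.concatenate([video, text], axis=0))
print(compact.shape)  # (128, 512); the matrix has rank at most 32
```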
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z) - Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z) - Unifying Latent and Lexicon Representations for Effective Video-Text
Retrieval [87.69394953339238]
We propose the UNIFY framework, which learns lexicon representations to capture fine-grained semantics in video-text retrieval.
We show our framework largely outperforms previous video-text retrieval methods, with 4.8% and 8.2% Recall@1 improvement on MSR-VTT and DiDeMo respectively.
arXiv Detail & Related papers (2024-02-26T17:36:50Z) - Towards Efficient and Effective Text-to-Video Retrieval with
Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations.
arXiv Detail & Related papers (2024-01-01T08:54:18Z) - UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to map text-video pairs into the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
arXiv Detail & Related papers (2023-01-16T08:43:17Z) - CLOP: Video-and-Language Pre-Training with Knowledge Regularizations [43.09248976105326]
Video-and-language pre-training has shown promising results for learning generalizable representations.
We denote this form of representation as structural knowledge, which expresses rich semantics at multiple granularities.
We propose a Cross-modaL knOwledge-enhanced Pre-training (CLOP) method with Knowledge Regularizations.
arXiv Detail & Related papers (2022-11-07T05:32:12Z) - Towards Fast Adaptation of Pretrained Contrastive Models for
Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)