(Un)likelihood Training for Interpretable Embedding
- URL: http://arxiv.org/abs/2207.00282v3
- Date: Fri, 10 Nov 2023 10:18:00 GMT
- Title: (Un)likelihood Training for Interpretable Embedding
- Authors: Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan and Zhijian Hou
- Abstract summary: Cross-modal representation learning has become a new normal for bridging the semantic gap between text and visual data.
We propose two novel training objectives, likelihood and unlikelihood functions, to unroll semantics behind embeddings.
With both training objectives, a new encoder-decoder network, which learns interpretable cross-modal representation, is proposed for ad-hoc video search.
- Score: 30.499562324921648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal representation learning has become a new normal for bridging the
semantic gap between text and visual data. Learning modality-agnostic
representations in a continuous latent space, however, is often treated as a
black-box, data-driven training process. It is well known that the effectiveness
of representation learning depends heavily on the quality and scale of training
data. For video representation learning, obtaining a complete set of training
labels that annotate the full spectrum of video content is difficult, if not
impossible. These two issues, black-box training and dataset bias, make
representation learning difficult to deploy in practice for video understanding,
as it yields unexplainable and unpredictable results. In this paper, we
propose two novel training objectives, likelihood and unlikelihood functions,
to unroll semantics behind embeddings while addressing the label sparsity
problem in training. The likelihood training aims to interpret semantics of
embeddings beyond training labels, while the unlikelihood training leverages
prior knowledge for regularization to ensure semantically coherent
interpretation. With both training objectives, a new encoder-decoder network,
which learns interpretable cross-modal representation, is proposed for ad-hoc
video search. Extensive experiments on TRECVid and MSR-VTT datasets show the
proposed network outperforms several state-of-the-art retrieval models with a
statistically significant performance margin.
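The exact likelihood and unlikelihood objectives are defined in the paper; as a rough illustration of the idea only, the sketch below combines a likelihood term that raises the decoded probability of concepts labeled as relevant with an unlikelihood term that suppresses concepts ruled out by prior knowledge. This is not the authors' implementation; the function name, the masks, and the weighting are assumptions.
```python
import torch

def likelihood_unlikelihood_loss(concept_logits, positive_mask, negative_mask, alpha=1.0):
    """Hypothetical sketch of a combined likelihood/unlikelihood objective.

    concept_logits: (batch, vocab) concept scores decoded from an embedding.
    positive_mask:  (batch, vocab) 1 for concepts labeled relevant to the sample.
    negative_mask:  (batch, vocab) 1 for concepts prior knowledge rules out.
    """
    probs = torch.sigmoid(concept_logits)
    eps = 1e-8

    # Likelihood: raise the probability of annotated (relevant) concepts.
    likelihood = -(positive_mask * torch.log(probs + eps)).sum(dim=1)

    # Unlikelihood: suppress the probability of concepts known to be irrelevant.
    unlikelihood = -(negative_mask * torch.log(1.0 - probs + eps)).sum(dim=1)

    return (likelihood + alpha * unlikelihood).mean()

# Toy usage with random tensors.
logits = torch.randn(4, 100)
pos = (torch.rand(4, 100) > 0.95).float()
neg = (torch.rand(4, 100) > 0.95).float() * (1 - pos)
loss = likelihood_unlikelihood_loss(logits, pos, neg)
```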
Related papers
- Exploiting Minority Pseudo-Labels for Semi-Supervised Semantic Segmentation in Autonomous Driving [2.638145329894673]
We propose a professional training module to enhance minority class learning and a general training module to learn more comprehensive semantic information.
In experiments, our framework demonstrates superior performance compared to state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2024-09-19T11:47:25Z)
- TVE: Learning Meta-attribution for Transferable Vision Explainer [76.68234965262761]
We introduce a Transferable Vision Explainer (TVE) that can effectively explain various vision models in downstream tasks.
TVE is realized through a pre-training process on large-scale datasets towards learning the meta-attribution.
This meta-attribution leverages the versatility of generic backbone encoders to comprehensively encode the attribution knowledge for the input instance, which enables TVE to seamlessly transfer to explain various downstream tasks.
arXiv Detail & Related papers (2023-12-23T21:49:23Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on the LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person re-identification (reID), person attribute recognition, and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
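The audio-visual entry above relies on cross-modal attention; as an illustration only (the paper's actual architecture may differ), one modality can attend to the other with standard multi-head attention. The module name and dimensions below are assumptions.
```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Minimal sketch: audio tokens attend to visual tokens (the mirror
    direction would be analogous). Dimensions are illustrative assumptions."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, num_audio_tokens, dim)
        # visual_feats: (batch, num_visual_tokens, dim)
        attended, _ = self.attn(query=audio_feats, key=visual_feats, value=visual_feats)
        return self.norm(audio_feats + attended)  # residual connection + layer norm

# Toy usage.
audio = torch.randn(2, 10, 512)
visual = torch.randn(2, 20, 512)
fused = CrossModalAttention()(audio, visual)  # (2, 10, 512)
```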
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt a single pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
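The prompting entry above stacks lightweight Transformers on frame-wise visual features to encode temporal information. The sketch below is a minimal, assumed version of that idea: a small Transformer encoder over per-frame features from a frozen image-language backbone (e.g. 512-d CLIP features); the module and parameter names are hypothetical.
```python
import torch
import torch.nn as nn

class TemporalPooler(nn.Module):
    """Hypothetical sketch: a small Transformer encoder over per-frame features
    from a frozen image-language backbone, mean-pooled into a video embedding."""

    def __init__(self, dim=512, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           dim_feedforward=2 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, dim), e.g. frozen per-frame CLIP features.
        temporal = self.encoder(frame_feats)
        return temporal.mean(dim=1)  # (batch, dim) video-level embedding

# Toy usage: 8 frames of 512-d features.
video_emb = TemporalPooler()(torch.randn(2, 8, 512))
```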
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
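The dense contrastive entry above replaces region regression and classification with cross-modality region contrastive learning. The paper's exact loss is not reproduced here; a generic symmetric InfoNCE objective between paired visual and text embeddings, sketched below, only conveys the general mechanism. All names are illustrative.
```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(img_emb, txt_emb, temperature=0.07):
    """Generic symmetric contrastive loss between paired embeddings.
    img_emb, txt_emb: (batch, dim); row i of each forms a matching pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs sit on the diagonal; both retrieval directions are penalized.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_info_nce(torch.randn(8, 256), torch.randn(8, 256))
```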
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
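The ASCNet entry above argues that consistency between positive samples is key. A generic positive-pair consistency loss over two augmented views (e.g. different playback speeds or appearance jitter) of the same clip, sketched below under those assumptions, illustrates one way to realize that idea; it is not ASCNet's exact objective.
```python
import torch
import torch.nn.functional as F

def positive_consistency_loss(view_a, view_b):
    """Generic positive-pair consistency: embeddings of two augmented views
    (e.g. different speeds / appearance jitter) of the same clip should agree."""
    view_a = F.normalize(view_a, dim=-1)
    view_b = F.normalize(view_b, dim=-1)
    # Maximize cosine similarity between matched views.
    return (1.0 - (view_a * view_b).sum(dim=-1)).mean()

loss = positive_consistency_loss(torch.randn(4, 128), torch.randn(4, 128))
```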
- Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks.
We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning.
Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
- Teaching with Commentaries [108.62722733649542]
We propose a flexible teaching framework using commentaries and learned meta-information.
We find that commentaries can improve training speed and/or performance.
Commentaries can also be reused when training new models to obtain performance benefits.
arXiv Detail & Related papers (2020-11-05T18:52:46Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state of the art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.