Probing Visual-Audio Representation for Video Highlight Detection via
Hard-Pairs Guided Contrastive Learning
- URL: http://arxiv.org/abs/2206.10157v1
- Date: Tue, 21 Jun 2022 07:29:37 GMT
- Title: Probing Visual-Audio Representation for Video Highlight Detection via
Hard-Pairs Guided Contrastive Learning
- Authors: Shuaicheng Li, Feng Zhang, Kunlin Yang, Lingbo Liu, Shinan Liu, Jun
Hou, Shuai Yi
- Abstract summary: Key to effective video representations is cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
- Score: 23.472951216815765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video highlight detection is a crucial yet challenging problem that aims to
identify the interesting moments in untrimmed videos. The key to this task lies
in effective video representations that jointly pursue two goals,
\textit{i.e.}, cross-modal representation learning and fine-grained feature
discrimination. In this paper, these two challenges are tackled by not only
enriching intra-modality and cross-modality relations for representation
modeling but also shaping the features in a discriminative manner. Our proposed
method mainly leverages the intra-modality encoding and cross-modality
co-occurrence encoding for fully representation modeling. Specifically,
intra-modality encoding augments the modality-wise features and dampens
irrelevant modality via within-modality relation learning in both audio and
visual signals. Meanwhile, cross-modality co-occurrence encoding focuses on the
co-occurrence inter-modality relations and selectively captures effective
information among multi-modality. The multi-modal representation is further
enhanced by the global information abstracted from the local context. In
addition, we enlarge the discriminative power of feature embedding with a
hard-pairs guided contrastive learning (HPCL) scheme. A hard-pairs sampling
strategy is further employed to mine the hard samples for improving feature
discrimination in HPCL. Extensive experiments conducted on two benchmarks
demonstrate the effectiveness and superiority of our proposed methods compared
to other state-of-the-art methods.
Related papers
- Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation
Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Modality Unifying Network for Visible-Infrared Person Re-Identification [24.186989535051623]
Visible-infrared person re-identification (VI-ReID) is a challenging task due to large cross-modality discrepancies and intra-class variations.
Existing methods mainly focus on learning modality-shared representations by embedding different modalities into the same feature space.
We propose a novel Modality Unifying Network (MUN) to explore a robust auxiliary modality for VI-ReID.
arXiv Detail & Related papers (2023-09-12T14:22:22Z) - Cross-modal Contrastive Learning for Multimodal Fake News Detection [10.760000041969139]
COOLANT is a cross-modal contrastive learning framework for multimodal fake news detection.
A cross-modal fusion module is developed to learn the cross-modality correlations.
An attention guidance module is implemented to help effectively and interpretably aggregate the aligned unimodal representations.
arXiv Detail & Related papers (2023-02-25T10:12:34Z) - Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z) - Multi-Modal Mutual Information Maximization: A Novel Approach for
Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH)
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z) - Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment.
arXiv Detail & Related papers (2020-11-03T11:18:04Z) - Universal Weighting Metric Learning for Cross-Modal Matching [79.32133554506122]
Cross-modal matching has been a highlighted research topic in both vision and language areas.
We propose a simple and interpretable universal weighting framework for cross-modal matching.
arXiv Detail & Related papers (2020-10-07T13:16:45Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person
Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.