Probing Visual-Audio Representation for Video Highlight Detection via
Hard-Pairs Guided Contrastive Learning
- URL: http://arxiv.org/abs/2206.10157v1
- Date: Tue, 21 Jun 2022 07:29:37 GMT
- Title: Probing Visual-Audio Representation for Video Highlight Detection via
Hard-Pairs Guided Contrastive Learning
- Authors: Shuaicheng Li, Feng Zhang, Kunlin Yang, Lingbo Liu, Shinan Liu, Jun
Hou, Shuai Yi
- Abstract summary: The key to effective video representations lies in cross-modal representation learning and fine-grained feature discrimination.
In this paper, we enrich intra-modality and cross-modality relations for representation modeling.
We enlarge the discriminative power of feature embedding with a hard-pairs guided contrastive learning scheme.
- Score: 23.472951216815765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video highlight detection is a crucial yet challenging problem that aims to
identify the interesting moments in untrimmed videos. The key to this task lies
in effective video representations that jointly pursue two goals,
i.e., cross-modal representation learning and fine-grained feature
discrimination. In this paper, these two challenges are tackled by not only
enriching intra-modality and cross-modality relations for representation
modeling but also shaping the features in a discriminative manner. Our proposed
method mainly leverages intra-modality encoding and cross-modality
co-occurrence encoding for full representation modeling. Specifically,
intra-modality encoding augments the modality-wise features and suppresses
irrelevant modality noise via within-modality relation learning in both the
audio and visual signals. Meanwhile, cross-modality co-occurrence encoding
focuses on co-occurrence relations between modalities and selectively captures
effective information across them (sketched below). The multi-modal
enhanced by the global information abstracted from the local context. In
addition, we enlarge the discriminative power of the feature embedding with a
hard-pairs guided contrastive learning (HPCL) scheme (see the sketch below), in
which a hard-pairs sampling strategy mines hard samples to further improve
feature discrimination. Extensive experiments conducted on two benchmarks
demonstrate the effectiveness and superiority of our proposed method compared
to other state-of-the-art methods.
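The abstract does not spell out how the cross-modality co-occurrence encoding is built, so the following is only a minimal sketch of one plausible realization: visual and audio token sequences attend to each other with standard multi-head cross-attention, and a pooled global context is added back to the local features. The module name `CrossModalCoOccurrence`, the feature dimension, the number of heads, and the mean-pooling step are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: cross-modality co-occurrence encoding via
# bidirectional cross-attention plus a pooled global context (all hyperparameters
# and the mean-pooling choice are assumptions, not the paper's design).
import torch
import torch.nn as nn


class CrossModalCoOccurrence(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # visual queries attend over audio keys/values, and vice versa
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis: (B, Tv, D) visual clip features; aud: (B, Ta, D) audio features
        v2, _ = self.v_from_a(vis, aud, aud)   # visual enriched by co-occurring audio
        a2, _ = self.a_from_v(aud, vis, vis)   # audio enriched by co-occurring visual
        vis = self.norm_v(vis + v2)            # residual fusion
        aud = self.norm_a(aud + a2)
        # global context abstracted from the local features (assumed: mean pooling)
        global_ctx = torch.cat([vis, aud], dim=1).mean(dim=1, keepdim=True)
        return vis + global_ctx, aud + global_ctx


if __name__ == "__main__":
    enc = CrossModalCoOccurrence()
    v, a = torch.randn(2, 32, 256), torch.randn(2, 32, 256)
    fused_v, fused_a = enc(v, a)
    print(fused_v.shape, fused_a.shape)  # torch.Size([2, 32, 256]) twice
```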
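Likewise, the hard-pairs guided contrastive learning (HPCL) objective is only described at a high level. Below is a minimal sketch under assumptions: matched visual-audio clip embeddings form positive pairs, and only the top-k most similar negatives (the mined "hard pairs") enter an InfoNCE-style denominator. The top-k rule, temperature, and L2 normalization are illustrative choices, not the paper's exact formulation.

```python
# Illustrative sketch only: a hard-pairs guided contrastive loss in the spirit
# of HPCL, keeping just the hardest negatives per anchor (assumed formulation).
import torch
import torch.nn.functional as F


def hard_pair_contrastive_loss(anchor: torch.Tensor,
                               positive: torch.Tensor,
                               temperature: float = 0.1,
                               k_hard: int = 16) -> torch.Tensor:
    # anchor, positive: (N, D) embeddings of N matched visual-audio clip pairs
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)

    sim = anchor @ positive.t() / temperature      # (N, N) cosine similarities
    pos = sim.diag()                               # similarities of matched pairs

    # exclude the positives, then keep only the k most similar (hard) negatives
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float("-inf"))
    k = min(k_hard, sim.size(0) - 1)
    hard_neg, _ = neg.topk(k, dim=1)               # (N, k) hardest negatives

    # InfoNCE: each matched pair should outrank its hard negatives
    logits = torch.cat([pos.unsqueeze(1), hard_neg], dim=1)
    labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    a, p = torch.randn(64, 256), torch.randn(64, 256)
    print(hard_pair_contrastive_loss(a, p).item())
```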
Related papers
- Leveraging Weak Cross-Modal Guidance for Coherence Modelling via Iterative Learning [66.28872204574648]
Cross-modal coherence modeling is essential for intelligent systems to help them organize and structure information.
Previous work on cross-modal coherence modeling attempted to leverage the order information from another modality to assist coherence recovery in the target modality.
This paper explores a new way to take advantage of cross-modal guidance without gold labels on coherency.
arXiv Detail & Related papers (2024-08-01T06:04:44Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Modality Unifying Network for Visible-Infrared Person Re-Identification [24.186989535051623]
Visible-infrared person re-identification (VI-ReID) is a challenging task due to large cross-modality discrepancies and intra-class variations.
Existing methods mainly focus on learning modality-shared representations by embedding different modalities into the same feature space.
We propose a novel Modality Unifying Network (MUN) to explore a robust auxiliary modality for VI-ReID.
arXiv Detail & Related papers (2023-09-12T14:22:22Z) - Cross-modal Contrastive Learning for Multimodal Fake News Detection [10.760000041969139]
COOLANT is a cross-modal contrastive learning framework for multimodal fake news detection.
A cross-modal fusion module is developed to learn the cross-modality correlations.
An attention guidance module is implemented to help effectively and interpretably aggregate the aligned unimodal representations.
arXiv Detail & Related papers (2023-02-25T10:12:34Z) - Multimodal Contrastive Learning via Uni-Modal Coding and Cross-Modal
Prediction for Multimodal Sentiment Analysis [19.07020276666615]
We propose a novel framework named MultiModal Contrastive Learning (MMCL) for multimodal representation to capture intra- and inter-modality dynamics simultaneously.
We also design two contrastive learning tasks, instance- and sentiment-based contrastive learning, to promote the process of prediction and learn more interactive information related to sentiment.
arXiv Detail & Related papers (2022-10-26T08:24:15Z) - Multi-Modal Mutual Information Maximization: A Novel Approach for
Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH)
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z) - Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment.
arXiv Detail & Related papers (2020-11-03T11:18:04Z) - Universal Weighting Metric Learning for Cross-Modal Matching [79.32133554506122]
Cross-modal matching has been a highlighted research topic in both vision and language areas.
We propose a simple and interpretable universal weighting framework for cross-modal matching.
arXiv Detail & Related papers (2020-10-07T13:16:45Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person
Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)