Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal
Retrieval
- URL: http://arxiv.org/abs/2211.03434v1
- Date: Mon, 7 Nov 2022 10:37:14 GMT
- Title: Complete Cross-triplet Loss in Label Space for Audio-visual Cross-modal
Retrieval
- Authors: Donghuo Zeng, Yanan Wang, Jianming Wu, and Kazushi Ikeda
- Abstract summary: Cross-modal data (e.g., audio-visual) have different distributions and representations that cannot be directly compared.
To bridge the gap between audiovisual modalities, we learn a common subspace for them by utilizing the intrinsic correlation in the natural synchronization of audio-visual data with the aid of annotated labels.
We propose a new AV-CMR model that optimizes semantic features by directly predicting labels and then measuring the intrinsic correlation between audio-visual data using a complete cross-triplet loss.
- Score: 7.459223771397159
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The heterogeneity gap problem is the main challenge in cross-modal retrieval.
Because cross-modal data (e.g. audiovisual) have different distributions and
representations that cannot be directly compared. To bridge the gap between
audiovisual modalities, we learn a common subspace for them by utilizing the
intrinsic correlation in the natural synchronization of audio-visual data with
the aid of annotated labels. TNN-CCCA is the strongest audio-visual cross-modal
retrieval (AV-CMR) model to date, but its training is sensitive to hard
negative samples when learning the common subspace by applying a triplet loss to
predict the relative distances between inputs. In this paper, to reduce the
interference of hard negative samples in representation learning, we propose a
new AV-CMR model that optimizes semantic features by directly predicting labels
and then measuring the intrinsic correlation between audio-visual data using a
complete cross-triplet loss. In particular, our model projects audio-visual
features into the label space by minimizing the distance between the predicted
label features after feature projection and the ground-truth label
representations. Moreover, we adopt the complete cross-triplet loss to optimize
the predicted label features by leveraging all possible similarity and
dissimilarity relationships of semantic information across modalities. Extensive
experimental results on two double-checked audio-visual datasets show an
improvement of approximately 2.1% in terms of average MAP over the current
state-of-the-art method TNN-CCCA for the AV-CMR task, which indicates the
effectiveness of our proposed model.
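To make the two training signals described in the abstract concrete, the sketch below shows a minimal PyTorch-style version of (i) a label-space projection loss that drives the predicted label features of both modalities toward the ground-truth label vectors, and (ii) a cross-modal triplet hinge computed over all valid (anchor, positive, negative) combinations rather than sampled hard negatives. The function names, the one-hot label encoding, the Euclidean distance, and the margin value are illustrative assumptions, not details taken from the paper or its released code.

```python
import torch
import torch.nn.functional as F


def label_projection_loss(audio_label_feats, visual_label_feats, labels_onehot):
    """Pull the predicted label features of both modalities toward the
    ground-truth label vectors (encoded one-hot here, an illustrative choice)."""
    return (F.mse_loss(audio_label_feats, labels_onehot)
            + F.mse_loss(visual_label_feats, labels_onehot))


def complete_cross_triplet_loss(audio_feats, visual_feats, labels, margin=0.2):
    """Margin hinge over every cross-modal (anchor, positive, negative) triplet.

    Each audio anchor is pulled toward visual samples of the same class and
    pushed away from visual samples of a different class, and vice versa,
    instead of relying on a single sampled hard negative per anchor.
    """
    same = labels.unsqueeze(0) == labels.unsqueeze(1)      # (N, N): class agreement

    def one_direction(anchors, others):
        dist = torch.cdist(anchors, others)                 # (N, N) Euclidean distances
        # viol[a, p, n] = d(a, p) - d(a, n) + margin
        viol = dist.unsqueeze(2) - dist.unsqueeze(1) + margin
        # keep only combinations where p shares the anchor's class and n does not
        valid = (same.unsqueeze(2) & (~same).unsqueeze(1)).float()
        return (torch.relu(viol) * valid).sum() / valid.sum().clamp(min=1.0)

    return one_direction(audio_feats, visual_feats) + one_direction(visual_feats, audio_feats)
```

Summing the hinge over every cross-modal combination is one plausible reading of "complete" here: no single hard negative dominates the gradient, which is the sensitivity the abstract attributes to the triplet loss used by TNN-CCCA.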
Related papers
- Cross-Modal Denoising: A Novel Training Paradigm for Enhancing Speech-Image Retrieval [16.968343177634015]
We introduce an effective framework and a novel learning task named cross-modal denoising (CMD) to enhance cross-modal interaction.
Specifically, CMD is a denoising task designed to reconstruct semantic features from noisy features within one modality by interacting with features from another modality.
The experimental results demonstrate that our framework outperforms the state-of-the-art method by 2.0% in mean R@1 on the Flickr8k dataset and by 1.7% in mean R@1 on the SpokenCOCO dataset.
arXiv Detail & Related papers (2024-08-15T02:42:05Z) - Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z) - Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion [93.32354378820648]
We introduce MVSD, a mutual learning framework based on diffusion models.
MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks.
Our framework can improve the performance of the reverberator and dereverberator.
arXiv Detail & Related papers (2024-07-15T00:47:56Z) - Anchor-aware Deep Metric Learning for Audio-visual Retrieval [11.675472891647255]
Metric learning aims at capturing the underlying data structure and enhancing the performance of tasks like audio-visual cross-modal retrieval (AV-CMR).
Recent works employ sampling methods to select impactful data points from the embedding space during training.
However, the model training fails to fully explore the space due to the scarcity of training data points.
We propose an innovative Anchor-aware Deep Metric Learning (AADML) method to address this challenge.
arXiv Detail & Related papers (2024-04-21T22:44:44Z) - Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation, with no need for task-specific data annotation or model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z) - Cross-Modal Global Interaction and Local Alignment for Audio-Visual
Speech Recognition [21.477900473255264]
We propose a cross-modal global interaction and local alignment (GILA) approach for audio-visual speech recognition (AVSR).
Specifically, we design a global interaction model to capture the A-V complementary relationship at the modality level, as well as a local alignment approach to model the A-V temporal consistency at the frame level.
Our GILA outperforms the supervised learning state-of-the-art on public benchmarks LRS3 and LRS2.
arXiv Detail & Related papers (2023-05-16T06:41:25Z) - Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z) - Linking data separation, visual separation, and classifier performance
using pseudo-labeling by contrastive learning [125.99533416395765]
We argue that the performance of the final classifier depends on the data separation present in the latent space and visual separation present in the projection.
We demonstrate our results by the classification of five real-world challenging image datasets of human intestinal parasites with only 1% supervised samples.
arXiv Detail & Related papers (2023-02-06T10:01:38Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Multi-Modal Perception Attention Network with Self-Supervised Learning
for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z) - Unsupervised Sound Localization via Iterative Contrastive Learning [106.56167882750792]
We propose an iterative contrastive learning framework that requires no data annotations.
We then use the pseudo-labels to learn the correlation between the visual and audio signals sampled from the same video.
Our iterative strategy gradually encourages the localization of the sounding objects and reduces the correlation between the non-sounding regions and the reference audio.
arXiv Detail & Related papers (2021-04-01T07:48:29Z)