Two-Stage Triplet Loss Training with Curriculum Augmentation for
Audio-Visual Retrieval
- URL: http://arxiv.org/abs/2310.13451v1
- Date: Fri, 20 Oct 2023 12:35:54 GMT
- Title: Two-Stage Triplet Loss Training with Curriculum Augmentation for
Audio-Visual Retrieval
- Authors: Donghuo Zeng and Kazushi Ikeda
- Abstract summary: Cross-modal retrieval models learn robust embedding spaces.
We introduce a novel approach rooted in curriculum learning to address this problem.
We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets.
- Score: 3.164991885881342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-modal retrieval models leverage triplet loss
optimization to learn robust embedding spaces. However, existing methods often
train these models in a single pass, overlooking the distinction between
semi-hard and hard triplets in the optimization process. This oversight leads
to suboptimal model performance. In this paper, we introduce a novel approach
rooted in curriculum
learning to address this problem. We propose a two-stage training paradigm that
guides the model's learning process from semi-hard to hard triplets. In the
first stage, the model is trained with a set of semi-hard triplets, starting
from a low-loss base. Subsequently, in the second stage, we augment the
embeddings using an interpolation technique. This process identifies potential
hard negatives, alleviating the high-loss instability that arises from a
scarcity of naturally occurring hard triplets. Our approach then applies hard triplet mining in the
augmented embedding space to further optimize the model. Extensive experimental
results conducted on two audio-visual datasets show a significant improvement
of approximately 9.8% in terms of average Mean Average Precision (MAP) over the
current state-of-the-art method, MSNSCA, for the Audio-Visual Cross-Modal
Retrieval (AV-CMR) task on the AVE dataset, indicating the effectiveness of our
proposed method.
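The paper does not include code; the following is a minimal sketch of the two-stage idea described in the abstract, assuming paired audio/visual embeddings where row i of each matrix is a positive pair. The function names, the margin value, and the interpolation coefficient `alpha` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pairwise_dist(a, b):
    # Squared Euclidean distances between rows of a and rows of b.
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)

def mine_triplets(emb_a, emb_v, margin=0.2, stage="semi_hard"):
    """Select (anchor, positive, negative) index triples across modalities.

    Stage 1 ("semi_hard"): negatives farther than the positive but still
    within the margin, giving a low-loss starting curriculum.
    Stage 2 ("hard"): the hardest negative, i.e. one closer than the positive.
    """
    d = pairwise_dist(emb_a, emb_v)          # (n, n) cross-modal distances
    pos = np.diag(d)                         # distance to the true pair
    n = d.shape[0]
    triplets = []
    for i in range(n):
        neg = np.delete(np.arange(n), i)     # candidate negative indices
        dn = d[i, neg]
        if stage == "semi_hard":
            mask = (dn > pos[i]) & (dn < pos[i] + margin)
            if mask.any():
                j = neg[mask][np.argmin(dn[mask])]
                triplets.append((i, i, j))
        else:  # "hard"
            if dn.min() < pos[i]:
                triplets.append((i, i, neg[np.argmin(dn)]))
    return triplets

def interpolate_negatives(emb_v, triplets, alpha=0.5):
    # Stage-2 augmentation: synthesize candidate hard negatives by linear
    # interpolation between a mined negative and the positive embedding,
    # compensating for the scarcity of naturally occurring hard triplets.
    return [alpha * emb_v[j] + (1 - alpha) * emb_v[p]
            for (_, p, j) in triplets]
```

In a full pipeline, `emb_a` and `emb_v` would come from the audio and visual encoders, stage 1 would train on the semi-hard triplets, and stage 2 would re-mine hard triplets in the space augmented by `interpolate_negatives`.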
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- A Two-Stage Progressive Pre-training using Multi-Modal Contrastive Masked Autoencoders [5.069884983892437]
We propose a new progressive pre-training method for image understanding tasks which leverages RGB-D datasets.
In the first stage, we pre-train the model using contrastive learning to learn cross-modal representations.
In the second stage, we further pre-train the model using masked autoencoding and denoising/noise prediction.
Our approach is scalable, robust, and suitable for pre-training on RGB-D datasets.
arXiv Detail & Related papers (2024-08-05T05:33:59Z)
- Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models [62.155612146799314]
We propose a novel two-stage training strategy termed Step-Adaptive Training.
In the initial stage, a base denoising model is trained to encompass all timesteps.
We partition the timesteps into distinct groups, fine-tuning the model within each group to achieve specialized denoising capabilities.
arXiv Detail & Related papers (2023-12-20T03:32:58Z)
- Training-based Model Refinement and Representation Disagreement for Semi-Supervised Object Detection [8.096382537967637]
Semi-supervised object detection (SSOD) aims to improve the performance and generalization of existing object detectors.
Recent SSOD methods are still challenged by inadequate model refinement using the classical exponential moving average (EMA) strategy.
This paper proposes a novel training-based model refinement stage and a simple yet effective representation disagreement (RD) strategy.
arXiv Detail & Related papers (2023-07-25T18:26:22Z)
- Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization [68.49738668084693]
Self-supervised pre-training has recently demonstrated success on large-scale multimodal data.
Cross-modality alignment (CMA) provides only weak and noisy supervision.
CMA might cause conflicts and biases among modalities.
arXiv Detail & Related papers (2022-11-03T18:12:32Z)
- MM-Align: Learning Optimal Transport-based Alignment Dynamics for Fast and Accurate Inference on Missing Modality Sequences [32.42505193560884]
We present a novel approach named MM-Align to address the missing-modality inference problem.
MM-Align learns to capture and imitate the alignment dynamics between modality sequences.
Our method can perform more accurate and faster inference and relieve overfitting under various missing conditions.
arXiv Detail & Related papers (2022-10-23T17:44:56Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Modality-Aware Triplet Hard Mining for Zero-shot Sketch-Based Image Retrieval [51.42470171051007]
This paper tackles the Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) problem from the viewpoint of cross-modality metric learning.
By combining two fundamental learning approaches in DML, i.e., classification training and pairwise training, we set up a strong baseline for ZS-SBIR.
We show that Modality-Aware Triplet Hard Mining (MATHM) enhances the baseline with three types of pairwise learning.
arXiv Detail & Related papers (2021-12-15T08:36:44Z)
- LoOp: Looking for Optimal Hard Negative Embeddings for Deep Metric Learning [17.571160136568455]
We propose a novel approach that looks for optimal hard negatives (LoOp) in the embedding space.
Unlike mining-based methods, our approach considers the entire space between pairs of embeddings to calculate the optimal hard negatives.
arXiv Detail & Related papers (2021-08-20T19:21:33Z)
- A novel three-stage training strategy for long-tailed classification [0.0]
Long-tailed distribution datasets pose great challenges for deep learning based classification models.
We establish a novel three-stage training strategy, which achieves excellent results on SAR image datasets with long-tailed distributions.
arXiv Detail & Related papers (2021-04-20T08:29:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.