SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2111.05814v1
- Date: Wed, 10 Nov 2021 17:17:09 GMT
- Title: SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval
- Authors: Minyoung Kim
- Abstract summary: We propose a novel loss function that is based on self-labeling of the unknown classes.
We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval.
- Score: 15.522964295287425
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We tackle the cross-modal retrieval problem, where training is
supervised only by the relevant multi-modal pairs in the data. Contrastive
learning is the most popular approach for this task. However, its sampling
complexity for learning is quadratic in the number of training data points.
Moreover, it makes the potentially wrong assumption that instances in different
pairs are automatically irrelevant. To address these issues, we propose a novel
loss function that is based on self-labeling of the unknown classes.
Specifically, we aim to predict class labels of the data instances in each
modality, and assign those labels to the corresponding instances in the other
modality (i.e., swapping the pseudo labels). With these swapped labels, we
learn the data embedding for each modality using the supervised cross-entropy
loss, hence leading to linear sampling complexity. We also maintain queues
that store the embeddings of the most recent batches, for which cluster
assignment and embedding learning are done simultaneously in an online
fashion. This removes the computational overhead of intermittent epochs that
sweep the entire training data for offline clustering. We tested our approach on
several real-world cross-modal retrieval problems, including text-based video
retrieval, sketch-based image retrieval, and image-text retrieval; on all
these tasks, our method achieves significant performance improvements over
contrastive learning.
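The swapped-assignment idea in the abstract can be sketched as follows. This is a hypothetical minimal illustration in NumPy, not the authors' implementation: `prototypes` stands in for the learnable class centers, soft class predictions are computed per modality, hard pseudo-labels from one modality supervise the other via cross-entropy, and the online clustering over queued embeddings described in the paper is replaced here by a simple argmax.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def swamp_loss(z_text, z_image, prototypes, temperature=0.1):
    """Swapped-assignment loss sketch.

    z_text, z_image: (B, D) L2-normalized embeddings of paired instances.
    prototypes:      (K, D) learnable class/cluster centers.
    """
    # Predicted class distributions in each modality.
    p_text = softmax(z_text @ prototypes.T / temperature)
    p_image = softmax(z_image @ prototypes.T / temperature)
    # Pseudo-labels: the most confident class per instance.
    y_text = p_text.argmax(axis=1)
    y_image = p_image.argmax(axis=1)
    # Swap: supervise each modality with the other modality's labels.
    B = len(z_text)
    ce_text = -np.log(p_text[np.arange(B), y_image] + 1e-12).mean()
    ce_image = -np.log(p_image[np.arange(B), y_text] + 1e-12).mean()
    return ce_text + ce_image
```

Note the per-instance cross-entropy terms: each instance is compared against K prototypes rather than against every other instance, which is the source of the linear (rather than quadratic) sampling complexity claimed in the abstract.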
Related papers
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - MILD: Modeling the Instance Learning Dynamics for Learning with Noisy
Labels [19.650299232829546]
We propose an iterative selection approach based on the Weibull mixture model to identify clean data.
In particular, we measure the difficulty of memorization for each instance via the transition times between being misclassified and being memorized.
Our strategy outperforms existing noisy-label learning methods.
arXiv Detail & Related papers (2023-06-20T14:26:53Z) - Association Graph Learning for Multi-Task Classification with Category
Shifts [68.58829338426712]
We focus on multi-task classification, where related classification tasks share the same label space and are learned simultaneously.
We learn an association graph to transfer knowledge among tasks for missing classes.
Our method consistently performs better than representative baselines.
arXiv Detail & Related papers (2022-10-10T12:37:41Z) - BatchFormer: Learning to Explore Sample Relationships for Robust
Representation Learning [93.38239238988719]
We propose to enable deep neural networks with the ability to learn the sample relationships from each mini-batch.
BatchFormer is applied to the batch dimension of each mini-batch to implicitly explore sample relationships during training.
We perform extensive experiments on over ten datasets and the proposed method achieves significant improvements on different data scarcity applications.
arXiv Detail & Related papers (2022-03-03T05:31:33Z) - Leveraging Ensembles and Self-Supervised Learning for Fully-Unsupervised
Person Re-Identification and Text Authorship Attribution [77.85461690214551]
Learning from fully-unlabeled data is challenging in Multimedia Forensics problems, such as Person Re-Identification and Text Authorship Attribution.
Recent self-supervised learning methods have been shown to be effective when dealing with fully-unlabeled data in cases where the underlying classes have significant semantic differences.
We propose a strategy to tackle Person Re-Identification and Text Authorship Attribution by enabling learning from unlabeled data even when samples from different classes are not prominently diverse.
arXiv Detail & Related papers (2022-02-07T13:08:11Z) - Using Self-Supervised Pretext Tasks for Active Learning [7.214674613451605]
We propose a novel active learning approach that utilizes self-supervised pretext tasks and a unique data sampler to select data that are both difficult and representative.
The pretext task learner is trained on the unlabeled set, and the unlabeled data are sorted and grouped into batches by their pretext task losses.
In each iteration, the main task model is used to sample the most uncertain data in a batch to be annotated.
arXiv Detail & Related papers (2022-01-19T07:58:06Z) - Multi-domain semantic segmentation with overlapping labels [1.4120796122384087]
We propose a principled method for seamless learning on datasets with overlapping classes based on partial labels and probabilistic loss.
Our method achieves competitive within-dataset and cross-dataset generalization, as well as the ability to learn visual concepts which are not separately labeled in any of the training datasets.
arXiv Detail & Related papers (2021-08-25T13:25:41Z) - Multimodal Clustering Networks for Self-supervised Learning from
Unlabeled Videos [69.61522804742427]
This paper proposes a self-supervised training framework that learns a common multimodal embedding space.
We extend the concept of instance-level contrastive learning with a multimodal clustering step to capture semantic similarities across modalities.
The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains.
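The instance-level contrastive learning that this clustering step extends can be sketched as a symmetric InfoNCE loss over a batch of paired embeddings. This is a hypothetical NumPy illustration, not the paper's code; note the B×B similarity matrix, which is the quadratic pairwise term discussed in the SwAMP abstract above.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    z_a, z_b: (B, D) L2-normalized embeddings; row i of z_a is paired
    with row i of z_b, and all other rows are treated as negatives.
    """
    B = len(z_a)
    idx = np.arange(B)

    def one_direction(x, y):
        logits = x @ y.T / temperature              # (B, B) similarities
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()               # matched pairs on diagonal

    # Average the two retrieval directions (a -> b and b -> a).
    return (one_direction(z_a, z_b) + one_direction(z_b, z_a)) / 2
```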
arXiv Detail & Related papers (2021-04-26T15:55:01Z) - Connecting Images through Time and Sources: Introducing Low-data,
Heterogeneous Instance Retrieval [3.6526118822907594]
We show that it is not trivial to pick features responding well to a panel of variations and semantic content.
Introducing a new enhanced version of the Alegoria benchmark, we compare descriptors using the detailed annotations.
arXiv Detail & Related papers (2021-03-19T10:54:51Z) - Improving filling level classification with adversarial training [90.01594595780928]
We investigate the problem of classifying - from a single image - the level of content in a cup or a drinking glass.
We use adversarial training in a generic source dataset and then refine the training with a task-specific dataset.
We show that transfer learning with adversarial training in the source domain consistently improves the classification accuracy on the test set.
arXiv Detail & Related papers (2021-02-08T08:32:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.