MXM-CLR: A Unified Framework for Contrastive Learning of Multifold
Cross-Modal Representations
- URL: http://arxiv.org/abs/2303.10839v2
- Date: Tue, 21 Mar 2023 02:37:37 GMT
- Title: MXM-CLR: A Unified Framework for Contrastive Learning of Multifold
Cross-Modal Representations
- Authors: Ye Wang, Bowei Jiang, Changqing Zou, Rui Ma
- Abstract summary: We propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations.
MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities.
Results show the superiority of MXM-CLR in learning better representations for multifold data.
- Score: 14.355743915598554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multifold observations are common for different data modalities, e.g., a 3D
shape can be represented by multi-view images and an image can be described
with different captions. Existing cross-modal contrastive representation
learning (XM-CLR) methods such as CLIP are not fully suitable for multifold
data as they only consider one positive pair and treat other pairs as negative
when computing the contrastive loss. In this paper, we propose MXM-CLR, a
unified framework for contrastive learning of multifold cross-modal
representations. MXM-CLR explicitly models and learns the relationships between
multifold observations of instances from different modalities for more
comprehensive representation learning. The key of MXM-CLR is a novel
multifold-aware hybrid loss which considers multiple positive observations when
computing the hard and soft relationships for the cross-modal data pairs. We
conduct quantitative and qualitative comparisons with SOTA baselines for
cross-modal retrieval tasks on the Text2Shape and Flickr30K datasets. We also
perform extensive evaluations on the adaptability and generalizability of
MXM-CLR, as well as ablation studies on the loss design and effects of batch
sizes. The results show the superiority of MXM-CLR in learning better
representations for the multifold data. The code is available at
https://github.com/JLU-ICL/MXM-CLR.
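
Since the core contribution is the multifold-aware hybrid loss, the sketch below (PyTorch, written for this summary rather than taken from the authors' repository) contrasts the single-positive CLIP-style objective with a multi-positive variant that combines a hard, uniformly weighted target over all observations of an instance with a soft, similarity-weighted one. The `pos_mask`, the `alpha` weight, and the exact form of the soft targets are illustrative assumptions; the official code at the GitHub link above is the reference implementation.

```python
import torch
import torch.nn.functional as F


def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Single-positive XM-CLR objective (CLIP-style): only the diagonal
    image-text pair of each row is treated as positive."""
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multifold_hybrid_loss(img_emb, txt_emb, pos_mask, temperature=0.07, alpha=0.5):
    """Multifold-aware sketch: every observation of the same instance counts as
    a positive. A "hard" term spreads the target uniformly over the positives;
    a "soft" term weights them by their (detached) similarity. `alpha` is an
    assumed balancing hyperparameter, not taken from the paper."""
    logits = img_emb @ txt_emb.t() / temperature          # (N_img, N_txt)
    log_prob = F.log_softmax(logits, dim=1)

    # Hard relationships: uniform distribution over all positive observations.
    hard_targets = pos_mask / pos_mask.sum(dim=1, keepdim=True)
    hard_loss = -(hard_targets * log_prob).sum(dim=1).mean()

    # Soft relationships: closer positive observations receive larger weight.
    masked = logits.detach().masked_fill(~pos_mask.bool(), float('-inf'))
    soft_targets = F.softmax(masked, dim=1)
    soft_loss = -(soft_targets * log_prob).sum(dim=1).mean()

    return alpha * hard_loss + (1.0 - alpha) * soft_loss


if __name__ == "__main__":
    # Toy multifold batch: 2 shapes, each with 2 rendered views and 2 captions.
    torch.manual_seed(0)
    img_emb = F.normalize(torch.randn(4, 128), dim=1)   # views 0,1 -> shape A; 2,3 -> shape B
    txt_emb = F.normalize(torch.randn(4, 128), dim=1)   # captions 0,1 -> shape A; 2,3 -> shape B
    pos_mask = torch.block_diag(torch.ones(2, 2), torch.ones(2, 2))
    print(clip_style_loss(img_emb, txt_emb).item())
    print(multifold_hybrid_loss(img_emb, txt_emb, pos_mask).item())
```

In the toy batch, each shape contributes two views and two captions, so `pos_mask` contains 2x2 blocks of positives instead of CLIP's one-hot diagonal; in practice a symmetric text-to-image term would be added as well.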
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval [73.77101139365912]
We propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling.
Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map.
We employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations.
arXiv Detail & Related papers (2024-08-20T06:30:37Z)
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark, MIBench, to comprehensively evaluate the fine-grained abilities of MLLMs in multi-image scenarios.
Specifically, MIBench categorizes multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC).
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z)
- MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a comprehensive dataset for studying inter-object relations with Multi-modal Large Language Models (MLLMs).
MMRel features three distinctive attributes: (i) it includes over 15K question-answer pairs sourced from three distinct domains, ensuring large scale and high diversity; (ii) it contains a subset of highly unusual relations, on which MLLMs often fail due to hallucination, making it very challenging; and (iii) it provides manually verified, high-quality labels for inter-object relations.
arXiv Detail & Related papers (2024-06-13T13:51:59Z)
- MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection [8.23801404004195]
Prohibited item detection in X-ray images is one of the most effective security inspection methods.
However, the overlapping of items, a phenomenon unique to X-ray images, leads to the coupling of foreground and background features.
We propose a Multi-Class Min-Margin Contrastive Learning (MMCL) method to clarify the category semantic information of content queries.
arXiv Detail & Related papers (2024-06-05T12:07:58Z)
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodal Large Language Model (MLLM) series.
We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks.
We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z)
- CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly Supervised Text-based Person Re-Identification [10.64115914599574]
Weakly supervised text-based person re-identification (TPRe-ID) seeks to retrieve images of a target person using textual descriptions.
The primary challenge is the intra-class differences, encompassing intra-modal feature variations and cross-modal semantic gaps.
In practice, CPCL introduces the CLIP model to weakly supervised TPRe-ID for the first time, mapping visual and textual instances into a shared latent space.
arXiv Detail & Related papers (2024-01-18T14:27:01Z)
- Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as little as 50K multimodal data, M$2$IXT can boost the few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z)
- On the Generalization of Multi-modal Contrastive Learning [21.849681446573257]
We study how MMCL extracts useful visual representations from multi-modal pairs.
We show that text pairs induce more semantically consistent and diverse positive pairs, which, according to our analysis, provably benefit downstream generalization.
Inspired by this finding, we propose CLIP-guided resampling methods to significantly improve the downstream performance of SSCL on ImageNet.
arXiv Detail & Related papers (2023-06-07T09:13:56Z)
- Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data [19.72282903349282]
We study a general class of nonlinear loss functions for multimodal contrastive learning (MMCL).
We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality.
When we have access to additional unpaired data, we propose a new MMCL loss that incorporates additional unpaired datasets.
arXiv Detail & Related papers (2023-02-13T10:11:05Z)
- Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization [4.714335699701277]
Multimodal summarization with multimodal output (MSMO) generates a summary with both textual and visual content.
Traditional MSMO methods handle different modalities of data indiscriminately by learning a single representation for the whole input.
We propose a hierarchical cross-modality semantic correlation learning model (HCSCL) to learn the intra- and inter-modal correlation existing in the multimodal data.
arXiv Detail & Related papers (2021-12-16T01:46:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.