MXM-CLR: A Unified Framework for Contrastive Learning of Multifold
Cross-Modal Representations
- URL: http://arxiv.org/abs/2303.10839v2
- Date: Tue, 21 Mar 2023 02:37:37 GMT
- Title: MXM-CLR: A Unified Framework for Contrastive Learning of Multifold
Cross-Modal Representations
- Authors: Ye Wang, Bowei Jiang, Changqing Zou, Rui Ma
- Abstract summary: We propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations.
MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities.
Results show the superiority of MXM-CLR in learning better representations for the multifold data.
- Score: 14.355743915598554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multifold observations are common for different data modalities, e.g., a 3D
shape can be represented by multi-view images and an image can be described
with different captions. Existing cross-modal contrastive representation
learning (XM-CLR) methods such as CLIP are not fully suitable for multifold
data as they only consider one positive pair and treat other pairs as negative
when computing the contrastive loss. In this paper, we propose MXM-CLR, a
unified framework for contrastive learning of multifold cross-modal
representations. MXM-CLR explicitly models and learns the relationships between
multifold observations of instances from different modalities for more
comprehensive representation learning. The key of MXM-CLR is a novel
multifold-aware hybrid loss which considers multiple positive observations when
computing the hard and soft relationships for the cross-modal data pairs. We
conduct quantitative and qualitative comparisons with SOTA baselines for
cross-modal retrieval tasks on the Text2Shape and Flickr30K datasets. We also
perform extensive evaluations on the adaptability and generalizability of
MXM-CLR, as well as ablation studies on the loss design and effects of batch
sizes. The results show the superiority of MXM-CLR in learning better
representations for the multifold data. The code is available at
https://github.com/JLU-ICL/MXM-CLR.
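To make the "multiple positive observations" idea concrete, the sketch below shows a generic multi-positive InfoNCE-style loss: instead of treating only one column per row of the similarity matrix as positive (as CLIP does), each anchor may have a set of positive columns whose scores are jointly contrasted against all candidates. This is an illustration of the general principle, not MXM-CLR's actual hybrid loss, which additionally combines hard and soft relationships; the function name and data are hypothetical.

```python
import math

def multi_positive_nce(sim, positives, tau=0.07):
    """Contrastive loss with several positives per anchor.

    sim:       N x M similarity matrix (list of lists), e.g. image rows
               vs. caption columns
    positives: positives[i] is the set of column indices that are
               positive for anchor row i (multifold observations)
    tau:       softmax temperature
    """
    losses = []
    for i, row in enumerate(sim):
        exps = [math.exp(s / tau) for s in row]
        denom = sum(exps)                             # all candidates
        numer = sum(exps[j] for j in positives[i])    # all positives of anchor i
        losses.append(-math.log(numer / denom))
    return sum(losses) / len(losses)

# Row 0 has two positive observations (columns 0 and 2); row 1 has one.
sim = [[1.0, 0.1, 0.9],
       [0.2, 1.0, 0.0]]
loss = multi_positive_nce(sim, positives=[{0, 2}, {1}])
```

With a single positive per row this reduces to the standard one-positive-pair formulation; admitting the extra positives lowers the loss for correctly matched multifold observations rather than penalizing them as negatives.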
Related papers
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples.
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z) - MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a comprehensive dataset for studying inter-object relations with Multi-modal Large Language Models (MLLMs).
MMRel features three distinctive attributes: (i) it includes over 15K question-answer pairs sourced from three distinct domains, ensuring large scale and high diversity; (ii) it contains a subset featuring highly unusual relations, on which MLLMs often fail due to hallucinations, making it very challenging; (iii) it provides manually verified high-quality labels for inter-object relations.
arXiv Detail & Related papers (2024-06-13T13:51:59Z) - MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection [8.23801404004195]
Prohibited Item detection in X-ray images is one of the most effective security inspection methods.
The unique overlapping phenomena in X-ray images lead to the coupling of foreground and background features.
We propose a Multi-Class Min-Margin Contrastive Learning (MMCL) method to clarify the category semantic information of content queries.
arXiv Detail & Related papers (2024-06-05T12:07:58Z) - MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for large-scale multimodal pre-training.
arXiv Detail & Related papers (2024-03-14T17:51:32Z) - SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodality Large Language Model (MLLM) series.
We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks.
We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z) - CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly
Supervised Text-based Person Re-Identification [10.64115914599574]
Weakly supervised text-based person re-identification (TPRe-ID) seeks to retrieve images of a target person using textual descriptions.
The primary challenge is the intra-class differences, encompassing intra-modal feature variations and cross-modal semantic gaps.
In practice, CPCL introduces the CLIP model to weakly supervised TPRe-ID for the first time, mapping visual and textual instances into a shared latent space.
arXiv Detail & Related papers (2024-01-18T14:27:01Z) - Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as little as 50K multimodal data, M$2$IXT can boost the few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z) - On the Generalization of Multi-modal Contrastive Learning [21.849681446573257]
We study how MMCL extracts useful visual representation from multi-modal pairs.
We show that text pairs induce more semantically consistent and diverse positive pairs, which, according to our analysis, provably benefit downstream generalization.
Inspired by this finding, we propose CLIP-guided resampling methods to significantly improve the downstream performance of SSCL on ImageNet.
arXiv Detail & Related papers (2023-06-07T09:13:56Z) - Multi-view Multi-behavior Contrastive Learning in Recommendation [52.42597422620091]
Multi-behavior recommendation (MBR) aims to jointly consider multiple behaviors to improve the target behavior's performance.
We propose a novel Multi-behavior Multi-view Contrastive Learning Recommendation framework.
arXiv Detail & Related papers (2022-03-20T15:13:28Z) - Hierarchical Cross-Modality Semantic Correlation Learning Model for
Multimodal Summarization [4.714335699701277]
Multimodal summarization with multimodal output (MSMO) generates a summary with both textual and visual content.
Traditional MSMO methods indistinguishably handle different modalities of data by learning a representation for the whole data.
We propose a hierarchical cross-modality semantic correlation learning model (HCSCL) to learn the intra- and inter-modal correlation existing in the multimodal data.
arXiv Detail & Related papers (2021-12-16T01:46:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.