MXM-CLR: A Unified Framework for Contrastive Learning of Multifold
Cross-Modal Representations
- URL: http://arxiv.org/abs/2303.10839v2
- Date: Tue, 21 Mar 2023 02:37:37 GMT
- Title: MXM-CLR: A Unified Framework for Contrastive Learning of Multifold
Cross-Modal Representations
- Authors: Ye Wang, Bowei Jiang, Changqing Zou, Rui Ma
- Abstract summary: We propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations.
MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities.
Results show the superiority of MXM-CLR in learning better representations for the multifold data.
- Score: 14.355743915598554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multifold observations are common for different data modalities, e.g., a 3D
shape can be represented by multi-view images and an image can be described
with different captions. Existing cross-modal contrastive representation
learning (XM-CLR) methods such as CLIP are not fully suitable for multifold
data as they only consider one positive pair and treat other pairs as negative
when computing the contrastive loss. In this paper, we propose MXM-CLR, a
unified framework for contrastive learning of multifold cross-modal
representations. MXM-CLR explicitly models and learns the relationships between
multifold observations of instances from different modalities for more
comprehensive representation learning. The key of MXM-CLR is a novel
multifold-aware hybrid loss which considers multiple positive observations when
computing the hard and soft relationships for the cross-modal data pairs. We
conduct quantitative and qualitative comparisons with SOTA baselines for
cross-modal retrieval tasks on the Text2Shape and Flickr30K datasets. We also
perform extensive evaluations on the adaptability and generalizability of
MXM-CLR, as well as ablation studies on the loss design and effects of batch
sizes. The results show the superiority of MXM-CLR in learning better
representations for the multifold data. The code is available at
https://github.com/JLU-ICL/MXM-CLR.
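To make the "multiple positive observations" idea concrete, the sketch below shows a generic multi-positive InfoNCE-style loss: instead of treating only one column per row of the similarity matrix as positive (as CLIP does), each anchor may have a set of positive columns whose scores are jointly contrasted against all candidates. This is an illustration of the general principle, not MXM-CLR's actual hybrid loss, which additionally combines hard and soft relationships; the function name and data are hypothetical.

```python
import math

def multi_positive_nce(sim, positives, tau=0.07):
    """Contrastive loss with several positives per anchor.

    sim:       N x M similarity matrix (list of lists), e.g. image rows
               vs. caption columns
    positives: positives[i] is the set of column indices that are
               positive for anchor row i (multifold observations)
    tau:       softmax temperature
    """
    losses = []
    for i, row in enumerate(sim):
        exps = [math.exp(s / tau) for s in row]
        denom = sum(exps)                             # all candidates
        numer = sum(exps[j] for j in positives[i])    # all positives of anchor i
        losses.append(-math.log(numer / denom))
    return sum(losses) / len(losses)

# Row 0 has two positive observations (columns 0 and 2); row 1 has one.
sim = [[1.0, 0.1, 0.9],
       [0.2, 1.0, 0.0]]
loss = multi_positive_nce(sim, positives=[{0, 2}, {1}])
```

With a single positive per row this reduces to the standard one-positive-pair formulation; admitting the extra positives lowers the loss for correctly matched multifold observations rather than penalizing them as negatives.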
Related papers
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks with a total of 13K annotated samples.
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z) - MMRel: A Relation Understanding Dataset and Benchmark in the MLLM Era [72.95901753186227]
Multi-Modal Relation Understanding (MMRel) is a comprehensive dataset for studying inter-object relations with Multi-modal Large Language Models (MLLMs).
MMRel features three distinctive attributes: (i) it includes over 15K question-answer pairs sourced from three distinct domains, ensuring large scale and high diversity; (ii) it contains a subset featuring highly unusual relations, on which MLLMs often fail due to hallucinations, making it very challenging; (iii) it provides manually verified high-quality labels for inter-object relations.
arXiv Detail & Related papers (2024-06-13T13:51:59Z) - MMCL: Boosting Deformable DETR-Based Detectors with Multi-Class Min-Margin Contrastive Learning for Superior Prohibited Item Detection [8.23801404004195]
Prohibited Item detection in X-ray images is one of the most effective security inspection methods.
The unique overlapping phenomena in X-ray images lead to the coupling of foreground and background features.
We propose a Multi-Class Min-Margin Contrastive Learning (MMCL) method to clarify the category semantic information of content queries.
arXiv Detail & Related papers (2024-06-05T12:07:58Z) - MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for large-scale multimodal pre-training.
arXiv Detail & Related papers (2024-03-14T17:51:32Z) - SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
We develop an extensive Multimodality Large Language Model (MLLM) series.
We assemble a comprehensive dataset covering publicly available resources in language, vision, and vision-language tasks.
We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
arXiv Detail & Related papers (2024-02-08T18:59:48Z) - CPCL: Cross-Modal Prototypical Contrastive Learning for Weakly
Supervised Text-based Person Re-Identification [10.64115914599574]
Weakly supervised text-based person re-identification (TPRe-ID) seeks to retrieve images of a target person using textual descriptions.
The primary challenge is the intra-class differences, encompassing intra-modal feature variations and cross-modal semantic gaps.
In practice, CPCL introduces the CLIP model to weakly supervised TPRe-ID for the first time, mapping visual and textual instances into a shared latent space.
arXiv Detail & Related papers (2024-01-18T14:27:01Z) - Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as little as 50K multimodal data, M$2$IXT can boost the few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z) - On the Generalization of Multi-modal Contrastive Learning [21.849681446573257]
We study how MMCL extracts useful visual representation from multi-modal pairs.
We show that text pairs induce more semantically consistent and diverse positive pairs, which, according to our analysis, provably benefit downstream generalization.
Inspired by this finding, we propose CLIP-guided resampling methods to significantly improve the downstream performance of SSCL on ImageNet.
arXiv Detail & Related papers (2023-06-07T09:13:56Z) - Multi-view Multi-behavior Contrastive Learning in Recommendation [52.42597422620091]
Multi-behavior recommendation (MBR) aims to jointly consider multiple behaviors to improve the target behavior's performance.
We propose a novel Multi-behavior Multi-view Contrastive Learning Recommendation framework.
arXiv Detail & Related papers (2022-03-20T15:13:28Z) - Hierarchical Cross-Modality Semantic Correlation Learning Model for
Multimodal Summarization [4.714335699701277]
Multimodal summarization with multimodal output (MSMO) generates a summary with both textual and visual content.
Traditional MSMO methods indistinguishably handle different modalities of data by learning a representation for the whole data.
We propose a hierarchical cross-modality semantic correlation learning model (HCSCL) to learn the intra- and inter-modal correlation existing in the multimodal data.
arXiv Detail & Related papers (2021-12-16T01:46:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.