Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
- URL: http://arxiv.org/abs/2411.01053v1
- Date: Fri, 01 Nov 2024 21:49:25 GMT
- Title: Contrasting with Symile: Simple Model-Agnostic Representation Learning for Unlimited Modalities
- Authors: Adriel Saporta, Aahlad Puli, Mark Goldstein, Rajesh Ranganath,
- Abstract summary: Contrastive learning methods, such as CLIP, leverage naturally paired data-for example, images and their corresponding text captions-to learn general representations that transfer efficiently to downstream tasks.
We show that the pairwise application of CLIP fails to capture joint information between modalities, thereby limiting the quality of the learned representations.
We present Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities.
- Score: 21.745266161874195
- License:
- Abstract: Contrastive learning methods, such as CLIP, leverage naturally paired data-for example, images and their corresponding text captions-to learn general representations that transfer efficiently to downstream tasks. While such approaches are generally applied to two modalities, domains such as robotics, healthcare, and video need to support many types of data at once. We show that the pairwise application of CLIP fails to capture joint information between modalities, thereby limiting the quality of the learned representations. To address this issue, we present Symile, a simple contrastive learning approach that captures higher-order information between any number of modalities. Symile provides a flexible, architecture-agnostic objective for learning modality-specific representations. To develop Symile's objective, we derive a lower bound on total correlation, and show that Symile representations for any set of modalities form a sufficient statistic for predicting the remaining modalities. Symile outperforms pairwise CLIP, even with modalities missing in the data, on cross-modal classification and retrieval across several experiments including on an original multilingual dataset of 33M image, text and audio samples and a clinical dataset of chest X-rays, electrocardiograms, and laboratory measurements. All datasets and code used in this work are publicly available at https://github.com/rajesh-lab/symile.
Related papers
- Semantic Meta-Split Learning: A TinyML Scheme for Few-Shot Wireless Image Classification [50.28867343337997]
This work presents a TinyML-based semantic communication framework for few-shot wireless image classification.
We exploit split-learning to limit the computations performed by the end-users while ensuring privacy-preserving.
meta-learning overcomes data availability concerns and speeds up training by utilizing similarly trained tasks.
arXiv Detail & Related papers (2024-09-03T05:56:55Z) - TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data [6.414759311130015]
We propose TIP, a novel framework for learning multimodal representations robust to incomplete data.
Specifically, TIP investigates a self-supervised learning (SSL) strategy, including a masked reconstruction task for tackling data missingness.
TIP outperforms state-of-the-art supervised/SSL image/multimodal algorithms in both complete and incomplete data scenarios.
arXiv Detail & Related papers (2024-07-10T12:16:15Z) - MLIP: Medical Language-Image Pre-training with Masked Local
Representation Learning [20.33625985769796]
Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs.
We propose a Medical Language-Image Pre-training framework, which exploits the limited image-text medical data more efficiently.
Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
arXiv Detail & Related papers (2024-01-03T07:54:13Z) - Self-Supervised Neuron Segmentation with Multi-Agent Reinforcement
Learning [53.00683059396803]
Mask image model (MIM) has been widely used due to its simplicity and effectiveness in recovering original information from masked images.
We propose a decision-based MIM that utilizes reinforcement learning (RL) to automatically search for optimal image masking ratio and masking strategy.
Our approach has a significant advantage over alternative self-supervised methods on the task of neuron segmentation.
arXiv Detail & Related papers (2023-10-06T10:40:46Z) - Vision-Language Modelling For Radiological Imaging and Reports In The
Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language
Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
arXiv Detail & Related papers (2023-01-11T16:35:33Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP)
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Momentum Contrastive Voxel-wise Representation Learning for
Semi-supervised Volumetric Medical Image Segmentation [2.3322477552758234]
We present a novel Contrastive Voxel-wise Representation (CVRL) method with geometric constraints to learn global-local visual representations for medical image segmentation.
Our framework can effectively learn global and local features by capturing 3D spatial context and rich anatomical information.
arXiv Detail & Related papers (2021-05-14T20:27:23Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.