STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
- URL: http://arxiv.org/abs/2503.06277v3
- Date: Sat, 15 Mar 2025 15:31:28 GMT
- Title: STiL: Semi-supervised Tabular-Image Learning for Comprehensive Task-Relevant Information Exploration in Multimodal Classification
- Authors: Siyi Du, Xinzhe Luo, Declan P. O'Regan, Chen Qin
- Abstract summary: Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. We propose STiL, a novel SemiSL framework that addresses a Modality Information Gap by comprehensively exploring task-relevant information.
- Score: 6.130981749820211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal image-tabular learning is gaining attention, yet it faces challenges due to limited labeled data. While earlier work has applied self-supervised learning (SSL) to unlabeled data, its task-agnostic nature often results in learning suboptimal features for downstream tasks. Semi-supervised learning (SemiSL), which combines labeled and unlabeled data, offers a promising solution. However, existing multimodal SemiSL methods typically focus on unimodal or modality-shared features, ignoring valuable task-relevant modality-specific information, leading to a Modality Information Gap. In this paper, we propose STiL, a novel SemiSL tabular-image framework that addresses this gap by comprehensively exploring task-relevant information. STiL features a new disentangled contrastive consistency module to learn cross-modal invariant representations of shared information while retaining modality-specific information via disentanglement. We also propose a novel consensus-guided pseudo-labeling strategy to generate reliable pseudo-labels based on classifier consensus, along with a new prototype-guided label smoothing technique to refine pseudo-label quality with prototype embeddings, thereby enhancing task-relevant information learning in unlabeled data. Experiments on natural and medical image datasets show that STiL outperforms the state-of-the-art supervised/SSL/SemiSL image/multimodal approaches. Our code is available at https://github.com/siyi-wind/STiL.
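To make the two pseudo-labeling ideas in the abstract concrete, here is a minimal PyTorch sketch of consensus-guided pseudo-labeling and prototype-guided label smoothing. The function names, the agreement rule, and the hyperparameters (tau, alpha, temp) are illustrative assumptions, not the released STiL code:

```python
import torch
import torch.nn.functional as F

def consensus_pseudo_labels(logits_img, logits_tab, logits_mm, tau=0.9):
    """Keep a pseudo-label only when all classifiers agree on the class
    and the multimodal confidence exceeds tau. Hypothetical sketch of
    consensus-guided pseudo-labeling; the actual STiL criterion may differ."""
    p_img, p_tab, p_mm = (F.softmax(l, dim=1) for l in (logits_img, logits_tab, logits_mm))
    y_img, y_tab, y_mm = p_img.argmax(1), p_tab.argmax(1), p_mm.argmax(1)
    agree = (y_img == y_mm) & (y_tab == y_mm)           # classifier consensus
    confident = p_mm.max(1).values > tau                # confidence gate
    return y_mm, agree & confident

def prototype_smoothed_targets(z, prototypes, hard_labels, alpha=0.1, temp=0.1):
    """Illustrative prototype-guided label smoothing: soften hard pseudo-labels
    with similarities between sample embeddings and class prototypes."""
    sim = F.softmax(z @ prototypes.T / temp, dim=1)     # (N, C) prototype affinities
    one_hot = F.one_hot(hard_labels, prototypes.size(0)).float()
    return (1 - alpha) * one_hot + alpha * sim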
Related papers
- Enhancing Semi-supervised Learning with Noisy Zero-shot Pseudolabels [3.1614158472531435]
We present ZMT (Zero-Shot Multi-Task Learning), a framework that jointly optimizes zero-shot pseudo-labels and unsupervised representation learning objectives.
Our method introduces a multi-task learning-based mechanism that incorporates pseudo-labels while ensuring robustness to varying pseudo-label quality.
Experiments across 8 datasets in vision, language, and audio domains demonstrate that ZMT reduces error by up to 56% compared to traditional SSL methods.
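A minimal sketch of such a multi-task objective, assuming per-sample confidence weighting as the robustness mechanism (the actual ZMT weighting may differ):

```python
import torch.nn.functional as F

def zmt_style_loss(logits, pseudo_probs, ssl_loss, lam=0.5):
    """Illustrative ZMT-style objective: a zero-shot pseudo-label term,
    down-weighted per sample by pseudo-label confidence to stay robust to
    noisy labels, plus an unsupervised representation learning term."""
    conf, pseudo = pseudo_probs.max(dim=1)              # confidence and hard label
    ce = F.cross_entropy(logits, pseudo, reduction="none")
    return (conf * ce).mean() + lam * ssl_loss
```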
arXiv Detail & Related papers (2025-02-18T06:41:53Z) - Context-Based Semantic-Aware Alignment for Semi-Supervised Multi-Label Learning [37.13424985128905]
Vision-language models pre-trained on large-scale image-text pairs can alleviate the challenge of limited labeled data in the SSMLL setting. We propose a context-based semantic-aware alignment method to solve the SSMLL problem.
arXiv Detail & Related papers (2024-12-25T09:06:54Z) - An Information Criterion for Controlled Disentanglement of Multimodal Data [39.601584166020274]
Multimodal representation learning seeks to relate and decompose information inherent in multiple modalities.
Disentangled Self-Supervised Learning (DisentangledSSL) is a novel self-supervised approach for learning disentangled representations.
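As a toy illustration of the shared/specific split (not DisentangledSSL's exact criterion), one can align the shared codes across modalities while pushing each modality-specific code away from its shared counterpart:

```python
import torch.nn.functional as F

def disentangle_loss(z1_shared, z2_shared, z1_spec, z2_spec, beta=1.0):
    """Toy disentanglement objective: pull shared codes of the two modalities
    together; penalize overlap between shared and specific codes as an
    orthogonality proxy. Hyperparameter beta is an assumption."""
    align = 1 - F.cosine_similarity(z1_shared, z2_shared, dim=1).mean()
    ortho = (F.cosine_similarity(z1_shared, z1_spec, dim=1).abs().mean()
             + F.cosine_similarity(z2_shared, z2_spec, dim=1).abs().mean())
    return align + beta * ortho
```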
arXiv Detail & Related papers (2024-10-31T14:57:31Z) - Exploiting the Semantic Knowledge of Pre-trained Text-Encoders for Continual Learning [70.64617500380287]
Continual learning allows models to learn from new data while retaining previously learned knowledge.
The semantic knowledge available in image label information can be related to previously acquired knowledge of semantic classes.
We propose integrating semantic guidance within and across tasks by capturing semantic similarity using text embeddings.
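One hedged sketch of this idea, assuming class-name text embeddings and a softmax-normalized similarity matrix as the guidance signal:

```python
import torch.nn.functional as F

def semantic_guidance_weights(new_text_emb, old_text_emb, temp=0.07):
    """Toy text-embedding semantic guidance: similarities between new-class
    and previously-learned-class name embeddings, usable to weight a
    distillation or regularization term across tasks."""
    new_n = F.normalize(new_text_emb, dim=1)
    old_n = F.normalize(old_text_emb, dim=1)
    return F.softmax(new_n @ old_n.T / temp, dim=1)     # (num_new, num_old)
```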
arXiv Detail & Related papers (2024-08-02T07:51:44Z) - TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data [6.414759311130015]
We propose TIP, a novel framework for learning multimodal representations robust to incomplete data.
Specifically, TIP investigates a self-supervised learning (SSL) strategy, including a masked reconstruction task for tackling data missingness.
TIP outperforms state-of-the-art supervised/SSL image/multimodal algorithms in both complete and incomplete data scenarios.
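A minimal sketch of a masked-reconstruction pretext task for tabular data; TIP's actual architecture (transformer encoder, etc.) is richer, and all layer sizes here are assumptions:

```python
import torch
import torch.nn as nn

class MaskedTabularReconstruction(nn.Module):
    """Randomly mask tabular features (simulating missingness), encode,
    and reconstruct the masked values. Illustrative, not TIP's code."""
    def __init__(self, n_features, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x, mask_ratio=0.3):
        mask = torch.rand_like(x) < mask_ratio          # True = masked out
        x_masked = x.masked_fill(mask, 0.0)
        recon = self.decoder(self.encoder(x_masked))
        # Loss only on masked positions, mirroring masked modeling.
        return ((recon - x) ** 2)[mask].mean()
```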
arXiv Detail & Related papers (2024-07-10T12:16:15Z) - FlexSSL : A Generic and Efficient Framework for Semi-Supervised Learning [19.774959310191623]
We develop a generic and efficient learning framework called FlexSSL.
We show that FlexSSL can consistently enhance the performance of semi-supervised learning algorithms.
arXiv Detail & Related papers (2023-12-28T08:31:56Z) - CroSSL: Cross-modal Self-Supervised Learning for Time-series through Latent Masking [11.616031590118014]
CroSSL allows for handling missing modalities and end-to-end cross-modal learning.
We evaluate our method on a wide range of data, including motion sensors.
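A sketch of latent masking as the title suggests, assuming each modality's latent vector is independently dropped before a simple mean aggregation (the real CroSSL masking and fusion may differ):

```python
import torch

def latent_masking(latents, drop_prob=0.3):
    """Zero each modality's latent with some probability before cross-modal
    aggregation, so the model learns to cope with missing modalities.
    `latents` is a list of (batch, dim) tensors, one per modality."""
    kept = []
    for z in latents:
        keep = (torch.rand(z.size(0), 1, device=z.device) > drop_prob).float()
        kept.append(z * keep)
    return torch.stack(kept, dim=1).mean(dim=1)         # simple aggregation
```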
arXiv Detail & Related papers (2023-07-31T17:10:10Z) - Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
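An illustrative InfoNCE-style version of cross-modality region contrastive learning, assuming region and text features come pre-aligned as (N, D) pairs:

```python
import torch
import torch.nn.functional as F

def region_contrastive_loss(region_feats, text_feats, temp=0.07):
    """Each image region is pulled toward its paired text feature and pushed
    from the others, replacing region regression/classification heads.
    Sketch only; the paper's exact loss may differ."""
    r = F.normalize(region_feats, dim=1)
    t = F.normalize(text_feats, dim=1)
    logits = r @ t.T / temp                              # (N, N) similarities
    targets = torch.arange(r.size(0), device=r.device)   # positives on diagonal
    return F.cross_entropy(logits, targets)
```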
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Information Symmetry Matters: A Modal-Alternating Propagation Network for Few-Shot Learning [118.45388912229494]
We propose a Modal-Alternating Propagation Network (MAP-Net) to supplement the absent semantic information of unlabeled samples.
We design a Relation Guidance (RG) strategy to guide the visual relation vectors via semantics so that the propagated information is more beneficial.
Our proposed method achieves promising performance and outperforms the state-of-the-art approaches.
arXiv Detail & Related papers (2021-09-03T03:43:53Z) - Trash to Treasure: Harvesting OOD Data with Cross-Modal Matching for Open-Set Semi-Supervised Learning [101.28281124670647]
Open-set semi-supervised learning (open-set SSL) investigates a challenging but practical scenario where out-of-distribution (OOD) samples are contained in the unlabeled data.
We propose a novel training mechanism that could effectively exploit the presence of OOD data for enhanced feature learning.
Our approach substantially lifts the performance on open-set SSL and outperforms the state-of-the-art by a large margin.
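A toy version of the cross-modal matching filter this abstract hints at, assuming cosine agreement between two representations as the in-distribution score:

```python
import torch.nn.functional as F

def split_in_and_ood(feats_a, feats_b, threshold=0.5):
    """Unlabeled samples whose two representations agree (high cosine
    similarity) are treated as in-distribution and pseudo-labeled; the rest
    are flagged OOD but can still feed an unsupervised feature objective."""
    score = F.cosine_similarity(feats_a, feats_b, dim=1)
    in_dist = score >= threshold
    return in_dist, ~in_dist
```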
arXiv Detail & Related papers (2021-08-12T09:14:44Z) - Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder with Semantic Concepts [0.9054540533394924]
Recent techniques try to learn a cross-modal mapping between the semantic space and the image space.
We propose a Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent space of image features and the semantic space.
Our results show that our proposed model outperforms the current state-of-the-art approaches for generalized zero-shot learning.
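A minimal sketch of an M-VAE-style shared latent space, where image features and semantic vectors are encoded into one latent distribution and both are reconstructed from it; dimensions and losses are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SharedLatentVAE(nn.Module):
    """Joint encoder into a shared latent; two decoders reconstruct the
    image-feature and semantic modalities. Not the authors' architecture."""
    def __init__(self, img_dim=2048, sem_dim=300, z_dim=64):
        super().__init__()
        self.enc = nn.Linear(img_dim + sem_dim, 2 * z_dim)   # mu and logvar
        self.dec_img = nn.Linear(z_dim, img_dim)
        self.dec_sem = nn.Linear(z_dim, sem_dim)

    def forward(self, img, sem):
        mu, logvar = self.enc(torch.cat([img, sem], dim=1)).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp() # reparameterize
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
        recon = ((self.dec_img(z) - img) ** 2).mean() \
              + ((self.dec_sem(z) - sem) ** 2).mean()
        return recon + kl
```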
arXiv Detail & Related papers (2021-06-26T20:08:37Z) - Information Bottleneck Constrained Latent Bidirectional Embedding for Zero-Shot Learning [59.58381904522967]
We propose a novel embedding based generative model with a tight visual-semantic coupling constraint.
We learn a unified latent space that calibrates the embedded parametric distributions of both visual and semantic spaces.
Our method can be easily extended to transductive ZSL setting by generating labels for unseen images.
arXiv Detail & Related papers (2020-09-16T03:54:12Z) - Density-Aware Graph for Deep Semi-Supervised Visual Recognition [102.9484812869054]
Semi-supervised learning (SSL) has been extensively studied to improve the generalization ability of deep neural networks for visual recognition.
This paper proposes to solve the SSL problem by building a novel density-aware graph, based on which the neighborhood information can be easily leveraged.
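One hedged sketch of a density-aware affinity graph, assuming a Gaussian kernel scaled by a kNN-based local-density estimate (not the paper's exact formulation):

```python
import torch

def density_aware_affinity(feats, k=10, sigma=1.0):
    """Edge weights combine feature similarity with a local-density estimate
    (inverse mean distance to the k nearest neighbours), so labels propagate
    preferentially through dense regions."""
    dist = torch.cdist(feats, feats)                     # (N, N) pairwise distances
    knn_dist, _ = dist.topk(k + 1, largest=False)        # includes self at index 0
    density = 1.0 / (knn_dist[:, 1:].mean(dim=1) + 1e-8) # high = dense region
    affinity = torch.exp(-dist ** 2 / (2 * sigma ** 2))
    return affinity * density.unsqueeze(0)               # scale columns by density
```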
arXiv Detail & Related papers (2020-03-30T02:52:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.