Related papers: Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation

URL: http://arxiv.org/abs/2503.04151v2
Date: Thu, 24 Jul 2025 09:25:16 GMT
Title: Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation
Authors: Jie Xu, Na Zhao, Gang Niu, Masashi Sugiyama, Xiaofeng Zhu
Abstract summary: Real-world multi-view datasets are often heterogeneous and imperfect. We propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Our RML is self-supervised and can also be applied for downstream tasks as a regularization.
Score: 61.64052577026623
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, multi-view learning (MVL) has garnered significant attention due to its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually causes MVL methods designed for specific combinations of views to lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Specifically, we introduce a simple yet effective multi-view transformer fusion network where we transform heterogeneous multi-view data into homogeneous word embeddings, and then integrate multiple views by the sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a simulated perturbation based multi-view contrastive learning framework that dynamically generates the noise and unusable perturbations for simulating imperfect data conditions. The simulated noisy and unusable data obtain two distinct fused representations, and we utilize contrastive learning to align them for learning discriminative and robust representations. Our RML is self-supervised and can also be applied for downstream tasks as a regularization. In experiments, we employ it in multi-view unsupervised clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate RML's effectiveness. Code is available at https://github.com/SubmissionsIn/RML.
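The pipeline described in the abstract (fuse per-view embeddings with attention, simulate noisy and unusable views, then contrastively align the two fused representations) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names, the projection-free single-head attention over a sample's view tokens, the Gaussian noise level, the zeroing of one view, and the InfoNCE-style loss with temperature 0.5 are all assumptions made for illustration. The actual method is in the linked repository.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(tokens):
    """Fuse one sample's view tokens: self-attention (no learned projections,
    an assumption of this sketch) followed by mean pooling over views."""
    d = len(tokens[0])
    fused = [0.0] * d
    for query in tokens:
        weights = softmax([dot(query, key) / math.sqrt(d) for key in tokens])
        for w, value in zip(weights, tokens):
            for j in range(d):
                fused[j] += w * value[j] / len(tokens)
    return fused


def add_noise(views, rng, sigma=0.1):
    """Simulated noise perturbation: Gaussian noise on every view (hypothetical sigma)."""
    return [[x + rng.gauss(0.0, sigma) for x in v] for v in views]


def drop_view(views, rng):
    """Simulated unusable perturbation: zero out one randomly chosen view."""
    k = rng.randrange(len(views))
    return [[0.0] * len(v) if i == k else list(v) for i, v in enumerate(views)]


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against an all-zero vector
    return [x / norm for x in v]


def info_nce(za, zb, temperature=0.5):
    """InfoNCE-style loss: sample i in za should match sample i in zb."""
    za = [normalize(z) for z in za]
    zb = [normalize(z) for z in zb]
    loss = 0.0
    for i, a in enumerate(za):
        logits = [dot(a, b) / temperature for b in zb]
        loss -= math.log(softmax(logits)[i])
    return loss / len(za)


rng = random.Random(0)
# Toy batch: 5 samples, 3 views each, already embedded into 4-dim tokens.
batch = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)] for _ in range(5)]
z_noise = [fuse(add_noise(sample, rng)) for sample in batch]
z_drop = [fuse(drop_view(sample, rng)) for sample in batch]
loss = info_nce(z_noise, z_drop)
print(f"alignment loss: {loss:.4f}")
```

In a trainable version, the per-view encoders and attention projections would be learned by minimizing this loss, so that the fused representation stays stable whether a view is noisy or missing.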

Related papers

MACL: Multi-Label Adaptive Contrastive Learning Loss for Remote Sensing Image Retrieval [4.411658619208916]
Multi-Label Adaptive Contrastive Learning (MACL) is introduced as an extension of contrastive learning to address them. It integrates label-aware sampling, frequency-sensitive weighting, and dynamic-temperature scaling to achieve balanced representation learning across both common and rare categories.
arXiv Detail & Related papers (2025-12-18T08:29:27Z)
Hierarchical Identity Learning for Unsupervised Visible-Infrared Person Re-Identification [81.3063589622217]
Unsupervised visible-infrared person re-identification (USVI-ReID) aims to learn modality-invariant image features from unlabeled cross-modal person datasets.
arXiv Detail & Related papers (2025-09-15T05:10:43Z)
Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios [76.02688769599686]
We propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noisy identification as an anomaly identification problem using GMM. We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results.
arXiv Detail & Related papers (2025-05-27T16:16:54Z)
COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking [52.62149024881728]
We propose a contrastive one-stage transformer fusion framework for vision-language (VL) tracking. We introduce a contrastive alignment strategy that maximizes mutual information between a video and its corresponding language description. By leveraging a visual-linguistic transformer, we establish an efficient multi-modal fusion and reasoning mechanism.
arXiv Detail & Related papers (2025-04-02T03:12:38Z)
Incomplete Multi-view Clustering via Diffusion Contrastive Generation [10.303281347345955]
We propose a novel IMVC method called Diffusion Contrastive Generation (DCG). DCG learns the distribution characteristics to enhance clustering by applying forward diffusion and reverse denoising processes to intra-view data. It integrates instance-level and category-level interactive learning to exploit the consistent and complementary information available in multi-view data.
arXiv Detail & Related papers (2025-03-12T09:27:25Z)
Generalizable and Robust Spectral Method for Multi-view Representation Learning [9.393841121141076]
Multi-view representation learning (MvRL) has garnered substantial attention in recent years. Graph Laplacian-based MvRL methods have demonstrated remarkable success in representing multi-view data. We introduce SpecRaGE, a novel fusion-based framework that integrates the strengths of graph Laplacian methods with the power of deep learning.
arXiv Detail & Related papers (2024-11-04T14:51:35Z)
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks. Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs. In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback. First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF. Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z)
MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training [9.023648972811458]
RagVL is a novel framework with knowledge-enhanced reranking and noise-injected training. We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability. For generation, we inject visual noise during training at the data and token levels to enhance the generator's robustness.
arXiv Detail & Related papers (2024-07-31T08:43:17Z)
Hierarchical Mutual Information Analysis: Towards Multi-view Clustering in The Wild [9.380271109354474]
This work proposes a deep MVC framework where data recovery and alignment are fused in a hierarchically consistent way to maximize the mutual information among different views. To the best of our knowledge, this could be the first successful attempt to handle the missing and unaligned data problem separately with different learning paradigms.
arXiv Detail & Related papers (2023-10-28T06:43:57Z)
Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance. This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
Deep Incomplete Multi-view Clustering with Cross-view Partial Sample and Prototype Alignment [50.82982601256481]
We propose a Cross-view Partial Sample and Prototype Alignment Network (CPSPAN) for Deep Incomplete Multi-view Clustering. Unlike existing contrastive-based methods, we adopt pair-observed data alignment as 'proxy supervised signals' to guide instance-to-instance correspondence construction.
arXiv Detail & Related papers (2023-03-28T02:31:57Z)
A Clustering-guided Contrastive Fusion for Multi-view Representation Learning [7.630965478083513]
We propose a deep fusion network to fuse view-specific representations into the view-common representation. We also design an asymmetrical contrastive strategy that aligns the view-common representation and each view-specific representation. In the incomplete view scenario, our proposed method resists noise interference better than those of our competitors.
arXiv Detail & Related papers (2022-12-28T07:21:05Z)
MORI-RAN: Multi-view Robust Representation Learning via Hybrid Contrastive Fusion [4.36488705757229]
Multi-view representation learning is essential for many multi-view tasks, such as clustering and classification. We propose a hybrid contrastive fusion algorithm to extract robust view-common representation from unlabeled data. Experimental results demonstrated that the proposed method outperforms 12 competitive multi-view methods on four real-world datasets.
arXiv Detail & Related papers (2022-08-26T09:58:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.