CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
- URL: http://arxiv.org/abs/2602.19605v1
- Date: Mon, 23 Feb 2026 08:47:19 GMT
- Title: CLCR: Cross-Level Semantic Collaborative Representation for Multimodal Learning
- Authors: Chunlei Meng, Guanhong Huang, Rong Fu, Runmin Jian, Zhongxue Gan, Chun Ouyang,
- Abstract summary: Multimodal learning aims to capture both shared and private information from multiple modalities.<n>Existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data.<n>We propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy.
- Score: 10.210493389825116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning aims to capture both shared and private information from multiple modalities. However, existing methods that project all modalities into a single latent space for fusion often overlook the asynchronous, multi-level semantic structure of multimodal data. This oversight induces semantic misalignment and error propagation, thereby degrading representation quality. To address this issue, we propose Cross-Level Co-Representation (CLCR), which explicitly organizes each modality's features into a three-level semantic hierarchy and specifies level-wise constraints for cross-modal interactions. First, a semantic hierarchy encoder aligns shallow, mid, and deep features across modalities, establishing a common basis for interaction. And then, at each level, an Intra-Level Co-Exchange Domain (IntraCED) factorizes features into shared and private subspaces and restricts cross-modal attention to the shared subspace via a learnable token budget. This design ensures that only shared semantics are exchanged and prevents leakage from private channels. To integrate information across levels, the Inter-Level Co-Aggregation Domain (InterCAD) synchronizes semantic scales using learned anchors, selectively fuses the shared representations, and gates private cues to form a compact task representation. We further introduce regularization terms to enforce separation of shared and private features and to minimize cross-level interference. Experiments on six benchmarks spanning emotion recognition, event localization, sentiment analysis, and action recognition show that CLCR achieves strong performance and generalizes well across tasks.
Related papers
- FSCA-Net: Feature-Separated Cross-Attention Network for Robust Multi-Dataset Training [3.2658295979028753]
We propose a unified framework that disentangles feature representations into domain-invariant and domain-specific components.<n>A novel cross-attention fusion module adaptively models interactions between these components, ensuring effective knowledge transfer.<n>Experiments on multiple crowd counting benchmarks demonstrate that FSCA-Net effectively mitigates negative transfer and achieves state-of-the-art cross-dataset generalization.
arXiv Detail & Related papers (2026-02-02T02:18:48Z) - Multi-label Classification with Panoptic Context Aggregation Networks [61.82285737410154]
This paper introduces the Deep Panoptic Context Aggregation Network (PanCAN), a novel approach that hierarchically integrates multi-order geometric contexts.<n>PanCAN learns multi-order neighborhood relationships at each scale by combining random walks with an attention mechanism.<n>Experiments on NUS-WIDE, PASCAL VOC,2007, and MS-COCO benchmarks demonstrate that PanCAN consistently achieves competitive results.
arXiv Detail & Related papers (2025-12-29T14:16:21Z) - MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement [13.100620283631557]
We present MOSAIC, a representation-centric framework that rethinks multi-subject generation.<n>Our key insight is that multi-subject generation requires precise semantic alignment at the representation level.<n>We propose the semantic correspondence attention loss to enforce precise point-to-point semantic alignment.
arXiv Detail & Related papers (2025-09-02T05:40:07Z) - FocalClick-XL: Towards Unified and High-quality Interactive Segmentation [30.83143881909766]
This paper revisits the classical coarse-to-fine design of FocalClick.<n>Inspired by its multi-stage strategy, we propose a novel pipeline, FocalClick-XL.<n>It is capable of predicting alpha mattes with fine-grained details, making it a versatile and powerful tool for interactive segmentation.
arXiv Detail & Related papers (2025-06-17T16:21:32Z) - Distributional Vision-Language Alignment by Cauchy-Schwarz Divergence [83.15764564701706]
We propose a novel framework that performs vision-language alignment by integrating Cauchy-Schwarz divergence with mutual information.<n>We find that the CS divergence seamlessly addresses the InfoNCE's alignment-uniformity conflict and serves complementary roles with InfoNCE.<n> Experiments on text-to-image generation and cross-modality retrieval tasks demonstrate the effectiveness of our method on vision-language alignment.
arXiv Detail & Related papers (2025-02-24T10:29:15Z) - Generalizable Heterogeneous Federated Cross-Correlation and Instance
Similarity Learning [60.058083574671834]
This paper presents a novel FCCL+, federated correlation and similarity learning with non-target distillation.
For heterogeneous issue, we leverage irrelevant unlabeled public data for communication.
For catastrophic forgetting in local updating stage, FCCL+ introduces Federated Non Target Distillation.
arXiv Detail & Related papers (2023-09-28T09:32:27Z) - Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z) - Perceiving the Invisible: Proposal-Free Amodal Panoptic Segmentation [13.23676270963484]
Amodal panoptic segmentation aims to connect the perception of the world to its cognitive understanding.
We formulate a proposal-free framework that tackles this task as a multi-label and multi-class problem.
We propose the net architecture that incorporates a shared backbone and an asymmetrical dual-decoder.
arXiv Detail & Related papers (2022-05-29T12:05:07Z) - Beyond the Prototype: Divide-and-conquer Proxies for Few-shot
Segmentation [63.910211095033596]
Few-shot segmentation aims to segment unseen-class objects given only a handful of densely labeled samples.
We propose a simple yet versatile framework in the spirit of divide-and-conquer.
Our proposed approach, named divide-and-conquer proxies (DCP), allows for the development of appropriate and reliable information.
arXiv Detail & Related papers (2022-04-21T06:21:14Z) - Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.