Context-Based Multimodal Fusion
- URL: http://arxiv.org/abs/2403.04650v2
- Date: Fri, 8 Mar 2024 14:29:41 GMT
- Title: Context-Based Multimodal Fusion
- Authors: Bilal Faye, Hanane Azzag, Mustapha Lebbah, Djamel Bouchaffra
- Abstract summary: We propose an innovative model called Context-Based Multimodal Fusion (CBMF)
CBMF combines modality fusion and data distribution alignment.
CBMF enables the use of large pre-trained models that can be frozen.
- Score: 0.08192907805418585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fusion models, which effectively combine information from different
sources, are widely used to solve multimodal tasks. However, they have
significant limitations related to aligning data distributions across different
modalities. This challenge can lead to inconsistencies and difficulties in
learning robust representations. Alignment models, while specifically
addressing this issue, often require training "from scratch" with large
datasets to achieve optimal results, which can be costly in terms of resources
and time. To overcome these limitations, we propose an innovative model called
Context-Based Multimodal Fusion (CBMF), which combines both modality fusion and
data distribution alignment. In CBMF, each modality is represented by a
specific context vector, fused with the embedding of each modality. This
enables the use of large pre-trained models that can be frozen, reducing the
computational and training data requirements. Additionally, the network learns
to differentiate embeddings of different modalities through fusion with context
and aligns data distributions using a contrastive approach for self-supervised
learning. Thus, CBMF offers an effective and economical solution for solving
complex multimodal tasks.
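Below is a minimal PyTorch sketch of the mechanism described in the abstract: a frozen pre-trained encoder per modality, a learnable modality-specific context vector fused with the encoder's embedding, and a contrastive objective that aligns the resulting distributions. The fusion operator (concatenation followed by a linear projection), the symmetric InfoNCE loss, and all dimensions are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of the CBMF idea (not the authors' implementation).
# Assumptions: concatenation + linear projection as the fusion operator,
# and a symmetric InfoNCE objective as the contrastive alignment loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextFusion(nn.Module):
    """Fuses a frozen encoder's embedding with a learnable per-modality context vector."""

    def __init__(self, encoder: nn.Module, embed_dim: int, context_dim: int, out_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # the large pre-trained model stays frozen
        self.context = nn.Parameter(torch.randn(context_dim))    # modality-specific context vector
        self.proj = nn.Linear(embed_dim + context_dim, out_dim)  # fusion: concat + projection (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            z = self.encoder(x)                         # (batch, embed_dim), assumed encoder output shape
        ctx = self.context.expand(z.size(0), -1)        # broadcast the context vector over the batch
        fused = self.proj(torch.cat([z, ctx], dim=-1))  # fuse embedding and context
        return F.normalize(fused, dim=-1)


def contrastive_alignment_loss(za: torch.Tensor, zb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss that aligns the embedding distributions of two modalities."""
    logits = za @ zb.t() / temperature  # pairwise similarities between paired batches
    targets = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In training, paired samples from two modalities would be passed through their respective ContextFusion heads and the loss minimized; only the context vectors and projection layers receive gradients, which is what keeps the computational and training data requirements low.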
Related papers
- Learning Free Token Reduction for Multi-Modal Large Language Models [3.4026156483879517]
Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks.
However, their practical deployment is often constrained by high computational costs and prolonged inference times.
We propose a token compression paradigm that operates on both spatial and temporal dimensions.
arXiv Detail & Related papers (2025-01-29T02:52:32Z) - SpecRaGE: Robust and Generalizable Multi-view Spectral Representation Learning [9.393841121141076]
Multi-view representation learning (MvRL) has garnered substantial attention in recent years.
Graph Laplacian-based MvRL methods have demonstrated remarkable success in representing multi-view data.
We introduce SpecRaGE, a novel fusion-based framework that integrates the strengths of graph Laplacian methods with the power of deep learning.
arXiv Detail & Related papers (2024-11-04T14:51:35Z) - Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework [58.362064122489166]
This paper introduces the Cross-modal Few-Shot Learning task, which aims to recognize instances across multiple modalities while relying on scarce labeled data.
We propose a Generative Transfer Learning framework by simulating how humans abstract and generalize concepts.
We show that GTL achieves state-of-the-art performance on seven multi-modal datasets spanning RGB-Sketch, RGB-Infrared, and RGB-Depth.
arXiv Detail & Related papers (2024-10-14T16:09:38Z) - Towards Scalable and Versatile Weight Space Learning [51.78426981947659]
This paper introduces the SANE approach to weight-space learning.
Our method extends the idea of hyper-representations towards sequential processing of subsets of neural network weights.
arXiv Detail & Related papers (2024-06-14T13:12:07Z) - All in One Framework for Multimodal Re-identification in the Wild [58.380708329455466]
A multimodal learning paradigm for ReID is introduced, referred to as All-in-One (AIO).
AIO harnesses a frozen pre-trained big model as an encoder, enabling effective multimodal retrieval without additional fine-tuning.
Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts.
arXiv Detail & Related papers (2024-05-08T01:04:36Z) - Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z) - Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension [21.000045864213327]
Referring expression comprehension (REC) generally requires a large amount of multi-grained information from the visual and linguistic modalities to realize accurate reasoning.
How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task.
We propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability.
arXiv Detail & Related papers (2022-04-21T08:32:47Z) - Embed Everything: A Method for Efficiently Co-Embedding Multi-Modal Spaces [3.0795668932789524]
We propose a novel and cost-effective HTL strategy for co-embedding multi-modal spaces.
Our method avoids cost inefficiencies by preprocessing embeddings using pretrained models for all components.
We demonstrate the use of this system in a joint image-audio embedding task.
arXiv Detail & Related papers (2021-10-09T15:39:27Z) - MPRNet: Multi-Path Residual Network for Lightweight Image Super Resolution [2.3576437999036473]
A novel lightweight super-resolution network is proposed, which improves on the state-of-the-art performance in lightweight SR.
The proposed architecture also contains a new attention mechanism, Two-Fold Attention Module, to maximize the representation ability of the model.
arXiv Detail & Related papers (2020-11-09T17:11:15Z) - Learning Enriched Features for Real Image Restoration and Enhancement [166.17296369600774]
Convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for the image restoration task.
We present a novel architecture with the goal of maintaining spatially precise, high-resolution representations throughout the entire network.
Our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details.
arXiv Detail & Related papers (2020-03-15T11:04:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.