Deep Multi-Modal Sets
- URL: http://arxiv.org/abs/2003.01607v1
- Date: Tue, 3 Mar 2020 15:48:44 GMT
- Title: Deep Multi-Modal Sets
- Authors: Austin Reiter, Menglin Jia, Pu Yang, Ser-Nam Lim
- Abstract summary: Deep Multi-Modal Sets is a technique that represents a collection of features as an unordered set rather than one long ever-growing fixed-size vector.
We demonstrate a scalable, multi-modal framework that reasons over different modalities to learn various types of tasks.
- Score: 29.983311598563542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many vision-related tasks benefit from reasoning over multiple modalities to
leverage complementary views of data in an attempt to learn robust embedding
spaces. Most deep learning-based methods rely on a late fusion technique
whereby multiple feature types are encoded and concatenated and then a
multi-layer perceptron (MLP) combines the fused embedding to make predictions. This
has several limitations, such as an unnatural enforcement that all features be
present at all times as well as constraining each feature modality to a fixed
number of occurrences at any given time. Furthermore, as more
modalities are added, the concatenated embedding grows. To mitigate this, we
propose Deep Multi-Modal Sets: a technique that represents a collection of
features as an unordered set rather than one long ever-growing fixed-size
vector. The set is constructed so that we have invariance both to permutations
of the feature modalities as well as to the cardinality of the set. We will
also show that with particular choices in our model architecture, we can yield
interpretable feature performance such that during inference time we can
observe which modalities are most contributing to the prediction. With this in
mind, we demonstrate a scalable, multi-modal framework that reasons over
different modalities to learn various types of tasks. We demonstrate new
state-of-the-art performance on two multi-modal datasets (Ads-Parallelity [34]
and MM-IMDb [1]).
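The abstract describes the fusion mechanism only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation: each modality is encoded into a shared embedding space, the embeddings are combined with a permutation-invariant set pooling, and an MLP predicts from the pooled vector. Max-pooling is assumed here because its per-dimension argmax gives one way to observe which modalities contribute, in the spirit of the interpretability claim; all class names, variable names, and dimensions are illustrative.
```python
# Minimal sketch (not the authors' code) of set-based late fusion:
# per-modality encoders -> permutation-invariant pooling -> MLP head.
import torch
import torch.nn as nn


class MultiModalSetClassifier(nn.Module):
    def __init__(self, modality_dims: dict, embed_dim: int = 256, num_classes: int = 23):
        super().__init__()
        # One encoder per modality, each projecting raw features to a shared embedding space.
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, embed_dim), nn.ReLU())
            for name, dim in modality_dims.items()
        })
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, num_classes)
        )

    def forward(self, features: dict):
        # `features` maps modality name -> tensor of shape (batch, dim).
        # Any subset of modalities may be supplied, and their order does not matter.
        embedded = [self.encoders[name](x) for name, x in features.items()]
        stacked = torch.stack(embedded, dim=1)   # (batch, n_present_modalities, embed_dim)
        pooled, argmax = stacked.max(dim=1)      # permutation-invariant set pooling (assumed: max)
        # `argmax` records, per embedding dimension, which modality supplied the max value,
        # one simple way to inspect modality contributions at inference time.
        return self.classifier(pooled), argmax


# Usage: three modalities with different raw feature sizes; a batch where only
# two of them are available still works because pooling handles any set size.
model = MultiModalSetClassifier({"image": 2048, "text": 768, "audio": 128})
batch = {"image": torch.randn(4, 2048), "text": torch.randn(4, 768)}
logits, contributions = model(batch)
```
Because the pooling operates over whatever embeddings are present, the fused representation stays the same size as modalities are added or dropped, which mirrors the motivation in the abstract.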
Related papers
- SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing.
It is designed to accurately detect horizontal or oriented objects from any sensor modality.
This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z)
- MambaPro: Multi-Modal Object Re-Identification with Mamba Aggregation and Synergistic Prompt [60.10555128510744]
Multi-modal object Re-IDentification (ReID) aims to retrieve specific objects by utilizing complementary image information from different modalities.
Recently, large-scale pre-trained models like CLIP have demonstrated impressive performance in traditional single-modal object ReID tasks.
We introduce a novel framework called MambaPro for multi-modal object ReID.
arXiv Detail & Related papers (2024-12-14T06:33:53Z)
- DeMo: Decoupled Feature-Based Mixture of Experts for Multi-Modal Object Re-Identification [25.781336502845395]
Multi-modal object ReIDentification aims to retrieve specific objects by combining complementary information from multiple modalities.
We propose a novel feature learning framework called DeMo for multi-modal object ReID, which adaptively balances decoupled features using a mixture of experts.
arXiv Detail & Related papers (2024-12-14T02:36:56Z)
- Multimodal Difference Learning for Sequential Recommendation [5.243083216855681]
We argue that user interests and item relationships vary across different modalities.
We propose a novel Multimodal Difference Learning framework for Sequential Recommendation, MDSRec.
Results on five real-world datasets demonstrate the superiority of MDSRec over state-of-the-art baselines.
arXiv Detail & Related papers (2024-12-11T05:08:19Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$^2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as little as 50K multimodal examples, M$^2$IXT can significantly boost few-shot ICL performance.
arXiv Detail & Related papers (2023-10-08T10:47:24Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.