Deep Multi-Modal Sets
- URL: http://arxiv.org/abs/2003.01607v1
- Date: Tue, 3 Mar 2020 15:48:44 GMT
- Title: Deep Multi-Modal Sets
- Authors: Austin Reiter, Menglin Jia, Pu Yang, Ser-Nam Lim
- Abstract summary: Deep Multi-Modal Sets is a technique that represents a collection of features as an unordered set rather than one long ever-growing fixed-size vector.
We demonstrate a scalable, multi-modal framework that reasons over different modalities to learn various types of tasks.
- Score: 29.983311598563542
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many vision-related tasks benefit from reasoning over multiple modalities to
leverage complementary views of data in an attempt to learn robust embedding
spaces. Most deep learning-based methods rely on a late fusion technique
whereby multiple feature types are encoded and concatenated and then a
multi-layer perceptron (MLP) combines the fused embedding to make predictions. This
has several limitations, such as an unnatural enforcement that all features be
present at all times as well as constraining only a constant number of
occurrences of a feature modality at any given time. Furthermore, as more
modalities are added, the concatenated embedding grows. To mitigate this, we
propose Deep Multi-Modal Sets: a technique that represents a collection of
features as an unordered set rather than one long ever-growing fixed-size
vector. The set is constructed so that we have invariance both to permutations
of the feature modalities as well as to the cardinality of the set. We will
also show that with particular choices in our model architecture, we can yield
interpretable feature performance such that during inference time we can
observe which modalities contribute most to the prediction. With this in
mind, we demonstrate a scalable, multi-modal framework that reasons over
different modalities to learn various types of tasks. We demonstrate new
state-of-the-art performance on two multi-modal datasets (Ads-Parallelity [34]
and MM-IMDb [1]).
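To make the set-based fusion concrete, a minimal sketch follows. It assumes PyTorch and is only an illustration of the idea in the abstract, not the authors' implementation: each available modality is projected into a shared embedding space, the resulting set of embeddings is reduced with a permutation-invariant max-pool (so both the order and the number of modalities can vary), and an MLP maps the pooled vector to a prediction. With max pooling, the argmax indices show which modality supplied each pooled dimension, which is the sort of modality attribution the abstract describes.

```python
# Minimal sketch of set-based multi-modal fusion in the spirit of the abstract
# (assumes PyTorch; an illustration, not the authors' released code).
import torch
import torch.nn as nn


class MultiModalSet(nn.Module):
    def __init__(self, modality_dims: dict, embed_dim: int, num_classes: int):
        super().__init__()
        # One projection per modality into a shared embedding space.
        self.projections = nn.ModuleDict(
            {name: nn.Linear(dim, embed_dim) for name, dim in modality_dims.items()}
        )
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, num_classes)
        )

    def forward(self, features: dict):
        # `features` maps modality name -> (batch, dim) tensor; any subset of the
        # modalities, in any order, may be present.
        embedded = [self.projections[name](x) for name, x in features.items()]
        stacked = torch.stack(embedded, dim=1)   # (batch, num_present, embed_dim)
        pooled, winners = stacked.max(dim=1)     # permutation-invariant reduction
        # `winners` records which set element supplied each pooled dimension,
        # giving a simple modality-attribution signal at inference time.
        return self.classifier(pooled), winners


# Usage: only the modalities available for this batch are passed in.
model = MultiModalSet({"image": 2048, "text": 768, "audio": 512},
                      embed_dim=256, num_classes=10)
batch = {"image": torch.randn(4, 2048), "text": torch.randn(4, 768)}
logits, winners = model(batch)
```

Mean or sum pooling would serve equally well as the permutation-invariant reduction; max pooling is shown here because its argmax provides a direct per-dimension attribution signal.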
Related papers
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantics.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Lightweight In-Context Tuning for Multimodal Unified Models [57.10831399642176]
MultiModal In-conteXt Tuning (M$2$IXT) is a lightweight module to enhance the ICL capabilities of multimodal unified models.
When tuned on as little as 50K multimodal data, M$2$IXT can boost the few-shot ICL performance significantly.
arXiv Detail & Related papers (2023-10-08T10:47:24Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z)
- Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
First, we verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
Secondly, we propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively.
arXiv Detail & Related papers (2021-08-11T03:42:13Z)
- Representing Unordered Data Using Complex-Weighted Multiset Automata [23.68657135308002]
We show how the multiset representations of certain existing neural architectures can be viewed as special cases of ours.
Namely, we provide a new theoretical and intuitive justification for the Transformer model's representation of positions using sinusoidal functions.
We extend the DeepSets model to use complex numbers, enabling it to outperform the existing model on an extension of one of their tasks (a DeepSets-style pooling sketch follows this list).
arXiv Detail & Related papers (2020-01-02T20:04:45Z)
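The last related entry builds on DeepSets-style pooling, which is closely related to the set representation used in Deep Multi-Modal Sets. As a reference point, here is a minimal, self-contained sketch of the standard DeepSets formulation f(X) = rho(sum over x in X of phi(x)); it assumes PyTorch and does not reproduce the cited paper's complex-weighted extension.

```python
# Minimal DeepSets-style sum pooling: f(X) = rho(sum_x phi(x)).
# Illustration only (assumes PyTorch); the cited related paper replaces this
# real-valued sum with complex-weighted multiset representations.
import torch
import torch.nn as nn


class DeepSets(nn.Module):
    def __init__(self, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, set_size, in_dim); summing over the set dimension makes
        # the output invariant to the ordering of the set elements.
        return self.rho(self.phi(x).sum(dim=1))


# Usage: the output is unchanged under any permutation of the set elements.
pool = DeepSets(in_dim=16, hidden=64, out_dim=8)
x = torch.randn(2, 5, 16)
perm = x[:, torch.randperm(5)]
assert torch.allclose(pool(x), pool(perm), atol=1e-5)
```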
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.