Cross-modal Learning for Multi-modal Video Categorization
- URL: http://arxiv.org/abs/2003.03501v3
- Date: Sat, 6 Jun 2020 00:36:52 GMT
- Title: Cross-modal Learning for Multi-modal Video Categorization
- Authors: Palash Goyal, Saurabh Sahu, Shalini Ghosh, Chul Lee
- Abstract summary: Multi-modal machine learning (ML) models can process data in multiple modalities.
In this paper, we focus on the problem of video categorization using a multi-modal ML technique.
We show how our proposed multi-modal video categorization models with cross-modal learning out-perform strong state-of-the-art baseline models.
- Score: 24.61762520189921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal machine learning (ML) models can process data in multiple
modalities (e.g., video, audio, text) and are useful for video content analysis
in a variety of problems (e.g., object detection, scene understanding, activity
recognition). In this paper, we focus on the problem of video categorization
using a multi-modal ML technique. In particular, we have developed a novel
multi-modal ML approach that we call "cross-modal learning", where one modality
influences another but only when there is correlation between the modalities --
for that, we first train a correlation tower that guides the main multi-modal
video categorization tower in the model. We show how this cross-modal principle
can be applied to different types of models (e.g., RNN, Transformer, NetVLAD),
and demonstrate through experiments how our proposed multi-modal video
categorization models with cross-modal learning out-perform strong
state-of-the-art baseline models.
Related papers
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature
Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z) - Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z) - Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA)
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z) - On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z) - MultiViz: An Analysis Benchmark for Visualizing and Understanding
Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z) - Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment
Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z) - Relating by Contrasting: A Data-efficient Framework for Multimodal
Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z) - Exploiting Temporal Coherence for Multi-modal Video Categorization [24.61762520189921]
In this paper, we focus on the problem of video categorization by using a multimodal approach.
We have developed a novel temporal coherence-based regularization approach, which applies to different types of models.
We demonstrate through experiments how our proposed multimodal video categorization models with temporal coherence out-perform strong state-of-the-art baseline models.
arXiv Detail & Related papers (2020-02-07T06:42:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.