Related papers: Cross-modal Learning for Multi-modal Video Categorization

URL: http://arxiv.org/abs/2003.03501v3
Date: Sat, 6 Jun 2020 00:36:52 GMT
Title: Cross-modal Learning for Multi-modal Video Categorization
Authors: Palash Goyal, Saurabh Sahu, Shalini Ghosh, Chul Lee
Abstract summary: Multi-modal machine learning (ML) models can process data in multiple modalities. In this paper, we focus on the problem of video categorization using a multi-modal ML technique. We show how our proposed multi-modal video categorization models with cross-modal learning out-perform strong state-of-the-art baseline models.
Score: 24.61762520189921
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal machine learning (ML) models can process data in multiple modalities (e.g., video, audio, text) and are useful for video content analysis in a variety of problems (e.g., object detection, scene understanding, activity recognition). In this paper, we focus on the problem of video categorization using a multi-modal ML technique. In particular, we have developed a novel multi-modal ML approach that we call "cross-modal learning", where one modality influences another but only when there is correlation between the modalities -- for that, we first train a correlation tower that guides the main multi-modal video categorization tower in the model. We show how this cross-modal principle can be applied to different types of models (e.g., RNN, Transformer, NetVLAD), and demonstrate through experiments how our proposed multi-modal video categorization models with cross-modal learning out-perform strong state-of-the-art baseline models.
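The abstract's central idea, that one modality should influence another only when the two are correlated, can be illustrated with a small sketch. This is not the authors' model: the cosine-similarity score stands in for the trained correlation tower, and the names `correlation_score`, `gated_fusion`, and the `threshold` parameter are hypothetical choices made for illustration only.

```python
import numpy as np


def correlation_score(video_emb, text_emb):
    """Stand-in for a learned correlation tower (assumption, not the paper's).

    Maps the cosine similarity of two embeddings from [-1, 1] to [0, 1].
    """
    denom = np.linalg.norm(video_emb) * np.linalg.norm(text_emb)
    if denom == 0.0:
        return 0.0
    cos = float(np.dot(video_emb, text_emb) / denom)
    return 0.5 * (cos + 1.0)


def gated_fusion(video_emb, text_emb, threshold=0.5):
    """Let the text embedding influence the video embedding only when
    the two are sufficiently correlated (hypothetical gating rule)."""
    score = correlation_score(video_emb, text_emb)
    gate = score if score >= threshold else 0.0
    return video_emb + gate * text_emb


video = np.array([1.0, 0.0])
aligned_text = np.array([1.0, 0.0])     # perfectly correlated with video
opposed_text = np.array([-1.0, 0.0])    # anti-correlated with video

fused_aligned = gated_fusion(video, aligned_text)
fused_opposed = gated_fusion(video, opposed_text)
```

In this toy setup the aligned text shifts the video representation, while the anti-correlated text is gated out and leaves it unchanged. A trained correlation tower would replace the fixed cosine rule with a learned score.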

Related papers

MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning [12.821814562210632]
This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach. We propose a new Invertible Cross-Attention layer to develop the Normalizing Flow-based Model for multimodal data. We also introduce three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA).
arXiv Detail & Related papers (2025-08-13T18:56:57Z)
MultiSHAP: A Shapley-Based Framework for Explaining Cross-Modal Interactions in Multimodal AI Models [5.011371514152517]
Multimodal AI models have achieved impressive performance in tasks that require integrating information from multiple modalities, such as vision and language. How to explain cross-modal interactions in multimodal AI models remains a major challenge.
arXiv Detail & Related papers (2025-08-01T12:19:18Z)
Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs. Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond [48.43910061720815]
Multi-modal generative AI has received increasing attention in both academia and industry. One natural question arises: Is it possible to have a unified model for both understanding and generation?
arXiv Detail & Related papers (2024-09-23T13:16:09Z)
Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment. Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning. MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process. It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities. Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning. We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z)
Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions. We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z)
MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages. We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)
Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis. Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data. Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
Exploiting Temporal Coherence for Multi-modal Video Categorization [24.61762520189921]
In this paper, we focus on the problem of video categorization by using a multimodal approach. We have developed a novel temporal coherence-based regularization approach, which applies to different types of models. We demonstrate through experiments how our proposed multimodal video categorization models with temporal coherence out-perform strong state-of-the-art baseline models.
arXiv Detail & Related papers (2020-02-07T06:42:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.