Multimodal Fusion with Pre-Trained Model Features in Affective Behaviour Analysis In-the-wild
- URL: http://arxiv.org/abs/2403.15044v1
- Date: Fri, 22 Mar 2024 09:00:24 GMT
- Title: Multimodal Fusion with Pre-Trained Model Features in Affective Behaviour Analysis In-the-wild
- Authors: Zhuofan Wen, Fengyu Zhang, Siyuan Zhang, Haiyang Sun, Mingyu Xu, Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao,
- Abstract summary: We present an approach for addressing the task of Expression (Expr) Recognition and Valence-Arousal (VA) Estimation.
We evaluate pre-trained models on the Aff-Wild2 database, then extract the models' final hidden layers as features.
Following preprocessing and interpolation or convolution to align the extracted features, different models are employed for modal fusion.
- Score: 37.32217405723552
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal fusion is a significant method for most multimodal tasks. With the recent surge in the number of large pre-trained models, combining both multimodal fusion methods and pre-trained model features can achieve outstanding performance in many multimodal tasks. In this paper, we present our approach, which leverages both advantages for addressing the task of Expression (Expr) Recognition and Valence-Arousal (VA) Estimation. We evaluate the Aff-Wild2 database using pre-trained models, then extract the final hidden layers of the models as features. Following preprocessing and interpolation or convolution to align the extracted features, different models are employed for modal fusion. Our code is available at GitHub - FulgenceWen/ABAW6th.
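The abstract describes the pipeline only at a high level: extract final-hidden-layer features from pre-trained models, align them across modalities by interpolation or convolution, and fuse them with a downstream model. The snippet below is a minimal PyTorch sketch of that idea, not the authors' implementation (see the linked repository); the modality names, feature dimensions, and the simple MLP fusion head are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): align frame-level features from two
# pre-trained extractors to a common temporal length, then fuse them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignAndFuse(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=768, hidden=256, num_classes=8):
        super().__init__()
        # Per-modality 1-D convolutions over time project features to a shared size.
        self.audio_conv = nn.Conv1d(audio_dim, hidden, kernel_size=3, padding=1)
        self.visual_conv = nn.Conv1d(visual_dim, hidden, kernel_size=3, padding=1)
        # A small MLP stands in for whatever fusion model is used downstream.
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T_a, audio_dim); visual_feats: (B, T_v, visual_dim)
        target_len = visual_feats.size(1)  # align everything to the visual frame rate
        a = self.audio_conv(audio_feats.transpose(1, 2))    # (B, hidden, T_a)
        v = self.visual_conv(visual_feats.transpose(1, 2))  # (B, hidden, T_v)
        # Interpolate the audio stream along time so both modalities share T steps.
        a = F.interpolate(a, size=target_len, mode="linear", align_corners=False)
        fused = torch.cat([a, v], dim=1).transpose(1, 2)    # (B, T, 2 * hidden)
        return self.fusion(fused)                           # per-frame logits

# Random stand-ins for extracted final-hidden-layer features.
model = AlignAndFuse()
audio = torch.randn(2, 500, 1024)   # e.g. speech-model features
visual = torch.randn(2, 300, 768)   # e.g. per-frame vision-transformer features
logits = model(audio, visual)       # (2, 300, 8)
```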
Related papers
- QARM: Quantitative Alignment Multi-Modal Recommendation at Kuaishou [23.818456863262494]
Inspired by the two challenges in downstream task usage, we introduce a quantitative multi-modal framework to customize specialized and trainable multi-modal information for different downstream models.
arXiv Detail & Related papers (2024-11-18T17:08:35Z) - Learning Multimodal Latent Generative Models with Energy-Based Prior [3.6648642834198797]
We propose a novel framework that integrates the latent generative model with the EBM.
This approach results in a more expressive and informative prior that better captures information across multiple modalities.
arXiv Detail & Related papers (2024-09-30T01:38:26Z) - Knowledge Fusion By Evolving Weights of Language Models [5.354527640064584]
This paper examines the approach of integrating multiple models into a unified model.
We propose a knowledge fusion method named Evolver, inspired by evolutionary algorithms.
arXiv Detail & Related papers (2024-06-18T02:12:34Z) - FusionBench: A Comprehensive Benchmark of Deep Model Fusion [78.80920533793595]
Deep model fusion is a technique that unifies the predictions or parameters of several deep neural networks into a single model.
FusionBench is the first comprehensive benchmark dedicated to deep model fusion.
arXiv Detail & Related papers (2024-06-05T13:54:28Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M, an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models [51.5543321122664]
This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
arXiv Detail & Related papers (2023-10-08T15:01:54Z) - UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z) - An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
arXiv Detail & Related papers (2023-04-28T15:43:21Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We instead direct effort toward efficient adaptation of existing models and propose augmenting Language Models with perception.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Multimodal E-Commerce Product Classification Using Hierarchical Fusion [0.0]
We experimented with multiple fusion techniques and found that the best way to combine the individual embeddings of the unimodal networks is a combination of concatenation and averaging of the feature vectors (sketched below).
The proposed method significantly outperformed the unimodal models and the reported performance of similar models on our specific task.
arXiv Detail & Related papers (2022-07-07T14:04:42Z)
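The last entry above reports that combining concatenation with averaging of the unimodal feature vectors worked best. One plausible reading of that combination, as a hedged PyTorch sketch (the equal-dimensional embeddings, layer sizes, and class count are assumptions, not the paper's implementation):

```python
# Illustrative sketch: fuse unimodal embeddings by concatenating them together
# with their element-wise average, then classify the fused vector.
import torch
import torch.nn as nn

class ConcatAverageFusion(nn.Module):
    def __init__(self, dim=512, num_classes=27):
        super().__init__()
        # Input is [text ; image ; average], i.e. 3 * dim features per example.
        self.classifier = nn.Linear(3 * dim, num_classes)

    def forward(self, text_emb, image_emb):
        # text_emb, image_emb: (B, dim) embeddings from the unimodal networks.
        avg = 0.5 * (text_emb + image_emb)                   # element-wise average
        fused = torch.cat([text_emb, image_emb, avg], dim=-1)
        return self.classifier(fused)

fusion = ConcatAverageFusion()
logits = fusion(torch.randn(4, 512), torch.randn(4, 512))    # (4, 27)
```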
This list is automatically generated from the titles and abstracts of the papers in this site.