Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment
- URL: http://arxiv.org/abs/2504.11515v1
- Date: Tue, 15 Apr 2025 14:26:12 GMT
- Title: Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment
- Authors: Kangsheng Wang, Chengwei Ye, Huanzhen Zhang, Linuo Xu, Shuyan Liu
- Abstract summary: Automatically predicting personality traits has become a challenging problem in computer vision. This paper introduces an innovative multimodal feature learning framework for personality analysis in short video clips.
- Score: 0.39945675027960637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically predicting personality traits has become a challenging problem in computer vision. This paper introduces an innovative multimodal feature learning framework for personality analysis in short video clips. For visual processing, we construct a facial graph and design a Geo-based two-stream network with an attention mechanism, leveraging both Graph Convolutional Networks (GCNs) and Convolutional Neural Networks (CNNs) to capture static facial expressions. In addition, ResNet18 and VGGFace networks extract global scene and facial appearance features at the frame level. To capture dynamic temporal information, we integrate a BiGRU with a temporal attention module that extracts salient frame representations. To enhance the model's robustness, we incorporate the VGGish CNN for audio features and XLM-RoBERTa for text features. Finally, a multimodal channel attention mechanism integrates the different modalities, and a Multi-Layer Perceptron (MLP) regression model predicts the personality traits. Experimental results confirm that the proposed framework outperforms existing state-of-the-art approaches.
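As a rough illustration of the pipeline the abstract describes, the following is a minimal PyTorch sketch of its final stages: a BiGRU with temporal attention pools per-frame visual features into a clip-level vector, a channel-attention block reweights the concatenated visual, audio, and text features, and an MLP regresses the five traits. All layer sizes, module names, and the sigmoid output scaling are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch: layer sizes and module names are assumptions,
# not the paper's released implementation.
import torch
import torch.nn as nn

class TemporalAttentionBiGRU(nn.Module):
    """BiGRU over per-frame features; attention pools salient frames."""
    def __init__(self, in_dim=512, hidden=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)        # per-frame attention logit

    def forward(self, frames):                       # frames: (B, T, in_dim)
        h, _ = self.gru(frames)                      # (B, T, 2*hidden)
        w = torch.softmax(self.score(h), dim=1)      # (B, T, 1) frame weights
        return (w * h).sum(dim=1)                    # attention-pooled clip vector

class ChannelAttentionFusion(nn.Module):
    """Gates the concatenated modality vector channel-wise (squeeze-excite style)."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x):                            # x: (B, dim)
        return x * self.gate(x)                      # reweighted channels

class PersonalityRegressor(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, txt_dim=768, traits=5):
        super().__init__()
        self.temporal = TemporalAttentionBiGRU(in_dim=vis_dim)
        fused = 2 * 256 + aud_dim + txt_dim          # BiGRU output + audio + text
        self.fusion = ChannelAttentionFusion(fused)
        self.mlp = nn.Sequential(
            nn.Linear(fused, 256), nn.ReLU(), nn.Linear(256, traits))

    def forward(self, frame_feats, audio_feat, text_feat):
        clip = self.temporal(frame_feats)            # (B, 2*256)
        x = torch.cat([clip, audio_feat, text_feat], dim=-1)
        return torch.sigmoid(self.mlp(self.fusion(x)))  # traits in [0, 1]

# Example with random stand-in features for a batch of 4 clips of 32 frames.
model = PersonalityRegressor()
traits = model(torch.randn(4, 32, 512), torch.randn(4, 128), torch.randn(4, 768))
```

In the paper's setting, `frame_feats` would come from the facial-graph two-stream, ResNet18, and VGGFace branches, `audio_feat` from VGGish (128-d embeddings), and `text_feat` from XLM-RoBERTa (768-d for the base model).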
Related papers
- Interactive Multimodal Fusion with Temporal Modeling [11.506800500772734]
Our approach integrates visual and audio information through a multimodal framework. The visual branch uses a pre-trained ResNet model to extract features from facial images, while the audio branch employs pre-trained VGG models to extract VGGish and LogMel features from speech signals (a minimal LogMel sketch appears after this list). Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for in-the-wild VA estimation.
arXiv Detail & Related papers (2025-03-13T16:31:56Z)
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the representational power of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- Masked Contrastive Graph Representation Learning for Age Estimation [44.96502862249276]
This paper exploits the strength of graph representation learning in handling redundant image information.
We propose a novel Masked Contrastive Graph Representation Learning (MCGRL) method for age estimation.
Experimental results on real-world face image datasets demonstrate the superiority of our proposed method over other state-of-the-art age estimation approaches.
arXiv Detail & Related papers (2023-06-16T15:53:21Z)
- MHSCNet: A Multimodal Hierarchical Shot-aware Convolutional Network for Video Summarization [61.69587867308656]
We propose a multimodal hierarchical shot-aware convolutional network, denoted as MHSCNet, to enhance the frame-wise representation.
Based on the learned shot-aware representations, MHSCNet can predict the frame-level importance score in the local and global view of the video.
arXiv Detail & Related papers (2022-04-18T14:53:33Z)
- Spectral Transform Forms Scalable Transformer [1.19071399645846]
Drawing on the philosophy of self-attention, this work proposes an efficient spectral-based neural unit that captures informative long-range temporal interactions.
The developed spectral window unit (SW) model predicts dynamic graphs at scale with assured efficiency.
arXiv Detail & Related papers (2021-11-15T08:46:01Z)
- Wide and Narrow: Video Prediction from Context and Motion [54.21624227408727]
We propose a new framework to integrate these complementary attributes to predict complex pixel dynamics through deep networks.
We present global context propagation networks that aggregate the non-local neighboring representations to preserve the contextual information over the past frames.
We also devise local filter memory networks that generate adaptive filter kernels by storing the motion of moving objects in the memory.
arXiv Detail & Related papers (2021-10-22T04:35:58Z)
- A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve structured analysis of advertising video content.
Our solution achieved a score of 0.2470 measured in consideration of localization and prediction accuracy, ranking fourth in the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z)
- The Mind's Eye: Visualizing Class-Agnostic Features of CNNs [92.39082696657874]
We propose an approach to visually interpret CNN features given a set of images by creating corresponding images that depict the most informative features of a specific layer.
Our method uses a dual-objective activation and distance loss, without requiring a generator network or modifications to the original model.
arXiv Detail & Related papers (2021-01-29T07:46:39Z)
- Probabilistic Graph Attention Network with Conditional Kernels for Pixel-Wise Prediction [158.88345945211185]
We present a novel approach that advances the state of the art in pixel-level prediction in a fundamental aspect, i.e., structured multi-scale feature learning and fusion.
We propose a probabilistic graph attention network structure based on a novel Attention-Gated Conditional Random Fields (AG-CRFs) model for learning and fusing multi-scale representations in a principled manner.
arXiv Detail & Related papers (2021-01-08T04:14:29Z)
- Video-based Facial Expression Recognition using Graph Convolutional Networks [57.980827038988735]
We introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based facial expression recognition (a minimal GCN-layer sketch appears after this list).
We evaluate our method on three widely used datasets, CK+, Oulu-CASIA, and MMI, as well as the challenging in-the-wild dataset AFEW8.0.
arXiv Detail & Related papers (2020-10-26T07:31:51Z)
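Both the framework above and the last related paper rest on graph convolutions over a facial graph. Below is a minimal sketch of a single GCN layer with symmetric normalization over a fixed landmark adjacency; the 68-point layout, feature sizes, and placeholder graph are illustrative assumptions, not taken from either paper.

```python
# Illustrative GCN layer; the landmark layout and sizes are assumptions.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (B, N, in_dim) landmark features; adj: (N, N) adjacency with self-loops
        deg = adj.sum(dim=-1)                             # node degrees
        d_inv_sqrt = deg.clamp(min=1e-6).pow(-0.5)
        norm_adj = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
        return torch.relu(self.lin(norm_adj @ x))         # D^-1/2 A D^-1/2 X W

# Example: 68 facial landmarks with 2-D coordinates as initial node features.
adj = torch.eye(68)                  # placeholder adjacency (self-loops only)
out = GCNLayer(2, 64)(torch.randn(4, 68, 2), adj)         # -> (4, 68, 64)
```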
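The first related paper's audio branch builds on LogMel features (as does VGGish internally). Here is a minimal sketch of such a front end using librosa; the 16 kHz sample rate, 25 ms/10 ms framing, and 64 mel bands are assumed to match common VGGish-style settings rather than either paper's exact configuration.

```python
# Settings follow common VGGish-style defaults; not either paper's exact config.
import librosa
import numpy as np

def logmel(path, sr=16000, n_mels=64):
    y, sr = librosa.load(path, sr=sr)                     # resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)           # (n_mels, frames)
```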
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.