GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation
- URL: http://arxiv.org/abs/2505.03846v2
- Date: Sat, 31 May 2025 09:08:51 GMT
- Title: GAME: Learning Multimodal Interactions via Graph Structures for Personality Trait Estimation
- Authors: Kangsheng Wang, Yuhang Li, Chengwei Ye, Yufei Lin, Huanzhen Zhang, Bohan Hu, Linuo Xu, Shuyan Liu
- Abstract summary: Apparent personality analysis from short videos poses significant challenges due to the complex interplay of visual, auditory, and textual cues. In this paper, we propose GAME, a Graph-Augmented Multimodal Encoder designed to robustly model and fuse multi-source features for automatic personality prediction. For the visual stream, we construct a facial graph and introduce a dual-branch Geo Two-Stream Network, which combines Graph Convolutional Networks (GCNs) and Convolutional Neural Networks (CNNs). To capture temporal dynamics, frame-level features are processed by a BiGRU enhanced with temporal attention modules.
- Score: 13.071227081328288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Apparent personality analysis from short videos poses significant challenges due to the complex interplay of visual, auditory, and textual cues. In this paper, we propose GAME, a Graph-Augmented Multimodal Encoder designed to robustly model and fuse multi-source features for automatic personality prediction. For the visual stream, we construct a facial graph and introduce a dual-branch Geo Two-Stream Network, which combines Graph Convolutional Networks (GCNs) and Convolutional Neural Networks (CNNs) with attention mechanisms to capture both structural and appearance-based facial cues. Complementing this, global context and identity features are extracted using pretrained ResNet18 and VGGFace backbones. To capture temporal dynamics, frame-level features are processed by a BiGRU enhanced with temporal attention modules. Meanwhile, audio representations are derived from the VGGish network, and linguistic semantics are captured via the XLM-Roberta transformer. To achieve effective multimodal integration, we propose a Channel Attention-based Fusion module, followed by a Multi-Layer Perceptron (MLP) regression head for predicting personality traits. Extensive experiments show that GAME consistently outperforms existing methods across multiple benchmarks, validating its effectiveness and generalizability.
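The abstract describes the fusion stage only at a high level. As a rough illustration, the sketch below shows one way a channel attention-based fusion module followed by an MLP regression head could operate on stacked per-modality embeddings; the module name, feature dimensions, squeeze-and-excitation style gating, and the [0, 1] trait range are illustrative assumptions, not details taken from the paper.

```python
# Minimal PyTorch sketch of channel-attention fusion plus an MLP regression
# head, loosely following the abstract's description of GAME. Dimensions,
# module names, and the squeeze-and-excitation style gating are assumptions
# for illustration only.
import torch
import torch.nn as nn


class ChannelAttentionFusion(nn.Module):
    """Fuse stacked per-modality features with channel-wise attention."""

    def __init__(self, num_modalities: int, feat_dim: int, reduction: int = 4):
        super().__init__()
        # Squeeze-and-excitation style gating over the modality "channels".
        hidden = max(num_modalities // reduction, 1)
        self.gate = nn.Sequential(
            nn.Linear(num_modalities, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_modalities),
            nn.Sigmoid(),
        )
        # MLP regression head for the five personality traits.
        self.head = nn.Sequential(
            nn.Linear(num_modalities * feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 5),
            nn.Sigmoid(),  # trait scores assumed normalized to [0, 1]
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_modalities, feat_dim), one row per modality
        # (e.g. visual, audio, text embeddings projected to a common size).
        squeeze = feats.mean(dim=-1)             # (batch, num_modalities)
        weights = self.gate(squeeze)             # per-modality attention weights
        weighted = feats * weights.unsqueeze(-1)
        return self.head(weighted.flatten(1))    # (batch, 5) trait predictions


if __name__ == "__main__":
    # Toy usage: three modalities (visual, audio, text), 512-d each.
    fusion = ChannelAttentionFusion(num_modalities=3, feat_dim=512)
    dummy = torch.randn(8, 3, 512)
    print(fusion(dummy).shape)  # torch.Size([8, 5])
```

Gating over the modality axis lets the head reweight visual, audio, and text evidence per sample before regression, which is one plausible reading of the channel attention-based fusion described above.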
Related papers
- MGCR-Net:Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing of video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment [0.39945675027960637]
Predicting personality traits automatically has become a challenging problem in computer vision. This paper introduces an innovative multimodal feature learning framework for personality analysis in short video clips.
arXiv Detail & Related papers (2025-04-15T14:26:12Z) - Interactive Multimodal Fusion with Temporal Modeling [11.506800500772734]
Our approach integrates visual and audio information through a multimodal framework. The visual branch uses a pre-trained ResNet model to extract features from facial images. The audio branches employ pre-trained VGG models to extract VGGish and LogMel features from speech signals. Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for VA estimation in-the-wild.
arXiv Detail & Related papers (2025-03-13T16:31:56Z) - MVCNet: Multi-View Contrastive Network for Motor Imagery Classification [20.78236894605647]
Motor imagery (MI) decoding has received significant attention due to its intuitive mechanism. Most existing models rely on single-stream architectures and overlook the multi-view nature of EEG signals, leading to limited performance and generalization. We propose a multi-view contrastive network (MVCNet), a dual-branch architecture that integrates CNN and Transformer models in parallel to capture both local spatial-temporal features and global temporal dependencies.
arXiv Detail & Related papers (2025-02-18T10:30:53Z) - Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout. DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z) - MTS2Graph: Interpretable Multivariate Time Series Classification with Temporal Evolving Graphs [1.1756822700775666]
We introduce a new framework for interpreting time series data by extracting and clustering the input representative patterns.
We run experiments on eight datasets of the UCR/UEA archive, along with HAR and PAM datasets.
arXiv Detail & Related papers (2023-06-06T16:24:27Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z) - Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems [0.0]
We show that long-term motion patterns alone play a pivotal role in the task of recognizing an event.
Only the temporal features are exploited using a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture.
arXiv Detail & Related papers (2021-11-03T08:30:38Z) - Video-based Facial Expression Recognition using Graph Convolutional Networks [57.980827038988735]
We introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based facial expression recognition.
We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and also one challenging wild dataset AFEW8.0.
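Both GAME's visual stream and the paper above apply graph convolutions to a facial graph. The minimal sketch below shows a single GCN layer over a facial-landmark graph with a fixed, symmetrically normalized adjacency; the 68-landmark chain graph, layer sizes, and the class name FacialGraphConv are illustrative assumptions rather than details from either paper.

```python
# Minimal sketch of one graph-convolution layer over a facial-landmark graph,
# in the spirit of the GCN components mentioned above. The landmark count,
# adjacency construction, and layer sizes are illustrative assumptions only.
import torch
import torch.nn as nn


class FacialGraphConv(nn.Module):
    """One GCN layer: X' = ReLU(D^{-1/2} (A + I) D^{-1/2} X W)."""

    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        # Symmetrically normalize the adjacency with self-loops (fixed graph).
        a_hat = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        self.register_buffer("norm_adj", norm_adj)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_landmarks, in_dim) per-landmark features (e.g. 2D coords).
        return torch.relu(self.norm_adj @ self.linear(x))


if __name__ == "__main__":
    # Toy facial graph: 68 landmarks connected in a chain (illustrative only).
    adj = torch.zeros(68, 68)
    idx = torch.arange(67)
    adj[idx, idx + 1] = 1.0
    adj[idx + 1, idx] = 1.0
    layer = FacialGraphConv(in_dim=2, out_dim=64, adjacency=adj)
    print(layer(torch.randn(4, 68, 2)).shape)  # torch.Size([4, 68, 64])
```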
arXiv Detail & Related papers (2020-10-26T07:31:51Z) - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.