Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions
- URL: http://arxiv.org/abs/2109.09487v1
- Date: Mon, 20 Sep 2021 12:45:04 GMT
- Title: Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions
- Authors: David Curto, Albert Clapés, Javier Selva, Sorina Smeureanu, Julio C. S. Jacques Junior, David Gallardo-Pujol, Georgina Guilera, David Leiva, Thomas B. Moeslund, Sergio Escalera and Cristina Palmero
- Abstract summary: We present Dyadformer, a novel multi-modal multi-subject Transformer architecture to model individual and interpersonal features in dyadic interactions.
Our proposed cross-subject layer allows the network to explicitly model interactions among subjects through attentional operations.
This proof-of-concept approach shows how multi-modality and joint modeling of both interactants over longer periods of time help to predict individual attributes.
- Score: 33.67477398036821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Personality computing has become an emerging topic in computer vision due to its wide range of applications. However, most works on the topic have focused on analyzing the individual, even when applied to interaction scenarios, and only for short periods of time. To address these limitations, we present the Dyadformer, a novel multi-modal multi-subject Transformer architecture that models individual and interpersonal features in dyadic interactions using variable time windows, thus allowing the capture of long-term interdependencies. Our proposed cross-subject layer allows the network to explicitly model interactions among subjects through attentional operations. This proof-of-concept approach shows how multi-modality and joint modeling of both interactants over longer periods of time help to predict individual attributes. With the Dyadformer, we improve state-of-the-art self-reported personality inference results on individual subjects on the UDIVA v0.5 dataset.
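To make the cross-subject layer concrete, the following is a minimal PyTorch sketch of what such attentional cross-subject modeling could look like, assuming each subject's multi-modal features have already been fused into a sequence of d-dimensional tokens. Class and parameter names (CrossSubjectLayer, dim, num_heads) are illustrative assumptions, not the paper's actual code.

```python
# A minimal sketch of a cross-subject attention layer, assuming each
# subject's multi-modal features are already fused into a sequence of
# d-dimensional tokens. All names and dimensions are illustrative; this
# is not the paper's actual implementation.
import torch
import torch.nn as nn


class CrossSubjectLayer(nn.Module):
    """One interactant's tokens attend to the other interactant's tokens."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x_self: torch.Tensor, x_other: torch.Tensor) -> torch.Tensor:
        # Queries come from one subject, keys/values from the other, so
        # interpersonal dependencies are modeled explicitly by attention.
        attn_out, _ = self.cross_attn(self.norm_attn(x_self), x_other, x_other)
        x = x_self + attn_out
        return x + self.ffn(self.norm_ffn(x))


# Toy usage: two subjects, 128 temporal tokens each, 256-dim features.
layer = CrossSubjectLayer()
subj_a, subj_b = torch.randn(1, 128, 256), torch.randn(1, 128, 256)
a_out = layer(subj_a, subj_b)  # subject A attends to subject B
b_out = layer(subj_b, subj_a)  # subject B attends to subject A
```

Applying the same layer symmetrically in both directions, as in the toy usage above, is one plausible way to update both interactants jointly; longer time windows simply mean longer token sequences entering the attention.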
Related papers
- Personality Analysis from Online Short Video Platforms with Multi-domain Adaptation [16.555668668581237]
Personality analysis from online short videos has gained prominence due to its applications in personalized recommendation systems, sentiment analysis, and human-computer interaction.
Traditional assessment methods, such as questionnaires based on the Big Five Personality Framework, are limited by self-report biases and are impractical for large-scale or real-time analysis.
We propose a novel multi-modal personality analysis framework that addresses these challenges by synchronizing and integrating features from multiple modalities.
arXiv Detail & Related papers (2024-10-26T03:29:32Z)
- Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering [53.39158264785098]
Long-term Video Question Answering (VideoQA) is a challenging vision-and-language bridging task.
We present an entirely end-to-end solution for VideoQA: Multi-granularity Contrastive cross-modal collaborative Generation model.
arXiv Detail & Related papers (2024-10-12T06:21:58Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations [39.79734528362605]
A multimodal attention network captures cross-modal interactions at various levels of spatial abstraction.
The AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level.
arXiv Detail & Related papers (2024-01-26T19:17:05Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning [35.88753097105914]
We propose the UNIMO-3 model, which has the capacity to simultaneously learn the multimodal in-layer interaction and cross-layer interaction.
Our model achieves state-of-the-art performance on various downstream tasks, and an ablation study shows that effective cross-layer learning improves the model's ability of multimodal representation.
arXiv Detail & Related papers (2023-05-23T05:11:34Z)
- A Probabilistic Model Of Interaction Dynamics for Dyadic Face-to-Face Settings [1.9544213396776275]
We develop a probabilistic model to capture the interaction dynamics between pairs of participants in a face-to-face setting.
This interaction encoding is then used to influence the generation when predicting one agent's future dynamics.
We show that our model successfully distinguishes between the modes based on their interaction dynamics.
arXiv Detail & Related papers (2022-07-10T23:31:27Z)
- MultiViz: An Analysis Benchmark for Visualizing and Understanding Multimodal Models [103.9987158554515]
MultiViz is a method for analyzing the behavior of multimodal models by scaffolding the problem of interpretability into 4 stages.
We show that the complementary stages in MultiViz together enable users to simulate model predictions, assign interpretable concepts to features, perform error analysis on model misclassifications, and use insights from error analysis to debug models.
arXiv Detail & Related papers (2022-06-30T18:42:06Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
- Multi-Agent Imitation Learning with Copulas [102.27052968901894]
Multi-agent imitation learning aims to train multiple agents to perform tasks from demonstrations by learning a mapping between observations and actions.
In this paper, we propose to use copulas, powerful statistical tools for capturing dependence among random variables, to explicitly model the correlation and coordination in multi-agent systems.
Our proposed model is able to separately learn marginals that capture the local behavioral patterns of each individual agent, as well as a copula function that solely and fully captures the dependence structure among agents (see the sketch after this list).
arXiv Detail & Related papers (2021-07-10T03:49:41Z)
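The marginal/copula factorization described in that last summary is an instance of Sklar's theorem. As a clarifying sketch (the notation is ours, not the paper's), the joint distribution over the n agents splits into per-agent marginals and a copula that carries all of the dependence:

```latex
% Sklar's theorem: the joint CDF factors into per-agent marginals F_i
% and a copula C holding the entire dependence structure.
F(x_1, \ldots, x_n) = C\bigl(F_1(x_1), \ldots, F_n(x_n)\bigr)
% For densities: independent marginal terms f_i times a copula
% density c that alone captures the agents' coordination.
f(x_1, \ldots, x_n) = c\bigl(F_1(x_1), \ldots, F_n(x_n)\bigr) \prod_{i=1}^{n} f_i(x_i)
```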