Related papers: Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics

Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics

URL: http://arxiv.org/abs/2510.04753v1
Date: Mon, 06 Oct 2025 12:31:15 GMT
Title: Beyond Appearance: Transformer-based Person Identification from Conversational Dynamics
Authors: Masoumeh Chapariniya, Teodora Vukovic, Sarah Ebling, Volker Dellwo,
Abstract summary: We implement and evaluate a two-stream framework that separately models spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints.<n>Our experiments compare pre-trained and from-scratch training, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling.<n>Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics.
Score: 5.920768116941384
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper investigates the performance of transformer-based architectures for person identification in natural, face-to-face conversation scenario. We implement and evaluate a two-stream framework that separately models spatial configurations and temporal motion patterns of 133 COCO WholeBody keypoints, extracted from a subset of the CANDOR conversational corpus. Our experiments compare pre-trained and from-scratch training, investigate the use of velocity features, and introduce a multi-scale temporal transformer for hierarchical motion modeling. Results demonstrate that domain-specific training significantly outperforms transfer learning, and that spatial configurations carry more discriminative information than temporal dynamics. The spatial transformer achieves 95.74% accuracy, while the multi-scale temporal transformer achieves 93.90%. Feature-level fusion pushes performance to 98.03%, confirming that postural and dynamic information are complementary. These findings highlight the potential of transformer architectures for person identification in natural interactions and provide insights for future multimodal and cross-cultural studies.

Related papers

Multivariate Long-term Time Series Forecasting with Fourier Neural Filter [42.60778405812048]
We introduce FNF as the backbone and DBD as architecture to provide excellent learning capabilities and optimal learning pathways for spatial-temporal modeling.<n>We show that FNF unifies local time-domain and global frequency-domain information processing within a single backbone that extends naturally to spatial modeling.
arXiv Detail & Related papers (2025-06-10T18:40:20Z)
Knowledge-enhanced Transformer for Multivariate Long Sequence Time-series Forecasting [4.645182684813973]
We introduce a novel approach that encapsulates conceptual relationships among variables within a well-defined knowledge graph. We investigate the influence of this integration into seminal architectures such as PatchTST, Autoformer, Informer, and Vanilla Transformer. This enhancement empowers transformer-based architectures to address the inherent structural relation between variables.
arXiv Detail & Related papers (2024-11-17T11:53:54Z)
Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model. Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z)
TempoFormer: A Transformer for Temporally-aware Representations in Change Detection [12.063146420389371]
We introduce TempoFormer, the first task-agnostic transformer-based and temporally-aware model for dynamic representation learning. Our approach is jointly trained on inter and intra context dynamics and introduces a novel temporal variation of rotary positional embeddings. We show new SOTA performance on three different real-time change detection tasks.
arXiv Detail & Related papers (2024-08-28T10:25:53Z)
Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models. SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details. Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
A Theoretical Analysis of Self-Supervised Learning for Vision Transformers [66.08606211686339]
Masked autoencoders (MAE) and contrastive learning (CL) capture different types of representations.<n>We study the training dynamics of one-layer softmax-based vision transformers (ViTs) on both MAE and CL objectives.
arXiv Detail & Related papers (2024-03-04T17:24:03Z)
Learning Causal Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition [12.522600594024112]
Few-shot action recognition aims at quickly adapting a pre-trained model to novel data. Key challenges include how to identify and leverage the transferable knowledge learned by the pre-trained model. We propose CDTD, or Causal Domain-Invariant Temporal Dynamics for knowledge transfer.
arXiv Detail & Related papers (2024-02-20T04:09:58Z)
Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer [7.053333608725945]
Movement synchrony reflects the coordination of body movements between interacting dyads. This paper proposes a skeleton-based graph transformer for movement synchrony estimation. Our method achieved an overall accuracy of 88.98% and surpassed its counterparts by a wide margin.
arXiv Detail & Related papers (2022-08-01T22:35:32Z)
Exploring Structure-aware Transformer over Interaction Proposals for Human-Object Interaction Detection [119.93025368028083]
We design a novel Transformer-style Human-Object Interaction (HOI) detector, i.e., Structure-aware Transformer over Interaction Proposals (STIP) STIP decomposes the process of HOI set prediction into two subsequent phases, i.e., an interaction proposal generation is first performed, and then followed by transforming the non-parametric interaction proposals into HOI predictions via a structure-aware Transformer. The structure-aware Transformer upgrades vanilla Transformer by encoding additionally the holistically semantic structure among interaction proposals as well as the locally spatial structure of human/object within each interaction proposal, so as to strengthen HOI
arXiv Detail & Related papers (2022-06-13T16:21:08Z)
Stochastic Layers in Vision Transformers [85.38733795180497]
We introduce fully layers in vision transformers, without causing any severe drop in performance. The additionality boosts the robustness of visual features and strengthens privacy. We use our features for three different applications, namely, adversarial robustness, network calibration, and feature privacy.
arXiv Detail & Related papers (2021-12-30T16:07:59Z)
Variational Predictive Routing with Nested Subjective Timescales [1.6114012813668934]
We present Variational Predictive Routing (PRV) - a neural inference system that organizes latent video features in a temporal hierarchy. We show that VPR is able to detect event boundaries, disentangletemporal features, adapt to the dynamics hierarchy of the data, and produce accurate time-agnostic rollouts of the future.
arXiv Detail & Related papers (2021-10-21T16:12:59Z)
Multiple Object Tracking with Correlation Learning [16.959379957515974]
We propose to exploit the local correlation module to model the topological relationship between targets and their surrounding environment. Specifically, we establish dense correspondences of each spatial location and its context, and explicitly constrain the correlation volumes through self-supervised learning. Our approach demonstrates the effectiveness of correlation learning with the superior performance and obtains state-of-the-art MOTA of 76.5% and IDF1 of 73.6% on MOT17.
arXiv Detail & Related papers (2021-04-08T06:48:02Z)
AutoTrans: Automating Transformer Design via Reinforced Architecture Search [52.48985245743108]
This paper empirically explore how to set layer-norm, whether to scale, number of layers, number of heads, activation function, etc, so that one can obtain a transformer architecture that better suits the tasks at hand. Experiments on the CoNLL03, Multi-30k, IWSLT14 and WMT-14 shows that the searched transformer model can outperform the standard transformers.
arXiv Detail & Related papers (2020-09-04T08:46:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.