Multimodal Vision Transformers with Forced Attention for Behavior
Analysis
- URL: http://arxiv.org/abs/2212.03968v1
- Date: Wed, 7 Dec 2022 21:56:50 GMT
- Title: Multimodal Vision Transformers with Forced Attention for Behavior
Analysis
- Authors: Tanay Agrawal, Michal Balazia, Philipp Müller, François Brémond
- Abstract summary: We introduce the Forced Attention (FAt) Transformer, which utilizes forced attention with a modified backbone for input encoding and makes use of additional inputs.
FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition.
We achieve state-of-the-art results for Udiva v0.5, First Impressions v2 and MPII Group Interaction datasets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human behavior understanding requires looking at minute details in the large
context of a scene containing multiple input modalities. This capability is
necessary because it enables the design of more human-like machines. While transformer approaches
have shown great improvements, they face multiple challenges such as lack of
data or background noise. To tackle these, we introduce the Forced Attention
(FAt) Transformer, which utilizes forced attention with a modified backbone for
input encoding and makes use of additional inputs. In addition to improving the
performance on different tasks and inputs, the modification requires less time
and memory resources. We provide a model for a generalised feature extraction
for tasks concerning social signals and behavior analysis. Our focus is on
understanding behavior in videos where people are interacting with each other
or talking into the camera, which simulates the first-person point of view in
social interaction. FAt Transformers are applied to two downstream tasks:
personality recognition and body language recognition. We achieve
state-of-the-art results for Udiva v0.5, First Impressions v2 and MPII Group
Interaction datasets. We further provide an extensive ablation study of the
proposed architecture.
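
The abstract describes forced attention only at a high level and this listing includes no code. The sketch below shows one plausible reading, assuming "forced attention" means biasing self-attention logits toward tokens inside a person-segmentation mask so that background noise is suppressed. The function name, the penalty constant, and the mask layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the forced-attention idea, assuming it amounts to
# biasing attention logits toward segmented person regions. This is an
# illustration, not the FAt Transformer's actual implementation.
import torch
import torch.nn.functional as F

def forced_attention(q, k, v, person_mask, penalty=-1e4):
    """Scaled dot-product attention biased toward a person mask.

    q, k, v:      (batch, heads, tokens, dim)
    person_mask:  (batch, tokens), 1.0 inside the segmented person
                  region, 0.0 for background tokens.
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5         # (B, H, T, T)
    # Add a large negative bias to attention *toward* background keys,
    # pushing probability mass onto person tokens after the softmax.
    bias = (1.0 - person_mask)[:, None, None, :] * penalty
    weights = F.softmax(logits + bias, dim=-1)
    return weights @ v                                   # (B, H, T, dim)

# Hypothetical usage: a 7x7 patch grid with an assumed person region.
q = k = v = torch.randn(2, 8, 49, 64)
mask = torch.zeros(2, 49)
mask[:, 10:30] = 1.0                 # patches covered by the person mask
out = forced_attention(q, k, v, mask)  # (2, 8, 49, 64)
```

A constant-bias mask like this keeps gradients well defined even when a row has no person tokens, since a uniform shift leaves the softmax unchanged; a hard mask (setting background logits to -inf) would not.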
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
Experimental results demonstrate that MPI improves on the previous state of the art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- ViTGaze: Gaze Following with Interaction Features in Vision Transformers [42.08842391756614]
We introduce a novel single-modality gaze following framework, ViTGaze.
In contrast to previous methods, ViTGaze creates a brand new gaze following framework based mainly on powerful encoders.
Our method achieves state-of-the-art (SOTA) performance among all single-modality methods.
arXiv Detail & Related papers (2024-03-19T14:45:17Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection [20.983998911754792]
Two-stage Human-Object Interaction (HOI) detectors suffer from lower performance than one-stage methods.
We propose Vision Transformer based Pose-Conditioned Self-Loop Graph (ViPLO) to resolve these problems.
ViPLO achieves the state-of-the-art results on two public benchmarks.
arXiv Detail & Related papers (2023-04-17T09:44:54Z)
- Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding [0.0]
We propose a flexible model for the task which exploits all available data.
The task involves complex relations; to avoid using a large model specifically for video processing, we propose the use of behaviour encoding.
arXiv Detail & Related papers (2021-12-22T19:14:55Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)