Multimodal Vision Transformers with Forced Attention for Behavior
Analysis
- URL: http://arxiv.org/abs/2212.03968v1
- Date: Wed, 7 Dec 2022 21:56:50 GMT
- Title: Multimodal Vision Transformers with Forced Attention for Behavior
Analysis
- Authors: Tanay Agrawal, Michal Balazia, Philipp Müller, François
Brémond
- Abstract summary: We introduce the Forced Attention (FAt) Transformer, which utilizes forced attention, a modified backbone for input encoding, and additional inputs.
FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition.
We achieve state-of-the-art results for Udiva v0.5, First Impressions v2 and MPII Group Interaction datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human behavior understanding requires looking at minute details in the large
context of a scene containing multiple input modalities. It is necessary as it
allows the design of more human-like machines. While transformer approaches
have shown great improvements, they face multiple challenges such as lack of
data or background noise. To tackle these, we introduce the Forced Attention
(FAt) Transformer, which utilizes forced attention, a modified backbone for
input encoding, and additional inputs. In addition to improving the
performance on different tasks and inputs, the modification requires less time
and memory. We provide a model for generalised feature extraction
for tasks concerning social signals and behavior analysis. Our focus is on
understanding behavior in videos where people are interacting with each other
or talking into the camera, which simulates the first-person point of view in
social interaction. FAt Transformers are applied to two downstream tasks:
personality recognition and body language recognition. We achieve
state-of-the-art results for Udiva v0.5, First Impressions v2 and MPII Group
Interaction datasets. We further provide an extensive ablation study of the
proposed architecture.
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI achieves a remarkable improvement of 10% to 64% over the previous state of the art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
- ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection [20.983998911754792]
Two-stage Human-Object Interaction (HOI) detectors suffer from lower performance than one-stage methods.
We propose Vision Transformer based Pose-Conditioned Self-Loop Graph (ViPLO) to resolve these problems.
ViPLO achieves state-of-the-art results on two public benchmarks.
arXiv Detail & Related papers (2023-04-17T09:44:54Z)
- Multimodal Personality Recognition using Cross-Attention Transformer and Behaviour Encoding [0.0]
We propose a flexible model for the task which exploits all available data.
The task involves complex relations; to avoid using a large model specifically for video processing, we propose the use of behaviour encoding.
arXiv Detail & Related papers (2021-12-22T19:14:55Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that includes an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)