MAGIC-TBR: Multiview Attention Fusion for Transformer-based Bodily
Behavior Recognition in Group Settings
- URL: http://arxiv.org/abs/2309.10765v1
- Date: Tue, 19 Sep 2023 17:04:36 GMT
- Title: MAGIC-TBR: Multiview Attention Fusion for Transformer-based Bodily
Behavior Recognition in Group Settings
- Authors: Surbhi Madan, Rishabh Jain, Gulshan Sharma, Ramanathan Subramanian and
Abhinav Dhall
- Abstract summary: This paper proposes a multiview attention fusion method named MAGIC-TBR that combines features extracted from videos and their corresponding Discrete Cosine Transform coefficients via a transformer-based approach.
Experiments are conducted on the BBSI dataset and the results demonstrate the effectiveness of the proposed feature fusion with multiview attention.
- Score: 9.185580170954802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Bodily behavioral language is an important social cue, and its
automated analysis helps enhance the social understanding of artificial
intelligence systems. Furthermore, behavioral language cues are essential for
active engagement in social agent-based user interactions. Despite the progress
in computer vision for tasks like head and body pose estimation, there is still
a need to explore the detection of finer behaviors such as gesturing, grooming,
or fumbling. This paper proposes a multiview attention fusion method named
MAGIC-TBR that combines features extracted from videos and their corresponding
Discrete Cosine Transform coefficients via a transformer-based approach. The
experiments are conducted on the BBSI dataset and the results demonstrate the
effectiveness of the proposed feature fusion with multiview attention. The code
is available at: https://github.com/surbhimadan92/MAGIC-TBR
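
A minimal sketch of the idea described in the abstract, assuming pre-extracted
per-view video embeddings and embeddings of the corresponding DCT coefficients;
module names, feature dimensions, and the soft view-weighting below are
illustrative PyTorch assumptions, not the authors' exact architecture (the
linked repository is authoritative).

import torch
import torch.nn as nn

class MultiviewAttentionFusion(nn.Module):
    """Fuse per-view video and DCT features with a transformer encoder."""

    def __init__(self, d_model=256, n_heads=4, n_classes=4):
        # n_classes is an illustrative default; set it to the number of
        # behavior classes in the target dataset.
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.view_score = nn.Linear(d_model, 1)  # one weight per fused token
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, video_feats, dct_feats):
        # video_feats, dct_feats: (batch, n_views, d_model), hypothetical shapes.
        # Concatenating along the token axis lets self-attention mix
        # information across camera views and across the two feature types.
        tokens = torch.cat([video_feats, dct_feats], dim=1)     # (B, 2V, D)
        fused = self.encoder(tokens)                            # (B, 2V, D)
        # Soft attention over the fused tokens, one weight per view/feature.
        weights = torch.softmax(self.view_score(fused), dim=1)  # (B, 2V, 1)
        pooled = (weights * fused).sum(dim=1)                   # (B, D)
        return self.classifier(pooled)

# Stand-in usage: batch of 8 clips, 2 camera views, 256-d features per view.
model = MultiviewAttentionFusion()
logits = model(torch.randn(8, 2, 256), torch.randn(8, 2, 256))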
Related papers
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
The experimental results demonstrate that MPI yields a remarkable improvement of 10% to 64% over the previous state-of-the-art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Computation and Parameter Efficient Multi-Modal Fusion Transformer for Cued Speech Recognition [48.84506301960988]
Cued Speech (CS) is a purely visual coding method used by hearing-impaired people.
Automatic CS recognition (ACSR) seeks to transcribe visual cues of speech into text.
arXiv Detail & Related papers (2024-01-31T05:20:29Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- Egocentric RGB+Depth Action Recognition in Industry-Like Settings [50.38638300332429]
Our work focuses on recognizing actions from egocentric RGB and Depth modalities in an industry-like environment.
Our framework is based on the 3D Video SWIN Transformer to encode both RGB and Depth modalities effectively.
Our method also secured first place at the multimodal action recognition challenge at ICIAP 2023.
arXiv Detail & Related papers (2023-09-25T08:56:22Z)
- ICAFusion: Iterative Cross-Attention Guided Feature Fusion for Multispectral Object Detection [25.66305300362193]
A novel feature fusion framework of dual cross-attention transformers is proposed to model global feature interactions.
This framework enhances the discriminability of object features through the query-guided cross-attention mechanism (see the sketch after this list).
The proposed method achieves superior performance and faster inference, making it suitable for various practical scenarios.
arXiv Detail & Related papers (2023-08-15T00:02:10Z)
- Multimodal Vision Transformers with Forced Attention for Behavior Analysis [0.0]
We introduce the Forced Attention (FAt) Transformer, which utilizes forced attention with a modified backbone for input encoding and makes use of additional inputs.
FAt Transformers are applied to two downstream tasks: personality recognition and body language recognition.
We achieve state-of-the-art results for Udiva v0.5, First Impressions v2 and MPII Group Interaction datasets.
arXiv Detail & Related papers (2022-12-07T21:56:50Z)
- Self-Attention Attribution: Interpreting Information Interactions Inside Transformer [89.21584915290319]
We propose a self-attention attribution method to interpret the information interactions inside Transformer.
We show that the attribution results can be used as adversarial patterns to implement non-targeted attacks towards BERT.
arXiv Detail & Related papers (2020-04-23T14:58:22Z)
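
For the ICAFusion entry above, a minimal sketch of query-guided cross-attention
fusion between two modalities, assuming flattened token features of a common
dimension; class and variable names, shapes, and the residual-plus-norm layout
are illustrative assumptions, and the actual ICAFusion architecture differs in
detail.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Dual cross-attention: each modality queries the other for cues."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.rgb_to_aux = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.aux_to_rgb = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(d_model)
        self.norm_aux = nn.LayerNorm(d_model)

    def forward(self, rgb, aux):
        # rgb, aux: (batch, tokens, d_model) flattened feature maps.
        # RGB tokens act as queries over the auxiliary modality (e.g. thermal),
        # and vice versa, enriching each stream with complementary features.
        rgb_enriched, _ = self.rgb_to_aux(query=rgb, key=aux, value=aux)
        aux_enriched, _ = self.aux_to_rgb(query=aux, key=rgb, value=rgb)
        rgb = self.norm_rgb(rgb + rgb_enriched)  # residual connection + norm
        aux = self.norm_aux(aux + aux_enriched)
        return rgb, aux

# Stand-in usage: 8 images, 196 tokens per modality, 256-d features.
fusion = CrossAttentionFusion()
rgb_out, aux_out = fusion(torch.randn(8, 196, 256), torch.randn(8, 196, 256))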