Multi-View Fusion Transformer for Sensor-Based Human Activity
Recognition
- URL: http://arxiv.org/abs/2202.12949v1
- Date: Wed, 16 Feb 2022 07:15:22 GMT
- Title: Multi-View Fusion Transformer for Sensor-Based Human Activity
Recognition
- Authors: Yimu Wang, Kun Yu, Yan Wang, Hui Xue
- Abstract summary: Sensor-based human activity recognition (HAR) aims to recognize human activities based on the availability of rich time-series data collected from multi-modal sensors such as accelerometers and gyroscopes.
Recent deep learning methods are focusing on one view of the data, i.e., the temporal view, while shallow methods tend to utilize the hand-craft features for recognition, e.g., the statistics view.
We propose a novel method, namely multi-view fusion transformer (MVFT) along with a novel attention mechanism.
- Score: 15.845205542668472
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As a fundamental problem in ubiquitous computing and machine learning,
sensor-based human activity recognition (HAR) has drawn extensive attention and
made great progress in recent years. HAR aims to recognize human activities
based on the availability of rich time-series data collected from multi-modal
sensors such as accelerometers and gyroscopes. However, recent deep learning
methods are focusing on one view of the data, i.e., the temporal view, while
shallow methods tend to utilize the hand-craft features for recognition, e.g.,
the statistics view. In this paper, to extract a better feature for advancing
the performance, we propose a novel method, namely multi-view fusion
transformer (MVFT) along with a novel attention mechanism. First, MVFT encodes
three views of information, i.e., the temporal, frequent, and statistical views
to generate multi-view features. Second, the novel attention mechanism uncovers
inner- and cross-view clues to catalyze mutual interactions between three views
for detailed relation modeling. Moreover, extensive experiments on two datasets
illustrate the superiority of our methods over several state-of-the-art
methods.
Related papers
- From CNNs to Transformers in Multimodal Human Action Recognition: A Survey [23.674123304219822]
Human action recognition is one of the most widely studied research problems in Computer Vision.
Recent studies have shown that addressing it using multimodal data leads to superior performance.
Recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task.
arXiv Detail & Related papers (2024-05-22T02:11:18Z) - Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z) - Multimodal Visual-Tactile Representation Learning through
Self-Supervised Contrastive Pre-Training [0.850206009406913]
MViTac is a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion.
By availing both sensory inputs, MViTac leverages intra and inter-modality losses for learning representations, resulting in enhanced material property classification and more adept grasping prediction.
arXiv Detail & Related papers (2024-01-22T15:11:57Z) - Two Approaches to Supervised Image Segmentation [55.616364225463066]
The present work develops comparison experiments between deep learning and multiset neurons approaches.
The deep learning approach confirmed its potential for performing image segmentation.
The alternative multiset methodology allowed for enhanced accuracy while requiring little computational resources.
arXiv Detail & Related papers (2023-07-19T16:42:52Z) - Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets, Kinetics-400, Kinetics-700, Moments-in-Time, Activitynet and Something-something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z) - Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
gait recognition in the wild is a more practical problem that has attracted the attention of the community of multimedia and computer vision.
This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z) - UMSNet: An Universal Multi-sensor Network for Human Activity Recognition [10.952666953066542]
This paper proposes a universal multi-sensor network (UMSNet) for human activity recognition.
In particular, we propose a new lightweight sensor residual block (called LSR block), which improves the performance.
Our framework has a clear structure and can be directly applied to various types of multi-modal Time Series Classification tasks.
arXiv Detail & Related papers (2022-05-24T03:29:54Z) - Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z) - Inertial Sensor Data To Image Encoding For Human Action Recognition [0.0]
Convolutional Neural Networks (CNNs) are successful deep learning models in the field of computer vision.
In this paper, we use 4 types of spatial domain methods for transforming inertial sensor data to activity images.
For creating a multimodal fusion framework, we made each type of activity images multimodal by convolving with two spatial domain filters.
arXiv Detail & Related papers (2021-05-28T01:22:52Z) - Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem.
The proposed CAM detects the attention differences among multi-view, and adaptively integrates frame-level information to benefit each other.
Experiments on four action datasets illustrate the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for
Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.