Is end-to-end learning enough for fitness activity recognition?
- URL: http://arxiv.org/abs/2305.08191v1
- Date: Sun, 14 May 2023 16:00:03 GMT
- Title: Is end-to-end learning enough for fitness activity recognition?
- Authors: Antoine Mercier and Guillaume Berger and Sunny Panchal and Florian
Letsch and Cornelius Boehm and Nahua Kang and Ingo Bax and Roland Memisevic
- Abstract summary: We show that end-to-end learning can compete with state-of-the-art action recognition pipelines based on pose estimation.
We also show that end-to-end learning can support temporally fine-grained tasks such as real-time repetition counting.
- Score: 2.4273770300720012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end learning has taken hold of many computer vision tasks, in
particular those related to still images, with task-specific optimization
yielding very strong performance. Nevertheless, human-centric action recognition is
still largely dominated by hand-crafted pipelines, and only individual
components are replaced by neural networks that typically operate on individual
frames. As a testbed to study the relevance of such pipelines, we present a new
fully annotated video dataset of fitness activities. Any recognition
capabilities in this domain are almost exclusively a function of human poses
and their temporal dynamics, so pose-based solutions should perform well. We
show that, with this labelled data, end-to-end learning on raw pixels can
compete with state-of-the-art action recognition pipelines based on pose
estimation. We also show that end-to-end learning can support temporally
fine-grained tasks such as real-time repetition counting.
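To make the contrast with pose-based pipelines concrete, below is a minimal sketch of an end-to-end model operating on raw pixels, with a per-frame head that could serve temporally fine-grained tasks such as repetition counting. This is an illustrative PyTorch sketch under assumed shapes, not the authors' architecture; all names (e.g., `RawPixelActivityNet`) and layer sizes are hypothetical.

```python
import torch
import torch.nn as nn

class RawPixelActivityNet(nn.Module):
    """Illustrative end-to-end video model on raw pixels (hypothetical)."""

    def __init__(self, num_classes: int):
        super().__init__()
        # Spatio-temporal feature extractor applied directly to RGB frames;
        # spatial resolution is pooled away, the temporal axis is preserved.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        self.classifier = nn.Linear(64, num_classes)  # clip-level activity label
        self.rep_head = nn.Linear(64, 1)              # per-frame signal, e.g. for rep counting

    def forward(self, video: torch.Tensor):
        # video: (B, 3, T, H, W) raw pixels, no pose estimation anywhere.
        feats = self.backbone(video).flatten(2).transpose(1, 2)  # (B, T, 64)
        logits = self.classifier(feats.mean(dim=1))              # (B, num_classes)
        rep_signal = self.rep_head(feats).squeeze(-1)            # (B, T)
        return logits, rep_signal
```

A per-frame output like `rep_signal` is what makes temporally fine-grained supervision (e.g., marking repetition boundaries) possible without an intermediate pose representation.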
Related papers
- Towards Learning Discrete Representations via Self-Supervision for Wearables-Based Human Activity Recognition [7.086647707011785]
Human activity recognition (HAR) in wearable computing is typically based on direct processing of sensor data.
Recent advances in applying Vector Quantization (VQ) to wearables enable us to directly learn a mapping between short spans of sensor data and a codebook of vectors.
This work presents a proof-of-concept demonstrating how effective discrete representations can be derived.
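As a rough illustration of the codebook idea (a hedged sketch, not the paper's model; `quantize` and all shapes are hypothetical), nearest-neighbour lookup against a learned codebook might look like this:

```python
import torch

def quantize(window_embedding: torch.Tensor, codebook: torch.Tensor):
    """Map one embedded sensor window (D,) to its nearest codebook vector (K, D)."""
    distances = torch.cdist(window_embedding.unsqueeze(0), codebook)  # (1, K)
    index = distances.argmin(dim=1)             # discrete symbol for this window
    return index.item(), codebook[index].squeeze(0)

# Example: a codebook of 64 learned codes of dimension 16.
codebook = torch.randn(64, 16)
idx, code = quantize(torch.randn(16), codebook)
```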
arXiv Detail & Related papers (2023-06-01T19:49:43Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in deep semantic segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z)
- Muscle Vision: Real Time Keypoint Based Pose Classification of Physical Exercises [52.77024349608834]
3D human pose recognition extrapolated from video has advanced to the point of enabling real-time software applications.
We propose a new machine learning pipeline and web interface that perform human pose recognition on a live video feed, detecting when common exercises are performed and classifying them accordingly.
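The general shape of such a keypoint-based pipeline (an illustrative sketch under assumed shapes, not this paper's implementation) is a pose estimator feeding a lightweight classifier:

```python
import torch
import torch.nn as nn

class KeypointExerciseClassifier(nn.Module):
    """Illustrative classifier over pose keypoints (hypothetical sizes/labels)."""

    def __init__(self, num_joints: int = 17, num_classes: int = 4):
        super().__init__()
        # Small MLP over flattened (x, y) keypoint coordinates of one frame.
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, num_classes),
        )

    def forward(self, keypoints: torch.Tensor) -> torch.Tensor:
        # keypoints: (B, num_joints, 2) produced by an upstream pose estimator.
        return self.net(keypoints.flatten(start_dim=1))
```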
arXiv Detail & Related papers (2022-03-23T00:55:07Z)
- Human-like Relational Models for Activity Recognition in Video [8.87742125296885]
Video activity recognition by deep neural networks is impressive for many classes, yet such networks can struggle to learn critical relationships effectively.
We propose a more human-like approach to activity recognition, which interprets a video in sequential temporal phases.
We apply the method to a challenging subset of the something-something dataset and achieve more robust performance than neural network baselines on challenging activities.
arXiv Detail & Related papers (2021-07-12T11:13:17Z)
- Joint Learning of Neural Transfer and Architecture Adaptation for Image Recognition [77.95361323613147]
Current state-of-the-art visual recognition systems rely on pretraining a neural network on a large-scale dataset and finetuning the network weights on a smaller dataset.
In this work, we prove that dynamically adapting network architectures tailored to each domain task, along with weight finetuning, benefits both efficiency and effectiveness.
Our method can be easily generalized to an unsupervised paradigm by replacing supernet training with self-supervised learning in the source domain tasks and performing linear evaluation in the downstream tasks.
arXiv Detail & Related papers (2021-03-31T08:15:17Z)
- Sense and Learn: Self-Supervision for Omnipresent Sensors [9.442811508809994]
We present a framework named Sense and Learn for representation or feature learning from raw sensory data.
It consists of several auxiliary tasks that can learn high-level and broadly useful features entirely from unannotated data without any human involvement in the tedious labeling process.
Our methodology achieves results that are competitive with supervised approaches and, in most cases, closes the remaining gap by fine-tuning the network while learning the downstream tasks.
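One common pretext task of this kind, sketched below purely for illustration (the paper defines its own set of auxiliary tasks), is predicting whether a sensor window has been temporally shuffled:

```python
import torch

def make_permutation_example(window: torch.Tensor, shuffle: bool):
    """window: (T, C) raw sensor readings -> (possibly shuffled window, label).

    A network trained to predict the label learns temporal structure with no
    human annotation, since the label comes from the transformation itself.
    """
    if shuffle:
        segments = list(window.chunk(4, dim=0))      # split along time
        order = torch.randperm(len(segments))
        window = torch.cat([segments[i] for i in order], dim=0)
    return window, torch.tensor(float(shuffle))
```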
arXiv Detail & Related papers (2020-09-28T11:57:43Z)
- Collaborative Distillation in the Parameter and Spectrum Domains for Video Action Recognition [79.60708268515293]
This paper explores how to train small and efficient networks for action recognition.
We propose two distillation strategies in the frequency domain, namely feature spectrum distillation and parameter distribution distillation.
Our method can achieve higher performance than state-of-the-art methods with the same backbone.
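As one plausible reading of "feature spectrum distillation" (a hedged sketch; the authors' exact loss may differ), the student can be trained to match the teacher's feature magnitudes in the frequency domain:

```python
import torch
import torch.nn.functional as F

def spectrum_distillation_loss(student_feat: torch.Tensor,
                               teacher_feat: torch.Tensor) -> torch.Tensor:
    """Compare feature maps (B, C, H, W) via their 2D FFT magnitude spectra."""
    s_spec = torch.fft.fft2(student_feat).abs()
    t_spec = torch.fft.fft2(teacher_feat).abs()
    return F.mse_loss(s_spec, t_spec)
```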
arXiv Detail & Related papers (2020-09-15T07:29:57Z)
- Getting to 99% Accuracy in Interactive Segmentation [18.207714624149595]
Recent deep-learning based interactive segmentation algorithms have made significant progress in handling complex images.
Yet, deep learning techniques tend to plateau once a rough selection of the target object has been reached.
We propose a novel interactive architecture and a novel training scheme that are both tailored to better exploit the user workflow.
arXiv Detail & Related papers (2020-03-17T20:50:22Z)
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
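The zero-shot classification step in such systems typically scores visual features against semantic label embeddings; a minimal sketch of that idea (illustrative only, not necessarily the paper's exact head):

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(clip_feat: torch.Tensor, label_embeds: torch.Tensor) -> torch.Tensor:
    """clip_feat: (D,) visual feature; label_embeds: (num_classes, D) embeddings.

    Returns cosine similarities, so classes unseen during training can still
    be scored as long as a semantic embedding of their label is available.
    """
    return F.cosine_similarity(clip_feat.unsqueeze(0), label_embeds, dim=1)
```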
arXiv Detail & Related papers (2020-03-12T02:40:36Z)