(LiFT) Lightweight Fitness Transformer: A language-vision model for Remote Monitoring of Physical Training
- URL: http://arxiv.org/abs/2506.06480v1
- Date: Fri, 06 Jun 2025 19:07:06 GMT
- Title: (LiFT) Lightweight Fitness Transformer: A language-vision model for Remote Monitoring of Physical Training
- Authors: A. Postlmayr, P. Cosman, S. Dey
- Abstract summary: We introduce a fitness tracking system that enables remote monitoring of exercises using only an RGB smartphone camera. Our model detects exercises with 76.5% accuracy and counts repetitions with 85.3% off-by-one accuracy, using only RGB video.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce a fitness tracking system that enables remote monitoring of exercises using only an RGB smartphone camera, making fitness tracking more private, scalable, and cost effective. Although prior work explored automated exercise supervision, existing models are either too limited in exercise variety or too complex for real-world deployment. Prior approaches typically focus on a small set of exercises and fail to generalize across diverse movements. In contrast, we develop a robust, multitask motion analysis model capable of performing exercise detection and repetition counting across hundreds of exercises, a scale far beyond previous methods. We overcome previous data limitations by assembling a large-scale fitness dataset, Olympia, covering more than 1,900 exercises. To our knowledge, our vision-language model is the first that can perform multiple tasks on skeletal fitness data. On Olympia, our model can detect exercises with 76.5% accuracy and count repetitions with 85.3% off-by-one accuracy, using only RGB video. By presenting a single vision-language transformer model for both exercise identification and rep counting, we take a significant step toward democratizing AI-powered fitness tracking.
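The repetition-counting metric reported above, off-by-one accuracy, is the fraction of predicted counts within one rep of the ground truth. A minimal sketch of that computation (the function and example values are illustrative, not taken from the paper):

```python
def off_by_one_accuracy(predicted, actual):
    """Fraction of rep-count predictions within +/-1 of the true count."""
    assert len(predicted) == len(actual) and predicted
    hits = sum(abs(p - a) <= 1 for p, a in zip(predicted, actual))
    return hits / len(predicted)

# Example: 3 of 4 predictions are within one rep of the ground truth.
print(off_by_one_accuracy([10, 8, 5, 12], [10, 9, 3, 11]))  # 0.75
```

This tolerance matters for exercise video, where the first and last partial movements of a set are often ambiguous even to human annotators.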
Related papers
- Is Diversity All You Need for Scalable Robotic Manipulation? [50.747150672933316]
We investigate the nuanced role of data diversity in robot learning by examining three critical dimensions (task: what to do; embodiment: which robot to use; expert: who demonstrates), challenging the conventional intuition that "more diverse is better." We show that task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios. We propose a distribution-debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data.
arXiv Detail & Related papers (2025-07-08T17:52:44Z)
- Intelligent Repetition Counting for Unseen Exercises: A Few-Shot Learning Approach with Sensor Signals [0.4998632546280975]
This study develops a method to automatically count exercise repetitions by analyzing IMU signals.
We propose a repetition counting technique utilizing a deep metric-based few-shot learning approach.
We show an 86.8% probability of accurately counting ten or more repetitions within a single set across 28 different exercises.
arXiv Detail & Related papers (2024-10-01T05:04:40Z)
- Generalization of Fitness Exercise Recognition from Doppler Measurements by Domain-adaption and Few-Shot Learning [12.238586191793997]
In previous works, a mobile application was developed using an unmodified commercial off-the-shelf smartphone to recognize whole-body exercises.
Applying such a lab-environment trained model on realistic application variations causes a significant drop in performance.
This paper presents a database with controlled and uncontrolled subsets of fitness exercises.
arXiv Detail & Related papers (2023-11-20T16:40:48Z)
- Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning [58.3994826169858]
We introduce RoboFuME, a reset-free fine-tuning system for robotic reinforcement learning.
Our insights are to utilize offline reinforcement learning techniques to ensure efficient online fine-tuning of a pre-trained policy.
Our method can incorporate data from an existing robot dataset and improve on a target task within as little as 3 hours of autonomous real-world experience.
arXiv Detail & Related papers (2023-10-23T17:50:08Z)
- Pūioio: On-device Real-Time Smartphone-Based Automated Exercise Repetition Counting System [1.4050836886292868]
We introduce a deep learning based exercise repetition counting system for smartphones consisting of five components: (1) Pose estimation, (2) Thresholding, (3) Optical flow, (4) State machine, and (5) Counter.
The system is implemented via a cross-platform mobile application named Pūioio that uses only the smartphone camera to track repetitions in real time for three standard exercises: squats, push-ups, and pull-ups.
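The state-machine and counter stages of such a pipeline can be sketched as follows. This is a minimal, hypothetical illustration (thresholds, signal, and function names are assumptions, not the paper's implementation): a rep is registered when a normalized pose signal crosses a high threshold after having dropped below a low one, with the hysteresis gap preventing double counting on noisy frames.

```python
def count_reps(signal, low=0.3, high=0.7):
    """Count repetitions with a two-state machine over a normalized
    joint-position signal (e.g. hip height for squats).
    Hysteresis between `low` and `high` suppresses jitter."""
    state = "down"
    reps = 0
    for x in signal:
        if state == "down" and x > high:
            state = "up"
            reps += 1
        elif state == "up" and x < low:
            state = "down"
    return reps

# Two full squat-like cycles, with noise near the thresholds.
sig = [0.1, 0.5, 0.9, 0.8, 0.2, 0.1, 0.6, 0.95, 0.4, 0.2]
print(count_reps(sig))  # 2
```

In a real pipeline the input signal would come from the pose-estimation and optical-flow stages rather than a hand-written list.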
arXiv Detail & Related papers (2023-07-22T01:38:02Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enables sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting language models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Fast and Robust Video-Based Exercise Classification via Body Pose Tracking and Scalable Multivariate Time Series Classifiers [13.561233730881279]
We present an application for classifying strength and conditioning (S&C) exercises from video.
We propose an approach named BodyMTS to turn video into time series by employing body pose tracking.
We show that BodyMTS achieves an average accuracy of 87%, which is significantly higher than the accuracy of human domain experts.
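The core transformation BodyMTS describes, turning per-frame pose keypoints into a multivariate time series for an off-the-shelf classifier, can be sketched as below. This is a generic illustration under assumed conventions (2D keypoints, one channel per coordinate), not the paper's exact data layout:

```python
def keypoints_to_mts(frames):
    """Flatten per-frame 2D pose keypoints into a multivariate time
    series: one channel per keypoint coordinate, one step per frame.
    `frames` is a list of equal-length [(x, y), ...] keypoint lists."""
    n_kp = len(frames[0])
    channels = [[] for _ in range(2 * n_kp)]
    for frame in frames:
        for k, (x, y) in enumerate(frame):
            channels[2 * k].append(x)
            channels[2 * k + 1].append(y)
    return channels  # shape: (2 * n_keypoints) x n_frames

# Toy example: 2 keypoints tracked over 3 frames -> 4 channels of length 3.
frames = [[(0.1, 0.2), (0.3, 0.4)],
          [(0.2, 0.3), (0.4, 0.5)],
          [(0.3, 0.4), (0.5, 0.6)]]
mts = keypoints_to_mts(frames)
print(len(mts), len(mts[0]))  # 4 3
```

Once in this form, any multivariate time-series classifier can be applied without video-specific machinery.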
arXiv Detail & Related papers (2022-10-02T13:03:38Z)
- Muscle Vision: Real Time Keypoint Based Pose Classification of Physical Exercises [52.77024349608834]
3D human pose recognition extrapolated from video has advanced to the point of enabling real-time software applications.
We propose a new machine learning pipeline and web interface that performs human pose recognition on a live video feed to detect when common exercises are performed and classify them accordingly.
arXiv Detail & Related papers (2022-03-23T00:55:07Z)
- Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment [12.040334568268445]
We propose to learn exercise-specific representations from unlabeled samples.
In particular, our domain knowledge-informed self-supervised approaches exploit the harmonic motion of the exercise actions.
We show that our self-supervised representations outperform off-the-shelf 2D- and 3D-pose estimators.
arXiv Detail & Related papers (2022-02-28T18:40:02Z)
- Learnable Online Graph Representations for 3D Multi-Object Tracking [156.58876381318402]
We propose a unified, learning-based approach to the 3D MOT problem.
We employ a Neural Message Passing network for data association that is fully trainable.
We show the merit of the proposed approach on the publicly available nuScenes dataset by achieving state-of-the-art performance of 65.6% AMOTA and 58% fewer ID-switches.
arXiv Detail & Related papers (2021-04-23T17:59:28Z)
- Monocular Real-time Full Body Capture with Inter-part Correlations [66.22835689189237]
We present the first method for real-time full body capture that estimates shape and motion of body and hands together with a dynamic 3D face model from a single color image.
Our approach uses a new neural network architecture that exploits correlations between body and hands at high computational efficiency.
arXiv Detail & Related papers (2020-12-11T02:37:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.