Automatic Generation of Labeled Data for Video-Based Human Pose Analysis
via NLP applied to YouTube Subtitles
- URL: http://arxiv.org/abs/2304.14489v2
- Date: Tue, 2 May 2023 08:54:06 GMT
- Title: Automatic Generation of Labeled Data for Video-Based Human Pose Analysis
via NLP applied to YouTube Subtitles
- Authors: Sebastian Dill, Susi Zhihan, Maurice Rohr, Maziar Sharbafi, Christoph
Hoog Antink
- Abstract summary: We propose a method that makes use of the abundance of fitness videos available online.
We take advantage of the fact that videos often not only show the exercises but also provide language as an additional source of information.
- Score: 2.039924457892648
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: With recent advancements in computer vision as well as machine learning (ML),
video-based at-home exercise evaluation systems have become a popular topic of
current research. However, performance depends heavily on the amount of
available training data. Since labeled datasets specific to exercising are
rare, we propose a method that makes use of the abundance of fitness videos
available online. Specifically, we take advantage of the fact that videos often not
only show the exercises but also provide language as an additional source of
information. With push-ups as an example, we show that through the analysis of
subtitle data using natural language processing (NLP), it is possible to create
a labeled (irrelevant, relevant correct, relevant incorrect) dataset containing
relevant information for pose analysis. In particular, we show that irrelevant
clips ($n=332$) have significantly different joint visibility values compared
to relevant clips ($n=298$). Inspecting cluster centroids also shows different
poses for the different classes.
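Since the abstract describes the pipeline only at a high level, the following is a minimal sketch (not the authors' code) of the two steps it mentions: keyword-based labeling of YouTube subtitle segments into the three classes, and a significance test on joint-visibility values between relevant and irrelevant clips. The use of youtube_transcript_api, the keyword lists, and the Mann-Whitney U test are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of the idea described above; all library choices and keyword
# lists are assumptions made for illustration only.
from youtube_transcript_api import YouTubeTranscriptApi
from scipy.stats import mannwhitneyu

# Hypothetical keyword lists; the paper's actual NLP analysis is more involved.
EXERCISE_TERMS = ("push-up", "pushup", "push up")
ERROR_TERMS = ("mistake", "wrong", "avoid", "don't", "error")


def label_segment(text: str) -> str:
    """Assign one of the paper's three labels to a single subtitle segment."""
    t = text.lower()
    if not any(term in t for term in EXERCISE_TERMS):
        return "irrelevant"
    if any(term in t for term in ERROR_TERMS):
        return "relevant_incorrect"
    return "relevant_correct"


def label_video(video_id: str) -> list[dict]:
    """Fetch YouTube subtitles and attach a label to every subtitle segment."""
    segments = YouTubeTranscriptApi.get_transcript(video_id)
    return [
        {"start": s["start"], "duration": s["duration"],
         "text": s["text"], "label": label_segment(s["text"])}
        for s in segments
    ]


def visibility_differs(relevant_vis: list[float],
                       irrelevant_vis: list[float],
                       alpha: float = 0.05) -> bool:
    """Test whether per-clip mean joint visibility (e.g. scores from a pose
    estimator) differs significantly between relevant and irrelevant clips."""
    _, p_value = mannwhitneyu(relevant_vis, irrelevant_vis)
    return p_value < alpha
```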
Related papers
- Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for further study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- CUPID: Adaptive Curation of Pre-training Data for Video-and-Language Representation Learning [49.18591896085498]
We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
arXiv Detail & Related papers (2021-04-01T06:42:16Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We learn not only the video dynamic information but also the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
- Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [43.002621928500425]
We propose a self-supervised method to learn feature representations from videos.
In addition to standard inter-sample negatives, we extend the negative samples by introducing intra-negative samples.
We conduct experiments on video retrieval and video recognition tasks using the learned video representation.
arXiv Detail & Related papers (2020-08-06T09:08:14Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both standard video dataset (Kinetics-210k) and uncurated web video dataset (-300k) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)