BABEL: Bodies, Action and Behavior with English Labels
- URL: http://arxiv.org/abs/2106.09696v1
- Date: Thu, 17 Jun 2021 17:51:14 GMT
- Title: BABEL: Bodies, Action and Behavior with English Labels
- Authors: Abhinanda R. Punnakkal (1), Arjun Chandrasekaran (1), Nikos Athanasiou
(1), Alejandra Quiros-Ramirez (2), Michael J. Black (1) ((1) Max Planck
Institute for Intelligent Systems, (2) Universitat Konstanz)
- Abstract summary: We present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences.
There are over 28k sequence labels and 63k frame labels in BABEL, belonging to over 250 unique action categories.
We demonstrate the value of BABEL as a benchmark and evaluate the performance of models on 3D action recognition.
- Score: 53.83774092560076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the semantics of human movement -- the what, how and why of the
movement -- is an important problem that requires datasets of human actions
with semantic labels. Existing datasets take one of two approaches. Large-scale
video datasets contain many action labels but do not contain ground-truth 3D
human motion. Alternatively, motion-capture (mocap) datasets have precise body
motions but are limited to a small number of actions. To address this, we
present BABEL, a large dataset with language labels describing the actions
being performed in mocap sequences. BABEL consists of action labels for about
43 hours of mocap sequences from AMASS. Action labels are at two levels of
abstraction -- sequence labels describe the overall action in the sequence, and
frame labels describe all actions in every frame of the sequence. Each frame
label is precisely aligned with the duration of the corresponding action in the
mocap sequence, and multiple actions can overlap. There are over 28k sequence
labels and 63k frame labels in BABEL, which belong to over 250 unique action
categories. Labels from BABEL can be leveraged for tasks like action
recognition, temporal action localization, motion synthesis, etc. To
demonstrate the value of BABEL as a benchmark, we evaluate the performance of
models on 3D action recognition. We demonstrate that BABEL poses interesting
learning challenges that are applicable to real-world scenarios, and can serve
as a useful benchmark of progress in 3D action recognition. The dataset,
baseline method, and evaluation code are made available and supported for
academic research purposes at https://babel.is.tue.mpg.de/.
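Because each frame label is aligned to a time span within the mocap sequence and multiple spans can overlap, a natural way to consume such annotations for 3D action recognition is to expand them into per-frame multi-label targets. The sketch below illustrates this idea; it is not the official BABEL loader, and field names such as frame_ann, labels, start_t, end_t, and act_cat are assumptions about the released JSON layout (see https://babel.is.tue.mpg.de/ for the actual schema and supported code).

```python
# Minimal sketch, not the official BABEL loader. The JSON field names used
# here (frame_ann, labels, start_t, end_t, act_cat) are assumptions; consult
# the dataset release at https://babel.is.tue.mpg.de/ for the real schema.
import json
from collections import defaultdict


def load_frame_segments(babel_json_path):
    """Collect (start_t, end_t, action_category) spans for each mocap sequence."""
    with open(babel_json_path) as f:
        babel = json.load(f)  # assumed: dict keyed by sequence id

    segments = defaultdict(list)
    for seq_id, ann in babel.items():
        frame_ann = ann.get("frame_ann")
        if not frame_ann:
            continue  # some sequences may carry only a sequence-level label
        for label in frame_ann["labels"]:
            for category in label["act_cat"]:
                segments[seq_id].append((label["start_t"], label["end_t"], category))
    return segments


def per_frame_multihot(spans, num_frames, fps, cat_to_idx):
    """Expand overlapping (start, end, category) spans into a T x C multi-hot
    target, reflecting that multiple BABEL actions can overlap in time."""
    target = [[0] * len(cat_to_idx) for _ in range(num_frames)]
    for start_t, end_t, category in spans:
        lo = max(0, int(start_t * fps))
        hi = min(num_frames, int(end_t * fps) + 1)
        for t in range(lo, hi):
            target[t][cat_to_idx[category]] = 1  # overlaps set multiple bits per frame
    return target
```

A sequence-level classifier could instead pool these per-frame targets (e.g., mark every category that occurs anywhere in the sequence) to obtain a multi-hot sequence label.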
Related papers
- Bayesian-guided Label Mapping for Visual Reprogramming [20.27639343292564]
One-to-one mappings may overlook the complex relationship between pretrained and downstream labels.
Motivated by this observation, we propose a Bayesian-guided Label Mapping (BLM) method.
Experiments conducted on both pretrained vision models (e.g., ResNeXt) and vision-language models (e.g., CLIP) demonstrate the superior performance of BLM over existing label mapping methods.
arXiv Detail & Related papers (2024-10-31T15:20:43Z)
- MixSup: Mixed-grained Supervision for Label-efficient LiDAR-based 3D Object Detection [59.1417156002086]
MixSup is a more practical paradigm simultaneously utilizing massive cheap coarse labels and a limited number of accurate labels for Mixed-grained Supervision.
MixSup achieves up to 97.31% of fully supervised performance, using cheap cluster annotations and only 10% box annotations.
arXiv Detail & Related papers (2024-01-29T17:05:19Z)
- LABELMAKER: Automatic Semantic Label Generation from RGB-D Trajectories [59.14011485494713]
This work introduces a fully automated 2D/3D labeling framework that can generate labels for RGB-D scans at an equal (or better) level of accuracy than manual annotation.
We demonstrate the effectiveness of our LabelMaker pipeline by generating significantly better labels for the ScanNet datasets and automatically labelling the previously unlabeled ARKitScenes dataset.
arXiv Detail & Related papers (2023-11-20T20:40:24Z)
- Unleashing the Power of Shared Label Structures for Human Activity Recognition [36.66107380956779]
We propose SHARE, a framework that takes into account shared structures of label names for different activities.
To exploit the shared structures, SHARE comprises an encoder for extracting features from input sensory time series and a decoder for generating label names as a token sequence.
We also propose three label augmentation techniques to help the model more effectively capture semantic structures across activities.
arXiv Detail & Related papers (2023-01-01T22:50:08Z)
- An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition [18.937012620464465]
We address the challenge of training multi-label action recognition models from only single positive training labels.
We propose two approaches that are based on generating pseudo training examples sampled from similar instances within the train set.
We create a new evaluation benchmark by manually annotating a subset of EPIC-Kitchens-100's validation set with multiple verb labels.
arXiv Detail & Related papers (2022-10-10T18:06:43Z)
- TEACH: Temporal Action Composition for 3D Humans [50.97135662063117]
Given a series of natural language descriptions, our task is to generate 3D human motions that correspond semantically to the text.
In particular, our goal is to enable the synthesis of a series of actions, which we refer to as temporal action composition.
arXiv Detail & Related papers (2022-09-09T00:33:40Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
- Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping [55.72376663488104]
We present an autoencoder-based approach to classify perceived human emotions from walking styles obtained from videos or motion-captured data.
Given the motion on each joint in the pose at each time step extracted from 3D pose sequences, we hierarchically pool these joint motions in the encoder.
We train the decoder to reconstruct the motions per joint per time step in a top-down manner from the latent embeddings.
arXiv Detail & Related papers (2019-11-20T05:04:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.