An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video
- URL: http://arxiv.org/abs/2404.06741v4
- Date: Fri, 11 Oct 2024 11:44:49 GMT
- Title: An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video
- Authors: Xingyu Song, Zhan Li, Shi Chen, Xin-Qiang Cai, Kazuyuki Demachi,
- Abstract summary: Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications.
CNNs suffer performance declines when trained with discontinuous video frames, which is a frequent scenario in real-world settings.
To overcome this issue, we introduce the 4A pipeline, which employs a series of sophisticated techniques.
- Score: 11.293897932762809
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Action recognition, an essential component of computer vision, plays a pivotal role in multiple applications. Despite significant improvements brought by Convolutional Neural Networks (CNNs), these models suffer performance declines when trained with discontinuous video frames, which is a frequent scenario in real-world settings. This decline primarily results from the loss of temporal continuity, which is crucial for understanding the semantics of human actions. To overcome this issue, we introduce the 4A (Action Animation-based Augmentation Approach) pipeline, which employs a series of sophisticated techniques: starting with 2D human pose estimation from RGB videos, followed by Quaternion-based Graph Convolution Network for joint orientation and trajectory prediction, and Dynamic Skeletal Interpolation for creating smoother, diversified actions using game engine technology. This innovative approach generates realistic animations in varied game environments, viewed from multiple viewpoints. In this way, our method effectively bridges the domain gap between virtual and real-world data. In experimental evaluations, the 4A pipeline achieves comparable or even superior performance to traditional training approaches using real-world data, while requiring only 10% of the original data volume. Additionally, our approach demonstrates enhanced performance on In-the-wild videos, marking a significant advancement in the field of action recognition.
Related papers
- Masked Modeling for Human Motion Recovery Under Occlusions [21.05382087890133]
MoRo is an end-to-end generative framework that formulates motion reconstruction as a video-conditioned task.<n>MoRo achieves real-time inference at 70 FPS on a single H200 GPU.
arXiv Detail & Related papers (2026-01-22T16:22:20Z) - mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs [5.109732854501585]
We introduce mimic-video, a novel Video-Action Model (VAM) that pairs a pretrained Internet-scale video model with a flow matching-based action decoder conditioned on its latent representations.<n>Our approach achieves state-of-the-art performance on simulated and real-world robotic manipulation tasks, improving sample efficiency by 10x and convergence speed by 2x compared to traditional VLA architectures.
arXiv Detail & Related papers (2025-12-17T18:47:31Z) - DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation [52.136378691610524]
We present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features.<n>By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation.<n>We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
arXiv Detail & Related papers (2025-10-28T10:17:11Z) - EMMA: Generalizing Real-World Robot Manipulation via Generative Visual Transfer [35.27100635173712]
Vision-language-action (VLA) models increasingly rely on diverse training data to achieve robust generalization.<n>We propose Embodied Manipulation Media Adaptation (EMMA), a VLA policy enhancement framework that integrates a generative data engine with an effective training pipeline.<n>DreamTransfer enables text-controlled visual editing of robot videos, transforming foreground, background, and lighting conditions without compromising 3D structure or geometrical plausibility.<n>AdaMix is a hard-sample-aware training strategy that dynamically reweights training batches to focus optimization on perceptually or kinematically challenging samples.
arXiv Detail & Related papers (2025-09-26T14:34:44Z) - Precise Action-to-Video Generation Through Visual Action Prompts [62.951609704196485]
Action-driven video generation faces a precision-generality trade-off.<n>Agent-centric action signals provide precision at the cost of cross-domain transferability.<n>We "render" actions into precise visual prompts as domain-agnostic representations.
arXiv Detail & Related papers (2025-08-18T17:12:28Z) - Puppeteer: Rig and Animate Your 3D Models [105.11046762553121]
Puppeteer is a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects.<n>Our system first predicts plausible skeletal structures via an auto-regressive transformer.<n>It then infers skinning weights via an attention-based architecture.
arXiv Detail & Related papers (2025-08-14T17:59:31Z) - 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration [31.111439909825627]
Existing methods typically model the dataset's action distribution using simple observations as inputs.<n>We propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to these sources of chaos.<n>Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
arXiv Detail & Related papers (2025-06-27T14:09:29Z) - An Efficient 3D Convolutional Neural Network with Channel-wise, Spatial-grouped, and Temporal Convolutions [3.798710743290466]
We introduce a simple and very efficient 3D convolutional neural network for video action recognition.
We evaluate the performance and efficiency of our proposed network on several video action recognition datasets.
arXiv Detail & Related papers (2025-03-02T08:47:06Z) - Online hand gesture recognition using Continual Graph Transformers [1.3927943269211591]
We propose a novel online recognition system designed for real-time skeleton sequence streaming.
Our approach achieves state-of-the-art accuracy and significantly reduces false positive rates, making it a compelling solution for real-time applications.
The proposed system can be seamlessly integrated into various domains, including human-robot collaboration and assistive technologies.
arXiv Detail & Related papers (2025-02-20T17:27:55Z) - Diff-IP2D: Diffusion-Based Hand-Object Interaction Prediction on Egocentric Videos [22.81433371521832]
We propose Diff-IP2D to forecast future hand trajectories and object affordances concurrently in an iterative non-autoregressive manner.
Our method significantly outperforms the state-of-the-art baselines on both the off-the-shelf metrics and our newly proposed evaluation protocol.
arXiv Detail & Related papers (2024-05-07T14:51:05Z) - Video Action Recognition Collaborative Learning with Dynamics via
PSO-ConvNet Transformer [1.876462046907555]
We propose a novel PSO-ConvNet model for learning actions in videos.
Our experimental results on the UCF-101 dataset demonstrate substantial improvements of up to 9% in accuracy.
Overall, our dynamic PSO-ConvNet model provides a promising direction for improving Human Action Recognition.
arXiv Detail & Related papers (2023-02-17T23:39:34Z) - Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth
Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly.
Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation.
Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z) - Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets, Kinetics-400, Kinetics-700, Moments-in-Time, Activitynet and Something-something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action
Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the
Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - Domain Adaptive Robotic Gesture Recognition with Unsupervised
Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos, and inherent correlations in multi-modal towards recognizing gesture.
Results show that our approach recovers the performance with great improvement gains, up to 12.91% in ACC and 20.16% in F1score without using any annotations in real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z) - Complex Human Action Recognition in Live Videos Using Hybrid FR-DL
Method [1.027974860479791]
We address challenges of the preprocessing phase, by an automated selection of representative frames among the input sequences.
We propose a hybrid technique using background subtraction and HOG, followed by application of a deep neural network and skeletal modelling method.
We name our model as Feature Reduction & Deep Learning based action recognition method, or FR-DL in short.
arXiv Detail & Related papers (2020-07-06T15:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.