Three-Stream 3D/1D CNN for Fine-Grained Action Classification and
Segmentation in Table Tennis
- URL: http://arxiv.org/abs/2109.14306v1
- Date: Wed, 29 Sep 2021 09:43:21 GMT
- Title: Three-Stream 3D/1D CNN for Fine-Grained Action Classification and
Segmentation in Table Tennis
- Authors: Pierre-Etienne Martin (MPI-EVA), Jenny Benois-Pineau (UB), Renaud
Péteri (MIA), Julien Morlier (UB)
- Abstract summary: It is applied to the TTStroke-21 dataset, which consists of untrimmed videos of table tennis games.
The goal is to detect and classify table tennis strokes in the videos, the first step of a bigger scheme.
The pose is also investigated in order to offer richer feedback to the athletes.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a fusion method of modalities extracted from video
through a three-stream network with spatio-temporal and temporal convolutions
for fine-grained action classification in sport. It is applied to TTStroke-21
dataset which consists of untrimmed videos of table tennis games. The goal is
to detect and classify table tennis strokes in the videos, the first step of a
bigger scheme aiming at giving feedback to the players for improving their
performance. The three modalities are raw RGB data, the computed optical flow
and the estimated pose of the player. The network consists of three branches
with attention blocks. Features are fused at the latest stage of the network
using bilinear layers. Compared to previous approaches, the use of three
modalities allows faster convergence and better performance on both tasks:
classification of strokes with known temporal boundaries and joint segmentation
and classification. The pose is also further investigated in order to offer
richer feedback to the athletes.
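The late-fusion step described in the abstract can be sketched as pairwise bilinear products of branch features. This is a minimal numpy illustration, not the paper's network: the feature dimensions, weight tensors, and pairing of modalities are assumptions for demonstration only.

```python
import numpy as np

def bilinear_fuse(x, y, W):
    """Bilinear layer: z_k = x^T W_k y for each output channel k.
    W has shape (out_dim, dim_x, dim_y)."""
    return np.einsum("i,kij,j->k", x, W, y)

rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal(8)    # features from the RGB branch
flow_feat = rng.standard_normal(8)   # features from the optical-flow branch
pose_feat = rng.standard_normal(8)   # features from the pose branch

W1 = rng.standard_normal((4, 8, 8))  # illustrative weights
W2 = rng.standard_normal((4, 8, 8))

# Fuse modalities pairwise, then concatenate for a classifier head.
fused = np.concatenate([
    bilinear_fuse(rgb_feat, flow_feat, W1),
    bilinear_fuse(rgb_feat, pose_feat, W2),
])
# fused.shape == (8,)
```

In a trained network the weight tensors would be learned parameters and the branch features would come from the 3D/1D convolutional streams with attention blocks.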
Related papers
- Semi-supervised 3D Video Information Retrieval with Deep Neural Network
and Bi-directional Dynamic-time Warping Algorithm [14.39527406033429]
The proposed algorithm is designed to handle large video datasets and retrieve the most related videos to a given inquiry video clip.
We split both the candidate and the inquiry videos into a sequence of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network.
We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method.
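The similarity step above can be sketched with classic dynamic time warping over two sequences of clip embeddings. This is an illustrative single-direction DTW with Euclidean frame distance, not the paper's bi-directional variant, and the embedding shapes are assumptions.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of embedding
    vectors, using Euclidean distance between frames."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

rng = np.random.default_rng(0)
query_clips = rng.standard_normal((5, 16))      # 5 clip embeddings
candidate_clips = rng.standard_normal((7, 16))  # 7 clip embeddings
score = dtw_distance(query_clips, candidate_clips)
```

A lower score indicates a better temporal alignment between the query and candidate clip sequences; identical sequences score zero.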
arXiv Detail & Related papers (2023-09-03T03:10:18Z)
- SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and
Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences.
It combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping.
Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z)
- Towards Active Learning for Action Spotting in Association Football
Videos [59.84375958757395]
Analyzing football videos is challenging and requires identifying subtle and diverse temporal patterns.
Current algorithms face significant challenges when learning from limited annotated data.
We propose an active learning framework that selects the most informative video samples to be annotated next.
arXiv Detail & Related papers (2023-04-09T11:50:41Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for
Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
- 3D Convolutional Networks for Action Recognition: Application to Sport
Gesture Recognition [0.0]
We are interested in the classification of continuous video takes with repeatable actions, such as strokes of table tennis.
The 3D convnets are an efficient tool for solving these problems with window-based approaches.
arXiv Detail & Related papers (2022-04-13T13:21:07Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Weakly-Supervised Action Localization and Action Recognition using
Global-Local Attention of 3D CNN [4.924442315857227]
3D Convolutional Neural Network (3D CNN) captures spatial and temporal information on 3D data such as video sequences.
We propose two approaches to improve the visual explanations and classification in 3D CNN.
arXiv Detail & Related papers (2020-12-17T12:29:16Z)
- 3D attention mechanism for fine-grained classification of table tennis
strokes using a Twin Spatio-Temporal Convolutional Neural Networks [1.181206257787103]
The paper addresses the problem of recognition of actions in video with low inter-class variability such as Table Tennis strokes.
Two stream, "twin" convolutional neural networks are used with 3D convolutions both on RGB data and optical flow.
We introduce 3D attention modules and examine their impact on classification efficiency.
arXiv Detail & Related papers (2020-11-20T09:55:12Z)
- Hybrid Dynamic-static Context-aware Attention Network for Action
Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We not only learn the video's dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z)
- Dynamic Inference: A New Approach Toward Efficient Video Action
Recognition [69.9658249941149]
Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost.
We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos.
arXiv Detail & Related papers (2020-02-09T11:09:56Z)
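The early-exit idea behind dynamic inference can be sketched as follows: process a clip frame by frame and stop as soon as the running prediction is confident enough. The threshold, logit shapes, and stopping rule here are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_inference(frame_logits, threshold=0.9):
    """Early-exit sketch: accumulate per-frame logits and stop once
    the averaged prediction exceeds a confidence threshold.
    Returns (predicted class, number of frames used)."""
    running = np.zeros_like(frame_logits[0], dtype=float)
    for t, logits in enumerate(frame_logits, start=1):
        running += logits
        probs = softmax(running / t)
        if probs.max() >= threshold:
            return int(probs.argmax()), t
    return int(probs.argmax()), t  # fallback: all frames consumed

# A clip whose frames strongly agree should exit after very few frames.
easy_clip = [np.array([4.0, 0.0, 0.0]) for _ in range(8)]
label, used = dynamic_inference(easy_clip)
```

Easily distinguishable videos thus consume less computation, while ambiguous ones fall back to processing the full clip.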
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.