Fine-Grained Action Detection with RGB and Pose Information using Two
Stream Convolutional Networks
- URL: http://arxiv.org/abs/2302.02755v1
- Date: Mon, 6 Feb 2023 13:05:55 GMT
- Title: Fine-Grained Action Detection with RGB and Pose Information using Two
Stream Convolutional Networks
- Authors: Leonard Hacker and Finn Bartels and Pierre-Etienne Martin
- Abstract summary: We propose a two-stream network approach for the classification and detection of table tennis strokes.
Our method utilizes raw RGB data and pose information computed with the MMPose toolbox.
We report an improvement in stroke classification, reaching 87.3% accuracy, while the detection does not outperform the baseline but still reaches an IoU of 0.349 and a mAP of 0.110.
- Score: 1.4502611532302039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As participants in the MediaEval 2022 Sport Task, we propose a two-stream
network approach for the classification and detection of table tennis strokes.
Each stream is a succession of 3D Convolutional Neural Network (CNN) blocks
using attention mechanisms, and each stream processes a different 4D input. Our
method utilizes raw RGB data and pose information computed with the MMPose
toolbox. The pose information is treated as an image by drawing the pose either
on a black background or on the original RGB frame from which it was computed.
The best performance is obtained by feeding raw RGB data to one stream, Pose +
RGB (PRGB) information to the other stream, and applying late fusion on the
features. The approaches were evaluated on the provided TTStroke-21 dataset.
We report an improvement in stroke classification, reaching 87.3% accuracy,
while the detection does not outperform the baseline but still reaches an IoU
of 0.349 and a mAP of 0.110.
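The abstract fixes the overall pipeline but not the layer-level details. Below is a minimal PyTorch sketch of the two-stream idea: two independent stacks of 3D-CNN blocks with channel attention, fused late at the feature level. All names (StreamBlock, TwoStreamNet), layer widths, clip sizes, class count, and the squeeze-and-excitation-style attention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class StreamBlock(nn.Module):
    """One 3D-CNN block with a simple channel-attention gate (assumption:
    the abstract only states that attention mechanisms are used)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(out_ch)
        self.pool = nn.MaxPool3d(2)
        self.att = nn.Sequential(           # squeeze-and-excitation style
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(out_ch, out_ch, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.pool(torch.relu(self.bn(self.conv(x))))
        return x * self.att(x)              # reweight channels

class TwoStreamNet(nn.Module):
    """Two independent 3D-CNN stacks, fused late at the feature level."""
    def __init__(self, num_classes):
        super().__init__()
        def stream():
            return nn.Sequential(
                StreamBlock(3, 16), StreamBlock(16, 32), StreamBlock(32, 64),
                nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.rgb_stream = stream()   # input: raw RGB clip
        self.prgb_stream = stream()  # input: pose drawn over the RGB clip
        self.head = nn.Linear(64 + 64, num_classes)

    def forward(self, rgb, prgb):
        # Late fusion: concatenate per-stream features, then classify.
        feats = torch.cat([self.rgb_stream(rgb), self.prgb_stream(prgb)], dim=1)
        return self.head(feats)

# Each 4D input is a (channels, time, height, width) clip, batched to 5D;
# clip size and class count below are placeholders.
model = TwoStreamNet(num_classes=20)
rgb = torch.randn(2, 3, 16, 112, 112)   # raw RGB clips
prgb = torch.randn(2, 3, 16, 112, 112)  # same clips with the pose overlaid
logits = model(rgb, prgb)               # shape: (2, 20)
```

Late fusion here means the two modalities stay separate until their pooled feature vectors are concatenated, so each stream keeps its own spatio-temporal representation up to the final classifier.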
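The detection scores quoted above compare predicted stroke segments against ground-truth segments by temporal IoU; the task's exact evaluation protocol is not restated in the abstract, so the sketch below is just the standard segment-level definition.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) frame pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a stroke predicted at frames 100-180 vs. ground truth at 120-200.
print(temporal_iou((100, 180), (120, 200)))  # 0.6
```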
Related papers
- ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection dataset, ViDSOD-100, which contains 100 videos with a total of 9,362 frames.
All the frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z) - NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows [60.291277312569285]
We present a method for automatically modifying a NeRF representation based on a single observation.
Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations (a generic formulation is sketched after this list).
We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation.
arXiv Detail & Related papers (2024-06-15T07:58:08Z) - SSTFormer: Bridging Spiking Neural Network and Memory Support
Transformer for Frame-Event based Recognition [42.118434116034194]
We propose to recognize patterns by fusing RGB frames and event streams simultaneously.
Due to the scarcity of RGB-Event classification datasets, we also propose a large-scale dataset, PokerEvent.
arXiv Detail & Related papers (2023-08-08T16:15:35Z) - Attentive Multimodal Fusion for Optical and Scene Flow [24.08052492109655]
Existing methods typically rely solely on RGB images or fuse the modalities at later stages.
We propose a novel deep neural network approach named FusionRAFT, which enables early-stage information fusion between sensor modalities.
Our approach exhibits improved robustness in the presence of noise and low-lighting conditions that affect the RGB images.
arXiv Detail & Related papers (2023-07-28T04:36:07Z) - Detecting Humans in RGB-D Data with CNNs [14.283154024458739]
We propose a novel fusion approach based on the characteristics of depth images.
We also present a new depth-encoding scheme, which not only encodes depth images into three channels but also enhances the information for classification.
arXiv Detail & Related papers (2022-07-17T03:17:09Z) - TAFNet: A Three-Stream Adaptive Fusion Network for RGB-T Crowd Counting [16.336401175470197]
We propose a three-stream adaptive fusion network named TAFNet, which uses paired RGB and thermal images for crowd counting.
Experiment results on the RGBT-CC dataset show that our method achieves more than 20% improvement on mean absolute error.
arXiv Detail & Related papers (2022-02-17T08:43:10Z) - Boosting RGB-D Saliency Detection by Leveraging Unlabeled RGB Images [89.81919625224103]
Training deep models for RGB-D salient object detection (SOD) often requires a large number of labeled RGB-D images.
We present a Dual-Semi RGB-D Salient Object Detection Network (DS-Net) to leverage unlabeled RGB images for boosting RGB-D saliency detection.
arXiv Detail & Related papers (2022-01-01T03:02:27Z) - VPFNet: Improving 3D Object Detection with Virtual Point based LiDAR and
Stereo Data Fusion [62.24001258298076]
VPFNet is a new architecture that cleverly aligns and aggregates the point cloud and image data at the 'virtual' points.
Our VPFNet achieves 83.21% moderate 3D AP and 91.86% moderate BEV AP on the KITTI test set, ranking 1st since May 21st, 2021.
arXiv Detail & Related papers (2021-11-29T08:51:20Z) - GTM: Gray Temporal Model for Video Recognition [2.534039616389072]
We propose a new input modality, the gray stream, which not only skips the conversion from raw video to RGB but also improves channel-temporal modeling ability.
We also propose a 1D Identity-wise Spatio-temporal Convolution (1D-ICSC) which captures the temporal relationship at the channel-feature level within a controlled computation budget.
arXiv Detail & Related papers (2021-10-20T02:45:48Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: cross-modal auto-encoding and depth-contour estimation.
Our pretext tasks require only a small number of unlabeled RGB-D datasets for pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images, providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as one of cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately (a generic recalibration gate is sketched after this list).
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
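For the NeRFDeformer entry above, "weighted linear blending of rigid transformations" admits a standard closed form; the paper's exact parameterization may differ, so read this as the generic linear-blend formulation with K rigid parts:

$$
T(\mathbf{x}) \;=\; \sum_{i=1}^{K} w_i(\mathbf{x})\,\bigl(\mathbf{R}_i\,\mathbf{x} + \mathbf{t}_i\bigr),
\qquad \sum_{i=1}^{K} w_i(\mathbf{x}) = 1,
$$

where each $(\mathbf{R}_i, \mathbf{t}_i)$ is a rigid rotation-translation pair and the $w_i(\mathbf{x})$ are spatially varying blend weights, so the 3D flow at a point is a convex combination of rigid motions.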
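For the bi-directional cross-modality feature propagation entry, here is a minimal sketch of what "recalibrating RGB feature responses with depth guidance" can look like; this is a generic gated recalibration under an assumed name (CrossModalGate), not the paper's Separation-and-Aggregation Gate.

```python
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    """Reweight RGB features with a gate computed from both modalities."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        # Gate in [0, 1] per channel and position, computed from both inputs.
        g = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))
        # Gated blend: depth-guided recalibration of the RGB responses.
        return rgb_feat * g + depth_feat * (1 - g)

rgb = torch.randn(1, 64, 32, 32)
depth = torch.randn(1, 64, 32, 32)
fused = CrossModalGate(64)(rgb, depth)  # shape: (1, 64, 32, 32)
```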
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.