GTM: Gray Temporal Model for Video Recognition
- URL: http://arxiv.org/abs/2110.10348v1
- Date: Wed, 20 Oct 2021 02:45:48 GMT
- Title: GTM: Gray Temporal Model for Video Recognition
- Authors: Yanping Zhang, Yongxin Yu
- Abstract summary: We propose a new input modality, the gray stream, which not only skips the conversion process from video to RGB but also improves the spatio-temporal modeling ability.
We also propose a 1D Identity Channel-wise Spatio-temporal Convolution (1D-ICSC), which captures the temporal relationship at the channel-feature level within a controllable computation budget.
- Score: 2.534039616389072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data input modality plays an important role in video action recognition.
Normally, there are three types of input: RGB, flow stream and compressed data.
In this paper, we propose a new input modality: the gray stream. Specifically,
taking three stacked consecutive gray images as input, which has the same size
as an RGB frame, not only skips the conversion from decoded video data to RGB,
but also improves the spatio-temporal modeling ability at zero extra computation
and zero extra parameters. Meanwhile, we propose a 1D Identity Channel-wise
Spatio-temporal Convolution (1D-ICSC), which captures the temporal relationship
at the channel-feature level within a controllable computation budget (set by
the parameters G and R). Finally, we confirm its effectiveness and efficiency on
several action recognition benchmarks, including Kinetics, Something-Something,
HMDB-51 and UCF-101, and achieve strong results.
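As a rough illustration of the gray-stream idea, the sketch below (PyTorch; not taken from the paper) stacks three consecutive grayscale frames along the channel axis so the result has the same shape as a single RGB frame and can be fed to an unmodified 2D backbone. The frame sampling, normalization, and the function name gray_stream are assumptions for illustration only.

```python
# Minimal sketch of the gray-stream input described in the abstract:
# three consecutive grayscale (luma) frames are stacked along the channel
# axis so each clip has the same shape as a single RGB frame. The exact
# sampling and preprocessing in the paper may differ; this is an assumption.
import torch

def gray_stream(luma_frames: torch.Tensor) -> torch.Tensor:
    """Convert T grayscale frames (T, H, W) into T-2 gray-stream inputs of
    shape (T-2, 3, H, W) by stacking each frame with its two successors."""
    t = luma_frames.shape[0]
    clips = [luma_frames[i:i + 3] for i in range(t - 2)]   # each (3, H, W)
    return torch.stack(clips, dim=0)                        # (T-2, 3, H, W)

# Example: 8 luma frames of a 224x224 video
frames = torch.rand(8, 224, 224)
x = gray_stream(frames)
print(x.shape)  # torch.Size([6, 3, 224, 224]) -- same per-clip shape as RGB
```

Because many video decoders already produce a luma plane, this construction can reuse it directly instead of first converting to RGB, which is where the "zero extra computation" claim comes from.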
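The exact 1D-ICSC design is not spelled out here, so the following is only a hedged sketch of one plausible reading: an identity (residual) path plus a grouped 1D convolution along the temporal axis applied at the channel-feature level, where a group count G and a channel-reduction ratio R bound the added parameters and FLOPs. The class name TemporalChannelConv and the reduce/expand structure are assumptions, not the authors' module.

```python
import torch
import torch.nn as nn

class TemporalChannelConv(nn.Module):
    """Hypothetical 1D-ICSC-style block (not the authors' exact design):
    an identity path plus a grouped 1D convolution over the temporal axis,
    applied per channel feature. G (groups) and R (channel reduction ratio)
    bound the extra parameters and computation."""
    def __init__(self, channels: int, G: int = 8, R: int = 4, kernel_size: int = 3):
        super().__init__()
        reduced = max(channels // R, G)  # reduced channels; must be divisible by G
        self.reduce = nn.Conv1d(channels, reduced, kernel_size=1)
        self.temporal = nn.Conv1d(reduced, reduced, kernel_size,
                                  padding=kernel_size // 2, groups=G)
        self.expand = nn.Conv1d(reduced, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C, H, W) -- a batch of T-frame feature maps
        n, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)  # convolve over time
        y = self.expand(self.temporal(self.reduce(y)))
        y = y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)
        return x + y  # identity path leaves the spatial features untouched

x = torch.rand(2, 8, 64, 14, 14)
print(TemporalChannelConv(64)(x).shape)  # torch.Size([2, 8, 64, 14, 14])
```

Under this reading, increasing G or R shrinks the temporal branch and hence the budget, while the residual connection keeps the block cheap to drop into an existing 2D backbone; the paper's actual trade-off may differ.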
Related papers
- ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection dataset, ViDSOD-100, which contains 100 videos with a total of 9,362 frames.
All frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net) for RGB-D salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z) - Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video
Recognition [43.52320791818535]
We propose a novel RGB-Event based recognition framework termed TSCFormer.
We mainly adopt the CNN as the backbone network to first encode both RGB and Event data.
It captures the global long-range relations well between both modalities and maintains the simplicity of the whole model architecture.
arXiv Detail & Related papers (2023-12-18T11:58:03Z) - Fine-Grained Action Detection with RGB and Pose Information using Two
Stream Convolutional Networks [1.4502611532302039]
We propose a two-stream network approach for the classification and detection of table tennis strokes.
Our method utilizes raw RGB data and pose information computed with the MMPose toolbox.
We report an improvement in stroke classification, reaching 87.3% accuracy, while detection does not outperform the baseline but still reaches an IoU of 0.349 and an mAP of 0.110.
arXiv Detail & Related papers (2023-02-06T13:05:55Z) - Scalable Neural Video Representations with Learnable Positional Features [73.51591757726493]
We show how to train neural representations with learnable positional features (NVP) that effectively amortize a video as latent codes.
We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also improves encoding quality from 34.07 to 34.57 (measured with the PSNR metric).
arXiv Detail & Related papers (2022-10-13T08:15:08Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Learning from Temporal Gradient for Semi-supervised Action Recognition [15.45239134477737]
We introduce temporal gradient as an additional modality for more attentive feature extraction.
Our method achieves the state-of-the-art performance on three video action recognition benchmarks.
arXiv Detail & Related papers (2021-11-25T20:30:30Z) - MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters will be used to conduct a dynamic convolutional operation on their corresponding input feature maps respectively.
To address issues caused by heavy occlusion, fast motion, and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
arXiv Detail & Related papers (2021-07-22T03:10:51Z) - Bi-directional Cross-Modality Feature Propagation with
Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z) - Motion Representation Using Residual Frames with 3D CNN [43.002621928500425]
We propose a fast but effective way to extract motion features from videos utilizing residual frames as the input data in 3D ConvNets.
By replacing traditional stacked RGB frames with residual frames, improvements of 35.6 and 26.6 percentage points in top-1 accuracy can be obtained.
arXiv Detail & Related papers (2020-06-21T07:35:41Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2 times faster inference, with a model requiring less than 5 MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.