Temporally Precise Action Spotting in Soccer Videos Using Dense Detection Anchors
- URL: http://arxiv.org/abs/2205.10450v1
- Date: Fri, 20 May 2022 22:14:02 GMT
- Title: Temporally Precise Action Spotting in Soccer Videos Using Dense Detection Anchors
- Authors: João V. B. Soares, Avijit Shah, Topojoy Biswas
- Abstract summary: We present a model for temporally precise action spotting in videos, which uses a dense set of detection anchors, predicting a detection confidence and corresponding fine-grained temporal displacement for each anchor.
We achieve a new state-of-the-art on SoccerNet-v2, the largest soccer video dataset of its kind, with marked improvements in temporal localization.
- Score: 1.6114012813668934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a model for temporally precise action spotting in videos, which
uses a dense set of detection anchors, predicting a detection confidence and
corresponding fine-grained temporal displacement for each anchor. We experiment
with two trunk architectures, both of which are able to incorporate large
temporal contexts while preserving the smaller-scale features required for
precise localization: a one-dimensional version of a u-net, and a Transformer
encoder (TE). We also suggest best practices for training models of this kind,
by applying Sharpness-Aware Minimization (SAM) and mixup data augmentation. We
achieve a new state-of-the-art on SoccerNet-v2, the largest soccer video
dataset of its kind, with marked improvements in temporal localization.
Additionally, our ablations show: the importance of predicting the temporal
displacements; the trade-offs between the u-net and TE trunks; and the benefits
of training with SAM and mixup.
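To make the dense-anchor formulation concrete, here is a minimal sketch in PyTorch. It is our illustration, not the authors' released code: the `AnchorHead` name, the trunk interface, and the tensor shapes are assumptions.

```python
# Hedged sketch of a dense-anchor action-spotting head (not the authors' code).
# Assumes a trunk (e.g., a 1D u-net or Transformer encoder) that outputs one
# feature vector per temporal position; each position holds one anchor per class.
import torch
import torch.nn as nn

class AnchorHead(nn.Module):
    """Predicts, for every anchor and class, a detection confidence and a
    fine-grained temporal displacement (how far the event is from the anchor)."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.conf = nn.Linear(dim, num_classes)  # confidence logits
        self.disp = nn.Linear(dim, num_classes)  # displacement regression

    def forward(self, feats: torch.Tensor):
        # feats: (batch, T, dim) trunk output over T temporal positions
        return self.conf(feats), self.disp(feats)  # each (batch, T, num_classes)

# At inference, an anchor at position t that fires for class c places the
# detection at t + displacement[..., t, c], which is what lets a dense but
# coarse anchor grid produce temporally precise spots.
```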
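The suggested training practices can also be sketched generically. The snippet below shows mixup and a two-step SAM update; `alpha`, `rho`, and the soft-target assumption are illustrative defaults, not the paper's exact recipe.

```python
# Hedged sketch of mixup and Sharpness-Aware Minimization (SAM) in PyTorch.
# Generic recipe only; hyperparameters are placeholders.
import torch

def mixup(x, y, alpha=0.2):
    """Blend pairs of examples and their targets with a Beta-sampled weight.
    Assumes y holds float (soft or one-hot) targets."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One SAM update: perturb the weights toward higher loss, take the
    gradient there, then apply that gradient from the original weights."""
    loss_fn(model(x), y).backward()  # gradient at the current weights
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            e = rho * p.grad / (grad_norm + 1e-12) if p.grad is not None else None
            if e is not None:
                p.add_(e)  # climb toward the nearby sharp point
            eps.append(e)
    model.zero_grad()
    loss_fn(model(x), y).backward()  # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)  # restore the original weights
    optimizer.step()
    optimizer.zero_grad()
```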
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters (a generic balancing sketch follows this entry).
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
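The even distribution across clusters mentioned above is commonly implemented with Sinkhorn-Knopp normalization. The sketch below is our generic illustration of that balancing step, not the SIGMA implementation.

```python
# Hedged sketch: Sinkhorn-Knopp iterations that balance a soft assignment of
# N features to K clusters (illustrative only; not the SIGMA code).
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 3, eps: float = 0.05):
    # scores: (N, K) similarities between features and cluster prototypes
    q = torch.exp(scores / eps)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=0, keepdim=True)  # normalize columns ...
        q /= k                           # ... so every cluster holds equal mass
        q /= q.sum(dim=1, keepdim=True)  # normalize rows ...
        q /= n                           # ... so every feature has unit mass
    return q * n  # rows sum to 1: a balanced soft cluster assignment
```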
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Multi-step Temporal Modeling for UAV Tracking [14.687636301587045]
We introduce MT-Track, a streamlined and efficient multi-step temporal modeling framework for enhanced UAV tracking.
We unveil a unique temporal correlation module that dynamically assesses the interplay between the template and search region features.
We propose a mutual transformer module to refine the correlation maps of historical and current frames by modeling the temporal knowledge in the tracking sequence.
arXiv Detail & Related papers (2024-03-07T09:48:13Z)
- Distillation Enhanced Time Series Forecasting Network with Momentum Contrastive Learning [7.4106801792345705]
We propose DE-TSMCL, an innovative distillation enhanced framework for long sequence time series forecasting.
Specifically, we design a learnable data augmentation mechanism which adaptively learns whether to mask a timestamp.
Then, we propose a contrastive learning task with momentum update to explore inter-sample and intra-temporal correlations of time series (see the momentum-update sketch after this entry).
By developing model loss from multiple tasks, we can learn effective representations for downstream forecasting task.
arXiv Detail & Related papers (2024-01-31T12:52:10Z)
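The "momentum update" in the summary above usually refers to an exponential moving average of one encoder into another. The snippet below is a generic sketch of that mechanism, not the DE-TSMCL code.

```python
# Hedged sketch of a momentum (EMA) target-encoder update, the mechanism
# behind momentum contrastive learning (illustrative; not DE-TSMCL's code).
import copy
import torch

@torch.no_grad()
def momentum_update(online: torch.nn.Module, target: torch.nn.Module, m: float = 0.99):
    """Blend the online encoder's weights into the slow-moving target encoder."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(m).add_((1.0 - m) * p_o)

# Usage: target = copy.deepcopy(online); after each optimizer step on the
# online encoder, call momentum_update(online, target).
```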
- Unsupervised Continual Semantic Adaptation through Neural Rendering [32.099350613956716]
We study continual multi-scene adaptation for the task of semantic segmentation.
We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model.
We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
arXiv Detail & Related papers (2022-11-25T09:31:41Z)
- Spotting Temporally Precise, Fine-Grained Events in Video [23.731838969934206]
We introduce the task of spotting temporally precise, fine-grained events in video.
Models must reason globally about the full time-scale of actions and locally to identify subtle frame-to-frame appearance and motion differences.
We propose E2E-Spot, a compact, end-to-end model that performs well on the precise spotting task and can be trained quickly on a single GPU.
arXiv Detail & Related papers (2022-07-20T22:15:07Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the dual-encoder approach, which independently maps text and vision to a joint embedding space, is attractive because retrieval scales well (see the sketch after this entry).
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
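The scaling argument for dual encoders is easy to see in code: gallery embeddings are computed once offline, and each query reduces to a dot product. The sketch below is our illustration of that setup, with hypothetical names.

```python
# Hedged sketch of dual-encoder retrieval (illustrative only).
# Text and vision are embedded independently, so the video gallery can be
# embedded ahead of time and search is a single matrix-vector product.
import torch
import torch.nn.functional as F

def retrieve(text_emb: torch.Tensor, video_embs: torch.Tensor, k: int = 10):
    # text_emb: (D,) query embedding; video_embs: (N, D) precomputed gallery
    sims = F.normalize(video_embs, dim=-1) @ F.normalize(text_emb, dim=-1)
    return sims.topk(k)  # scores and indices of the k nearest videos

# A cross-attention vision-text transformer instead scores each (query, video)
# pair jointly: more accurate, but nothing can be precomputed, so query cost
# grows with the size of the collection.
```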
- RMS-Net: Regression and Masking for Soccer Event Spotting [52.742046866220484]
We devise a lightweight and modular network for action spotting, which can simultaneously predict the event label and its temporal offset.
When tested on the SoccerNet dataset and using standard features, our full proposal exceeds the current state of the art by 3 Average-mAP points.
arXiv Detail & Related papers (2021-02-15T16:04:18Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)