Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges
- URL: http://arxiv.org/abs/2505.03991v3
- Date: Fri, 10 Oct 2025 00:32:25 GMT
- Title: Deep Learning for Sports Video Event Detection: Tasks, Datasets, Methods, and Challenges
- Authors: Hao Xu, Arbind Agrahari Baniya, Sam Well, Mohamed Reda Bouadjenek, Richard Dazeley, Sunil Aryal
- Abstract summary: Video event detection has become a cornerstone of modern sports analytics, powering automated performance evaluation, content generation, and tactical decision-making. Recent advances in deep learning have driven progress in related tasks such as Action Spotting (AS), which identifies a representative timestamp, and Precise Event Spotting (PES), which pinpoints the exact frame of an event.
- Score: 12.534976311190748
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Video event detection has become a cornerstone of modern sports analytics, powering automated performance evaluation, content generation, and tactical decision-making. Recent advances in deep learning have driven progress in related tasks such as Temporal Action Localization (TAL), which detects extended action segments; Action Spotting (AS), which identifies a representative timestamp; and Precise Event Spotting (PES), which pinpoints the exact frame of an event. Although closely connected, their subtle differences often blur the boundaries between them, leading to confusion in both research and practical applications. Furthermore, prior surveys either address generic video event detection or broader sports video tasks, but largely overlook the unique temporal granularity and domain-specific challenges of event spotting. In addition, most existing sports video surveys focus on elite-level competitions while neglecting the wider community of everyday practitioners. This survey addresses these gaps by: (i) clearly delineating TAL, AS, and PES and their respective use cases; (ii) introducing a structured taxonomy of state-of-the-art approaches, including temporal modeling strategies, multimodal frameworks, and data-efficient pipelines tailored for AS and PES; and (iii) critically assessing benchmark datasets and evaluation protocols, highlighting limitations such as reliance on broadcast-quality footage and metrics that over-reward permissive multilabel predictions. By synthesizing current research and exposing open challenges, this work provides a comprehensive foundation for developing temporally precise, generalizable, and practically deployable sports event detection systems for both the research and industry communities.
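The distinction the abstract draws between AS and PES comes down to the temporal tolerance used at evaluation time: a spotted event counts as correct if it lands within a window around the ground-truth frame, and PES shrinks that window to a frame or two. The sketch below illustrates this with a minimal, hypothetical precision/recall computation over event timestamps; it is not the official evaluation code of SoccerNet or any benchmark discussed in the survey, and the function and variable names are illustrative only.

```python
def match_spots(pred_frames, gt_frames, tolerance):
    """Greedy one-to-one matching: a prediction is a true positive if it
    falls within `tolerance` frames of a still-unmatched ground-truth event."""
    matched_gt = set()
    tp = 0
    for p in sorted(pred_frames):
        # find the closest unmatched ground-truth event within tolerance
        best, best_dist = None, tolerance + 1
        for i, g in enumerate(gt_frames):
            if i in matched_gt:
                continue
            d = abs(p - g)
            if d <= tolerance and d < best_dist:
                best, best_dist = i, d
        if best is not None:
            matched_gt.add(best)
            tp += 1
    fp = len(pred_frames) - tp
    fn = len(gt_frames) - tp
    precision = tp / (tp + fp) if pred_frames else 0.0
    recall = tp / (tp + fn) if gt_frames else 0.0
    return precision, recall

# Same predictions scored at a PES-style tight tolerance (1 frame)
# versus an AS-style loose tolerance (25 frames, i.e. ~1 s at 25 fps).
gt = [100, 400, 900]
preds = [101, 395, 950]
print(match_spots(preds, gt, tolerance=1))   # tight: only the first event counts
print(match_spots(preds, gt, tolerance=25))  # loose: two of three events count
```

Under the tight tolerance only the prediction at frame 101 matches, while the loose tolerance also accepts the 5-frame miss at 395, which is why the same model can look far stronger under AS-style metrics than under PES-style ones.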
Related papers
- Tracking and Segmenting Anything in Any Modality [75.32774085793498]
We propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input. SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
arXiv Detail & Related papers (2025-11-22T09:09:22Z) - Online Generic Event Boundary Detection [27.34486732049466]
We introduce a new task, Online Generic Event Boundary Detection (On-GEBD), aiming to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real-time, without access to future frames. We propose a novel On-GEBD framework, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging discrepancies between predicted and actual information.
arXiv Detail & Related papers (2025-10-08T10:23:45Z) - Velocity Completion Task and Method for Event-based Player Positional Data in Soccer [0.9002260638342727]
Event-based positional data lacks the continuous temporal information needed to calculate crucial properties such as velocity. We propose a new method to simultaneously complete the velocity of all agents using only the event-based positional data from team sports.
arXiv Detail & Related papers (2025-05-22T04:01:49Z) - Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection [67.70328796057466]
Grounding-MD is an innovative, grounded video-language pre-training framework tailored for open-world moment detection. Our framework incorporates an arbitrary number of open-ended natural language queries through a structured prompt mechanism. Grounding-MD demonstrates exceptional semantic representation learning capabilities, effectively handling diverse and complex query conditions.
arXiv Detail & Related papers (2025-04-20T09:54:25Z) - OpenSTARLab: Open Approach for Spatio-Temporal Agent Data Analysis in Soccer [0.9207076627649226]
Sports analytics has become more professional and sophisticated, driven by the growing availability of detailed performance data. In soccer, the effective utilization of event and tracking data is fundamental for capturing and analyzing the dynamics of the game. Here we propose OpenSTARLab, an open-source framework designed to democratize spatio-temporal agent data analysis in sports.
arXiv Detail & Related papers (2025-02-05T00:14:18Z) - Multi-Order Hyperbolic Graph Convolution and Aggregated Attention for Social Event Detection [4.183900122103969]
Social event detection (SED) is a task focused on identifying specific real-world events and has broad applications across various domains. This paper introduces a novel framework, Multi-Order Hyperbolic Graph Convolution with Aggregated Attention (MOHGCAA), designed to enhance the performance of SED.
arXiv Detail & Related papers (2025-02-01T07:15:40Z) - About Time: Advances, Challenges, and Outlooks of Action Understanding [57.76390141287026]
This survey comprehensively reviews advances in uni- and multi-modal action understanding across a range of tasks. We focus on prevalent challenges, overview widely adopted datasets, and survey seminal works with an emphasis on recent advances.
arXiv Detail & Related papers (2024-11-22T18:09:27Z) - WearableMil: An End-to-End Framework for Military Activity Recognition and Performance Monitoring [7.130450173185638]
This paper introduces an end-to-end framework for preprocessing, analyzing, and recognizing activities from wearable data in military training contexts. We use data from 135 soldiers wearing Garmin-55 smartwatches over six months, comprising over 15 million minutes. Our framework addresses missing data through physiologically-informed methods, reducing unknown sleep states from 40.38% to 3.66%.
arXiv Detail & Related papers (2024-10-07T19:35:15Z) - Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrates promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z) - Deep learning for action spotting in association football videos [64.10841325879996]
The SoccerNet initiative organizes yearly challenges, during which participants from all around the world compete to achieve state-of-the-art performances.
This paper traces the history of action spotting in sports, from the creation of the task back in 2018, to the role it plays today in research and the sports industry.
arXiv Detail & Related papers (2024-10-02T07:56:15Z) - A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities [2.916558661202724]
Human Activity Recognition (HAR) systems aim to understand human behaviour and assign a label to each action.
HAR can leverage various data modalities, such as RGB images and video, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals.
This paper presents a comprehensive survey of the latest advancements in HAR from 2014 to 2024.
arXiv Detail & Related papers (2024-09-15T10:04:44Z) - Harnessing Temporal Causality for Advanced Temporal Action Detection [53.654457142657236]
We introduce CausalTAD, which combines causal attention and causal Mamba to achieve state-of-the-art performance on benchmarks.
We ranked 1st in the Action Recognition, Action Detection, and Audio-Based Interaction Detection tracks at the EPIC-Kitchens Challenge 2024, and 1st in the Moment Queries track at the Ego4D Challenge 2024.
arXiv Detail & Related papers (2024-07-25T06:03:02Z) - OSL-ActionSpotting: A Unified Library for Action Spotting in Sports Videos [56.393522913188704]
We introduce OSL-ActionSpotting, a Python library that unifies different action spotting algorithms to streamline research and applications in sports video analytics.
We successfully integrated three cornerstone action spotting methods into OSL-ActionSpotting, achieving performance metrics that match those of the original, disparate implementations.
arXiv Detail & Related papers (2024-07-01T13:17:37Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Event-based Simultaneous Localization and Mapping: A Comprehensive Survey [52.73728442921428]
We review event-based vSLAM algorithms that exploit the benefits of asynchronous and irregular event streams for localization and mapping tasks.
We categorize event-based vSLAM methods into four main categories: feature-based, direct, motion-compensation, and deep learning methods.
arXiv Detail & Related papers (2023-04-19T16:21:14Z) - Towards Active Learning for Action Spotting in Association Football Videos [59.84375958757395]
Analyzing football videos is challenging and requires identifying subtle and diverse spatio-temporal patterns.
Current algorithms face significant challenges when learning from limited annotated data.
We propose an active learning framework that selects the most informative video samples to be annotated next.
arXiv Detail & Related papers (2023-04-09T11:50:41Z) - Video Action Detection: Analysing Limitations and Challenges [70.01260415234127]
We analyze existing datasets on video action detection and discuss their limitations.
We perform a biasness study which analyzes a key property differentiating videos from static images: the temporal aspect.
Such extreme experiments show the existence of biases which have managed to creep into existing methods despite careful modeling.
arXiv Detail & Related papers (2022-04-17T00:42:14Z) - Reliable Shot Identification for Complex Event Detection via Visual-Semantic Embedding [72.9370352430965]
We propose a visual-semantic guided loss method for event detection in videos.
Motivated by curriculum learning, we introduce a negative elastic regularization term to start training the classifier with instances of high reliability.
An alternative optimization algorithm is developed to solve the proposed challenging non-convex regularization problem.
arXiv Detail & Related papers (2021-10-12T11:46:56Z) - Toyota Smarthome Untrimmed: Real-World Untrimmed Videos for Activity Detection [6.682959425576476]
We introduce a new untrimmed daily-living dataset that features several real-world challenges: Toyota Smarthome Untrimmed.
The dataset contains dense annotations including elementary, composite activities and activities involving interactions with objects.
We show that current state-of-the-art methods fail to achieve satisfactory performance on the TSU dataset.
We propose a new baseline method for activity detection to tackle the novel challenges provided by our dataset.
arXiv Detail & Related papers (2020-10-28T13:47:16Z) - ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z) - Unsupervised and Interpretable Domain Adaptation to Rapidly Filter Tweets for Emergency Services [18.57009530004948]
We present a novel method to classify relevant tweets during an ongoing crisis using the publicly available dataset of TREC incident streams.
We use dedicated attention layers for each task to provide model interpretability, critical for real-world applications.
We show a practical implication of our work by providing a use-case for the COVID-19 pandemic.
arXiv Detail & Related papers (2020-03-04T06:40:14Z)