Related papers: Multi-Modal Soccer Scene Analysis with Masked Pre-Training

URL: http://arxiv.org/abs/2512.19528v1
Date: Mon, 22 Dec 2025 16:18:45 GMT
Title: Multi-Modal Soccer Scene Analysis with Masked Pre-Training
Authors: Marc Peral, Guillem Capellera, Luis Ferraz, Antonio Rubio, Antonio Agudo
Abstract summary: We propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage. Our solution integrates three distinct input modalities into a unified framework. We show the effectiveness of our approach on a large-scale dataset.
Score: 16.853768247588743
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this work we propose a multi-modal architecture for analyzing soccer scenes from tactical camera footage, with a focus on three core tasks: ball trajectory inference, ball state classification, and ball possessor identification. To this end, our solution integrates three distinct input modalities (player trajectories, player types and image crops of individual players) into a unified framework that processes spatial and temporal dynamics using a cascade of sociotemporal transformer blocks. Unlike prior methods, which rely heavily on accurate ball tracking or handcrafted heuristics, our approach infers the ball trajectory without direct access to its past or future positions, and robustly identifies the ball state and ball possessor under noisy or occluded conditions from real top league matches. We also introduce CropDrop, a modality-specific masking pre-training strategy that prevents over-reliance on image features and encourages the model to rely on cross-modal patterns during pre-training. We show the effectiveness of our approach on a large-scale dataset providing substantial improvements over state-of-the-art baselines in all tasks. Our results highlight the benefits of combining structured and visual cues in a transformer-based architecture, and the importance of realistic masking strategies in multi-modal learning.
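
Illustrative sketch (not from the paper): the abstract describes CropDrop only as a modality-specific masking strategy that discourages over-reliance on image features. One plausible reading is that, during pre-training, each player's image-crop feature vector is randomly replaced by a mask vector while trajectories and player types are left intact. The function name `crop_drop`, the per-player drop probability `drop_prob`, and the zero mask value are all assumptions made for illustration; the actual mechanism may differ.

```python
import random


def crop_drop(crop_features, drop_prob, rng, mask_value=0.0):
    """Hypothetical modality-specific masking of image-crop features.

    crop_features: list of per-player feature vectors (lists of floats).
    drop_prob: assumed probability of masking each player's crop features.
    rng: a random.Random instance, passed in for reproducibility.
    Returns (masked_features, dropped_flags); the input is not modified.
    """
    masked, dropped = [], []
    for feat in crop_features:
        drop = rng.random() < drop_prob
        # A dropped crop is replaced by a constant mask vector of equal length.
        masked.append([mask_value] * len(feat) if drop else list(feat))
        dropped.append(drop)
    return masked, dropped


# Toy input: three players, two-dimensional crop features each.
crops = [[0.2, 0.7], [0.9, 0.1], [0.4, 0.4]]
masked, dropped = crop_drop(crops, drop_prob=0.5, rng=random.Random(0))
print(masked, dropped)
```

Under this reading, a model that sees masked crops for some players has to fall back on the other modalities for those players, which matches the cross-modal motivation stated in the abstract.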

Related papers

SoccerMaster: A Vision Foundation Model for Soccer Understanding [50.88251190999469]
Soccer understanding has recently garnered growing research interest due to its domain-specific complexity and unique challenges. This work aims to propose a unified model to handle diverse soccer visual understanding tasks, ranging from fine-grained perception to semantic reasoning. We present SoccerMaster, the first soccer-specific vision foundation model that unifies diverse understanding tasks within a single framework.
arXiv Detail & Related papers (2025-12-11T18:03:30Z)
CourtMotion: Learning Event-Driven Motion Representations from Skeletal Data for Basketball [45.88028371034407]
CourtMotion is a temporal modeling framework for analyzing and predicting game events and plays in professional basketball. Our two-stage approach first processes skeletal tracking data through Graph Neural Networks to capture nuanced motion patterns. We introduce event projection heads that explicitly connect player movements to basketball events like passes, shots, and steals, training the model to associate physical motion patterns with their purposes.
arXiv Detail & Related papers (2025-12-01T09:58:24Z)
FOOTPASS: A Multi-Modal Multi-Agent Tactical Context Dataset for Play-by-Play Action Spotting in Soccer Broadcast Videos [1.264619835497501]
We introduce the Footovision Play-by-Play Action Spotting in Soccer dataset (FOOTPASS). It is the first benchmark for play-by-play action spotting over entire soccer matches in a multi-agent tactical context. It enables the development of methods for player-centric action spotting that exploit both outputs from computer-vision tasks and prior knowledge of soccer.
arXiv Detail & Related papers (2025-11-20T09:42:28Z)
Real-time Localization of a Soccer Ball from a Single Camera [0.0]
We propose a computationally efficient method for real-time three-dimensional football trajectory reconstruction from a single broadcast camera. In contrast to previous work, our approach introduces a multi-mode state model with $W$ discrete modes to significantly accelerate optimization. The system operates on standard CPUs and achieves low latency suitable for live broadcast settings.
arXiv Detail & Related papers (2025-06-09T17:52:07Z)
Sports-Traj: A Unified Trajectory Generation Model for Multi-Agent Movement in Sports [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs. Specifically, we introduce a Ghost Spatial Masking (GSM) module, embedded within a Transformer encoder, for spatial feature extraction. We benchmark three practical sports datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
Ball Trajectory Inference from Multi-Agent Sports Contexts Using Set Transformer and Hierarchical Bi-LSTM [18.884300680050316]
This paper proposes an inference framework of ball trajectory from player trajectories as a cost-efficient alternative to ball tracking. The experimental results show that our model provides natural and accurate trajectories as well as admissible player ball possession at the same time. We suggest several practical applications of our framework including missing trajectory imputation, semi-automated pass annotation, automated zoom-in for match broadcasting, and calculating possession-wise running performance metrics.
arXiv Detail & Related papers (2023-06-14T02:19:59Z)
Unifying Flow, Stereo and Depth Estimation [121.54066319299261]
We present a unified formulation and model for three motion and 3D perception tasks. We formulate all three tasks as a unified dense correspondence matching problem. Our model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks.
arXiv Detail & Related papers (2022-11-10T18:59:54Z)
SoccerNet-Tracking: Multiple Object Tracking Dataset and Benchmark in Soccer Videos [62.686484228479095]
We propose a novel dataset for multiple object tracking composed of 200 sequences of 30s each. The dataset is fully annotated with bounding boxes and tracklet IDs. Our analysis shows that multiple player, referee and ball tracking in soccer videos is far from being solved.
arXiv Detail & Related papers (2022-04-14T12:22:12Z)
End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos. We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z)
Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field. It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations. Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.