TrajSV: A Trajectory-based Model for Sports Video Representations and Applications
- URL: http://arxiv.org/abs/2508.11569v1
- Date: Fri, 15 Aug 2025 16:23:36 GMT
- Title: TrajSV: A Trajectory-based Model for Sports Video Representations and Applications
- Authors: Zheng Wang, Shihao Xu, Wei Shi,
- Abstract summary: We present TrajSV, a trajectory-based framework that addresses various issues in existing studies.<n>TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet)
- Score: 8.54429529591851
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Sports analytics has received significant attention from both academia and industry in recent years. Despite the growing interest and efforts in this field, several issues remain unresolved, including (1) data unavailability, (2) lack of an effective trajectory-based framework, and (3) requirement for sufficient supervision labels. In this paper, we present TrajSV, a trajectory-based framework that addresses various issues in existing studies. TrajSV comprises three components: data preprocessing, Clip Representation Network (CRNet), and Video Representation Network (VRNet). The data preprocessing module extracts player and ball trajectories from sports broadcast videos. CRNet utilizes a trajectory-enhanced Transformer module to learn clip representations based on these trajectories. Additionally, VRNet learns video representations by aggregating clip representations and visual features with an encoder-decoder architecture. Finally, a triple contrastive loss is introduced to optimize both video and clip representations in an unsupervised manner. The experiments are conducted on three broadcast video datasets to verify the effectiveness of TrajSV for three types of sports (i.e., soccer, basketball, and volleyball) with three downstream applications (i.e., sports video retrieval, action spotting, and video captioning). The results demonstrate that TrajSV achieves state-of-the-art performance in sports video retrieval, showcasing a nearly 70% improvement. It outperforms baselines in action spotting, achieving state-of-the-art results in 9 out of 17 action categories, and demonstrates a nearly 20% improvement in video captioning. Additionally, we introduce a deployed system along with the three applications based on TrajSV.
Related papers
- Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval [53.54695034420311]
In practice, videos are typically untrimmed in long durations with much more complicated background content.<n>We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model.<n>Experiment results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets.
arXiv Detail & Related papers (2025-10-14T08:38:20Z) - Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z) - STOA-VLP: Spatial-Temporal Modeling of Object and Action for
Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z) - Sports Video Analysis on Large-Scale Data [10.24207108909385]
This paper investigates the modeling of automated machine description on sports video.
We propose a novel large-scale NBA dataset for Sports Video Analysis (NSVA) with a focus on captioning.
arXiv Detail & Related papers (2022-08-09T16:59:24Z) - MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose the best prior-knowledge ones.
arXiv Detail & Related papers (2022-03-14T13:15:09Z) - Reading-strategy Inspired Visual Representation Learning for
Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into common spaces for semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z) - Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - Few-Shot Video Object Detection [70.43402912344327]
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises of 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are jointly and end-to-end trained.
arXiv Detail & Related papers (2021-04-30T07:38:04Z) - Hybrid Dynamic-static Context-aware Attention Network for Action
Assessment in Long Videos [96.45804577283563]
We present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos.
We learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames.
We combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts.
arXiv Detail & Related papers (2020-08-13T15:51:42Z) - Event detection in coarsely annotated sports videos via parallel multi
receptive field 1D convolutions [14.30009544149561]
In problems such as sports video analytics, it is difficult to obtain accurate frame level annotations and exact event duration.
We propose the task of event detection in coarsely annotated videos.
We introduce a multi-tower temporal convolutional network architecture for the proposed task.
arXiv Detail & Related papers (2020-04-13T19:51:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.