CLIP-ReIdent: Contrastive Training for Player Re-Identification
- URL: http://arxiv.org/abs/2303.11855v1
- Date: Tue, 21 Mar 2023 13:55:27 GMT
- Title: CLIP-ReIdent: Contrastive Training for Player Re-Identification
- Authors: Konrad Habel, Fabian Deuser, Norbert Oswald
- Abstract summary: We investigate whether it is possible to transfer the outstanding zero-shot performance of pre-trained CLIP models to the domain of player re-identification.
Unlike previous work, our approach is entirely class-agnostic and benefits from large-scale pre-training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sports analytics benefits from recent advances in machine learning, providing a competitive advantage for teams or individuals. One important task in this context is the performance measurement of individual players to provide reports and log files for subsequent analysis. During sport events like basketball, this involves the re-identification of players during a match, either from multiple camera viewpoints or from a single camera viewpoint at different times. In this work, we investigate whether it is possible to transfer the outstanding zero-shot performance of pre-trained CLIP models to the domain of player re-identification. For this purpose, we reformulate the contrastive language-to-image pre-training approach from CLIP into a contrastive image-to-image training approach using the InfoNCE loss as the training objective. Unlike previous work, our approach is entirely class-agnostic and benefits from large-scale pre-training. With a fine-tuned CLIP ViT-L/14 model we achieve 98.44% mAP on the MMSports 2022 Player Re-Identification challenge. Furthermore, we show that CLIP Vision Transformers already have strong OCR capabilities to identify useful player features like shirt numbers in a zero-shot manner, without any fine-tuning on the dataset. By applying the Score-CAM algorithm, we visualise the most important image regions that our fine-tuned model identifies when calculating the similarity score between two images of a player.
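To make the reformulated objective concrete, the sketch below shows a symmetric InfoNCE loss over image pairs, assuming a shared CLIP image encoder and a batch in which row i of both views depicts the same player. This is a minimal illustration under those assumptions, not the authors' code; the function name, temperature, and batching scheme are placeholders.

```python
import torch
import torch.nn.functional as F

def info_nce_image_to_image(encoder, view_a, view_b, temperature=0.07):
    """Symmetric InfoNCE over two batches of player images.

    view_a, view_b: tensors of shape (batch, 3, H, W) holding two views
    (e.g. different augmentations or camera viewpoints) of the same batch
    of players; row i in both batches is assumed to show the same player.
    """
    # L2-normalised embeddings from the shared CLIP image encoder
    z_a = F.normalize(encoder(view_a), dim=-1)
    z_b = F.normalize(encoder(view_b), dim=-1)

    # Cosine-similarity logits, scaled by the temperature
    logits = z_a @ z_b.t() / temperature

    # The positive pair for row i is column i; every other image
    # in the batch acts as a negative.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss, mirroring CLIP's language-to-image objective
    loss_a = F.cross_entropy(logits, targets)
    loss_b = F.cross_entropy(logits.t(), targets)
    return (loss_a + loss_b) / 2
```

At inference time, re-identification then reduces to ranking gallery images by cosine similarity to the query embedding. The zero-shot shirt-number observation can be reproduced with the public OpenAI CLIP API along these lines; the prompt template, the 0-99 number range, and the file name player_crop.jpg are illustrative assumptions, not taken from the paper.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# ViT-L/14 is the backbone the paper fine-tunes; here it is used zero-shot.
model, preprocess = clip.load("ViT-L/14", device=device)

# Hypothetical player crop; prompts enumerate candidate shirt numbers.
image = preprocess(Image.open("player_crop.jpg")).unsqueeze(0).to(device)
prompts = [f"a basketball player wearing the jersey number {n}" for n in range(100)]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

print("most likely shirt number:", probs.argmax().item())
```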
Related papers
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- Masked Autoencoding Does Not Help Natural Language Supervision at Scale [16.277390808400828]
We investigate whether a similar approach can be effective when trained with a much larger amount of data.
We find that a combination of two state-of-the-art approaches, masked autoencoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs.
arXiv Detail & Related papers (2023-01-19T01:05:18Z)
- A Graph-Based Method for Soccer Action Spotting Using Unsupervised Player Classification [75.93186954061943]
Action spotting involves understanding the dynamics of the game, the complexity of events, and the variation of video sequences.
In this work, we focus on the former by (a) identifying and representing the players, referees, and goalkeepers as nodes in a graph, and by (b) modeling their temporal interactions as sequences of graphs.
For the player identification task, our method, combined with other modalities, obtains an overall performance of 57.83% average-mAP.
arXiv Detail & Related papers (2022-11-22T15:23:53Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- Sports Re-ID: Improving Re-Identification Of Players In Broadcast Videos Of Team Sports [0.0]
This work focuses on player re-identification in broadcast videos of team sports.
Specifically, we focus on identifying the same player in images captured from different camera viewpoints during any given moment of a match.
arXiv Detail & Related papers (2022-06-06T06:06:23Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image instead of using artificial mask tokens.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Unsupervised Visual Representation Learning by Tracking Patches in Video [88.56860674483752]
We propose to use tracking as a proxy task for a computer vision system to learn the visual representations.
Modelled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations.
arXiv Detail & Related papers (2021-05-06T09:46:42Z)
- Unsupervised Temporal Feature Aggregation for Event Detection in Unstructured Sports Videos [10.230408415438966]
We study the case of event detection in sports videos for unstructured environments with arbitrary camera angles.
We identify and solve two major problems: unsupervised identification of players in an unstructured setting and generalization of the trained models to pose variations due to arbitrary shooting angles.
arXiv Detail & Related papers (2020-02-19T10:24:22Z)