Distantly Supervised Semantic Text Detection and Recognition for
Broadcast Sports Videos Understanding
- URL: http://arxiv.org/abs/2111.00629v1
- Date: Sun, 31 Oct 2021 23:59:29 GMT
- Title: Distantly Supervised Semantic Text Detection and Recognition for
Broadcast Sports Videos Understanding
- Authors: Avijit Shah, Topojoy Biswas, Sathish Ramadoss, Deven Santosh Shah
- Abstract summary: We study extremely accurate semantic text detection and recognition in sports clocks.
We propose a novel distant supervision technique to automatically build sports clock datasets.
We share our computational architecture pipeline to scale this system in an industrial setting.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Comprehensive understanding of key players and actions in multiplayer sports
broadcast videos is a challenging problem. Unlike news or finance videos,
sports videos contain limited text. While both action recognition for multiplayer
sports and detection of players have seen robust research, understanding
contextual text in video frames still remains one of the most impactful avenues
of sports video understanding. In this work, we study extremely accurate
semantic text detection and recognition in sports clocks, and the challenges
therein. We observe unique properties of sports clocks that make it hard to
utilize general-purpose pre-trained detectors and recognizers, and hard to
understand text accurately enough to align it with external knowledge. We
propose a novel distant supervision technique to automatically build sports
clock datasets. Combined with suitable data augmentations and any
state-of-the-art text detection and recognition model architecture, this
technique lets us extract extremely accurate semantic text. Finally, we share
our computational architecture pipeline for scaling this system in an
industrial setting and propose a robust dataset to validate our results.
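The core distant-supervision idea, accepting a recognizer's output as a training label only when it agrees with external game metadata, can be illustrated compactly. Below is a minimal, hypothetical Python sketch, not the authors' released code: `recognize_text`, the play-by-play schema, and the tolerance window are all assumptions made for illustration.

```python
import re

# Hypothetical distant-supervision labeler: a recognized clock string becomes
# an auto-generated ground-truth label only if it matches a plausible game
# state drawn from external play-by-play metadata.

CLOCK_RE = re.compile(r"^(?P<min>\d{1,2}):(?P<sec>[0-5]\d)$")

def plausible_states(play_by_play, tolerance_sec=2):
    """Expand play-by-play events into a set of plausible (quarter, seconds-left) states."""
    states = set()
    for event in play_by_play:  # assumed schema: {"quarter": int, "clock_sec": int}
        for dt in range(-tolerance_sec, tolerance_sec + 1):
            states.add((event["quarter"], event["clock_sec"] + dt))
    return states

def distant_label(clock_crop, quarter_hint, states, recognize_text):
    """Return (crop, text) as an auto-labeled sample, or None if unverified."""
    text = recognize_text(clock_crop)      # any pre-trained text recognizer
    m = CLOCK_RE.match(text)
    if m is None:
        return None                        # not a syntactically valid clock string
    seconds_left = int(m["min"]) * 60 + int(m["sec"])
    if (quarter_hint, seconds_left) not in states:
        return None                        # disagrees with external knowledge
    return clock_crop, text                # high-precision training pair
```

Only frames whose recognized text survives this cross-check would enter the dataset, trading recall for the high label precision the abstract emphasizes.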
Related papers
- Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video [5.885902974241053]
Reasoning over complex sports scenarios has posed significant challenges to current NLP technologies.
Our evaluation spans from simple queries on basic rules and historical facts to complex, context-specific reasoning.
We propose a new benchmark based on a comprehensive overview of existing sports datasets and provide extensive error analysis.
arXiv Detail & Related papers (2024-06-21T05:57:50Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using language as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that this simple and straightforward idea is quite effective, achieving state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching [77.0306273129475]
Video text spotting presents an additional challenge with the inclusion of tracking.
GoMatching focuses the training efforts on tracking while maintaining strong recognition performance.
GoMatching delivers new records on ICDAR15-video, DSText, BOVText, and our proposed novel test set with arbitrary-shaped text, termed ArTVideo.
arXiv Detail & Related papers (2024-01-13T13:59:15Z)
- Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports [90.79212954022218]
We introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task.
The Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions.
We propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering.
arXiv Detail & Related papers (2024-01-03T02:22:34Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- Sports Video Analysis on Large-Scale Data [10.24207108909385]
This paper investigates the modeling of automated machine description of sports videos.
We propose a novel large-scale NBA dataset for Sports Video Analysis (NSVA) with a focus on captioning.
arXiv Detail & Related papers (2022-08-09T16:59:24Z)
- A Survey on Video Action Recognition in Sports: Datasets, Methods and Applications [60.3327085463545]
We present a survey on video action recognition for sports analytics.
We introduce more than ten types of sports, including team sports such as football, basketball, volleyball, and hockey, and individual sports such as figure skating, gymnastics, table tennis, diving, and badminton.
We develop a toolbox using PaddlePaddle, which supports football, basketball, table tennis and figure skating action recognition.
arXiv Detail & Related papers (2022-06-02T13:19:36Z)
- A New Action Recognition Framework for Video Highlights Summarization in Sporting Events [9.870478438166288]
We present a framework to automatically clip sports video streams using a three-level prediction algorithm based on two classical open-source architectures, YOLO-v3 and OpenPose.
It is found that with a modest amount of sports video training data, our methodology can accurately clip sports activity highlights.
arXiv Detail & Related papers (2020-12-01T04:14:40Z)
- SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos [71.72665910128975]
SoccerNet-v2 is a novel large-scale corpus of manual annotations for the SoccerNet video dataset.
We release around 300k annotations within SoccerNet's 500 untrimmed broadcast soccer videos.
We extend current tasks in the realm of soccer to include action spotting and camera shot segmentation with boundary detection.
arXiv Detail & Related papers (2020-11-26T16:10:16Z)
- Event detection in coarsely annotated sports videos via parallel multi receptive field 1D convolutions [14.30009544149561]
In problems such as sports video analytics, it is difficult to obtain accurate frame-level annotations and exact event durations.
We propose the task of event detection in coarsely annotated videos.
We introduce a multi-tower temporal convolutional network architecture for the proposed task, sketched after this list.
arXiv Detail & Related papers (2020-04-13T19:51:25Z)
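The parallel multi-receptive-field idea from the last entry lends itself to a compact sketch: several 1D convolution towers with different kernel widths read the same per-frame feature sequence, so short and long temporal contexts are scored jointly. The PyTorch sketch below is a minimal illustration under assumed dimensions; the tower count, channel sizes, and kernel widths are illustrative, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative multi-tower temporal convolutional network: each Conv1d tower
# has a different kernel width (receptive field), their outputs are
# concatenated along the channel axis, and a 1x1 convolution produces
# per-frame event scores. All sizes here are assumptions for the sketch.

class MultiReceptiveField1D(nn.Module):
    def __init__(self, in_dim=512, tower_dim=64,
                 kernel_sizes=(3, 7, 15, 31), n_events=5):
        super().__init__()
        self.towers = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_dim, tower_dim, k, padding=k // 2),  # keeps sequence length
                nn.ReLU(),
            )
            for k in kernel_sizes
        ])
        self.head = nn.Conv1d(tower_dim * len(kernel_sizes), n_events, kernel_size=1)

    def forward(self, x):                 # x: (batch, in_dim, n_frames)
        z = torch.cat([tower(x) for tower in self.towers], dim=1)
        return self.head(z)               # (batch, n_events, n_frames) per-frame scores

# Usage: score 5 event classes over 128 frames of 512-dim features.
scores = MultiReceptiveField1D()(torch.randn(2, 512, 128))
```

Running several kernel widths in parallel lets the network detect both brief and extended events without committing to a single temporal scale, which suits the coarse annotations the paper targets.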
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.