Knowledge Guided Entity-aware Video Captioning and A Basketball
Benchmark
- URL: http://arxiv.org/abs/2401.13888v2
- Date: Wed, 28 Feb 2024 02:26:03 GMT
- Title: Knowledge Guided Entity-aware Video Captioning and A Basketball
Benchmark
- Authors: Zeyu Xi and Ge Shi and Xuefen Li and Junchi Yan and Zun Li and Lifang
Wu and Zilin Liu and Liang Wang
- Abstract summary: We construct a multimodal basketball game knowledge graph (KG_NBA_2022) to provide additional knowledge beyond videos.
Then, a dataset that contains 9 types of fine-grained shooting events and 286 players' knowledge is constructed based on KG_NBA_2022.
We develop a knowledge-guided entity-aware video captioning network (KEANet), an encoder-decoder model conditioned on a candidate player list, for basketball live text broadcast.
- Score: 49.54265459763042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent emergence of video captioning models, generating text descriptions with specific entity names and fine-grained actions remains far from solved, even though it has important applications such as basketball live text broadcast. In this paper, we propose a new basketball benchmark for video captioning supported by a multimodal knowledge graph. Specifically, we construct a multimodal basketball game knowledge graph (KG_NBA_2022) to provide additional knowledge beyond the videos. Based on KG_NBA_2022, we then build a multimodal basketball game video captioning dataset (VC_NBA_2022) that contains 9 types of fine-grained shooting events and knowledge (i.e., images and names) of 286 players. We develop a knowledge-guided entity-aware video captioning network (KEANet), an encoder-decoder model conditioned on a candidate player list, for basketball live text broadcast. Temporal contextual information in the video is encoded with a bi-directional GRU (Bi-GRU) module, and an entity-aware module models the relationships among players and highlights the key players. Extensive experiments on multiple sports benchmarks demonstrate that KEANet effectively leverages extra knowledge and outperforms advanced video captioning models. The proposed dataset and corresponding code will be publicly released soon.
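To make the architecture described in the abstract concrete, here is a minimal, illustrative PyTorch sketch of the two named components: a Bi-GRU that encodes temporal context over frame features, and an entity-aware attention that scores a candidate player list against the video representation. This is not the authors' code (which has not been released yet); the feature dimensions, mean pooling, and dot-product attention form are assumptions.

```python
# Illustrative sketch only, not the released KEANet implementation.
import torch
import torch.nn as nn

class TemporalEntityEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, player_dim=512):
        super().__init__()
        # Bi-directional GRU encodes temporal context across video frames.
        self.bigru = nn.GRU(feat_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Project candidate player embeddings (e.g., from images and names)
        # into the same space as the video representation.
        self.player_proj = nn.Linear(player_dim, 2 * hidden_dim)

    def forward(self, frame_feats, player_feats):
        # frame_feats:  (B, T, feat_dim)   per-frame visual features
        # player_feats: (B, P, player_dim) embeddings of candidate players
        ctx, _ = self.bigru(frame_feats)           # (B, T, 2*hidden_dim)
        video_repr = ctx.mean(dim=1)               # (B, 2*hidden_dim), simple pooling
        players = self.player_proj(player_feats)   # (B, P, 2*hidden_dim)
        # Entity-aware scores: relevance of each candidate player to the clip.
        scores = torch.softmax(
            torch.bmm(players, video_repr.unsqueeze(-1)).squeeze(-1), dim=-1)  # (B, P)
        # Weighted player summary to be passed to the caption decoder.
        entity_repr = torch.bmm(scores.unsqueeze(1), players).squeeze(1)       # (B, 2*hidden_dim)
        return ctx, entity_repr, scores
```

In the full model, the decoder would presumably attend over both the Bi-GRU states and the entity summary while generating each word of the broadcast sentence.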
Related papers
- A Simple and Effective Temporal Grounding Pipeline for Basketball Broadcast Footage [0.0]
We present a reliable temporal grounding pipeline for video-to-analytic alignment of basketball broadcast footage.
Our method aligns a pre-labeled corpus of play-by-play annotations containing dense event annotations to video frames, enabling quick retrieval of labeled video segments.
arXiv Detail & Related papers (2024-10-30T17:27:44Z)
- Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show some challenges to generate a long and comprehensive video summary.
arXiv Detail & Related papers (2023-12-16T03:17:30Z)
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
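As a rough illustration of the similarity-based zero-shot classification described above (not code from the Videoprompter paper), a CLIP-style VLM scores a video against text embeddings of class labels; the class names, prompt template, and mean-pooling over frames below are assumptions.

```python
# Illustrative zero-shot scoring with OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["slam dunk", "three-point shot", "free throw"]  # example classes
text_tokens = clip.tokenize([f"a video of a {c}" for c in labels]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(text_tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # Placeholder frames; in practice these would be preprocessed video frames (T, 3, 224, 224).
    frames = torch.randn(8, 3, 224, 224).to(device)
    frame_emb = model.encode_image(frames)
    video_emb = frame_emb.mean(dim=0, keepdim=True)  # simple temporal pooling
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    # Scaled cosine similarity between the video and each class description.
    scores = (100.0 * video_emb @ text_emb.T).softmax(dim=-1)

print(dict(zip(labels, scores.squeeze(0).tolist())))
```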
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
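For readers unfamiliar with the matching step named above, here is a toy example of Hungarian matching between sub-instructions and detected video events; the similarity matrix and cost definition are hypothetical, not taken from the VidCoM paper.

```python
# Toy Hungarian matching between sub-instructions and video events.
import numpy as np
from scipy.optimize import linear_sum_assignment

# similarity[i, j]: how well sub-instruction i matches video event j (made-up values)
similarity = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
    [0.1, 0.3, 0.7],
])

# The Hungarian algorithm minimizes total cost, so negate similarity to maximize it.
rows, cols = linear_sum_assignment(-similarity)
for i, j in zip(rows, cols):
    print(f"sub-instruction {i} -> video event {j} (sim={similarity[i, j]:.2f})")
```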
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for Real-time Soccer Commentary Generation [75.60413443783953]
We present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, proposing a challenging new task setting: Knowledge-grounded Video Captioning (KGVC).
Our data and code are available at https://github.com/THU-KEG/goal.
arXiv Detail & Related papers (2023-03-26T08:43:36Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- P2ANet: A Dataset and Benchmark for Dense Action Detection from Table Tennis Match Broadcasting Videos [64.57435509822416]
The dataset consists of 2,721 video clips collected from broadcast videos of professional table tennis matches at World Table Tennis Championships and Olympiads.
We formulate two sets of action detection problems: action localization and action recognition.
The results confirm that P2ANet is still a challenging task and can be used as a special benchmark for dense action detection from videos.
arXiv Detail & Related papers (2022-07-26T08:34:17Z)
- Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.