GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for
Real-time Soccer Commentary Generation
- URL: http://arxiv.org/abs/2303.14655v2
- Date: Thu, 5 Oct 2023 06:55:13 GMT
- Title: GOAL: A Challenging Knowledge-grounded Video Captioning Benchmark for
Real-time Soccer Commentary Generation
- Authors: Ji Qi, Jifan Yu, Teng Tu, Kunyu Gao, Yifan Xu, Xinyu Guan, Xiaozhi
Wang, Yuxiao Dong, Bin Xu, Lei Hou, Juanzi Li, Jie Tang, Weidong Guo, Hui
Liu, Yu Xu
- Abstract summary: We present GOAL, a benchmark of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples, defining a challenging new task setting, Knowledge-grounded Video Captioning (KGVC).
Our data and code are available at https://github.com/THU-KEG/goal.
- Score: 75.60413443783953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the recent emergence of video captioning models, generating
vivid, fine-grained video descriptions grounded in background knowledge (i.e.,
long and informative commentary about domain-specific scenes with appropriate
reasoning) remains far from solved, even though it has compelling applications
such as automatic sports narration. In this paper, we present GOAL, a benchmark
of over 8.9k soccer video clips, 22k sentences, and 42k knowledge triples,
which defines a challenging new task setting, Knowledge-grounded Video
Captioning (KGVC). Moreover, we experimentally adapt existing methods to show
the difficulty of, and potential directions for, solving this valuable and
applicable task. Our data and code are available
at https://github.com/THU-KEG/goal.
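
To make the KGVC setting concrete, the sketch below shows what a single training example could look like as a data structure. The class and field names (KGVCExample, clip_path, triples, and so on) are illustrative assumptions, not the actual GOAL schema; see the repository above for the real format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KnowledgeTriple:
    # Hypothetical triple layout: (subject, relation, object).
    subject: str   # e.g. a player or team entity
    relation: str  # e.g. "position" or "playsFor"
    obj: str

@dataclass
class KGVCExample:
    # Hypothetical example layout; not the actual GOAL schema.
    clip_path: str         # one soccer video clip
    commentary: List[str]  # fine-grained commentary sentences
    triples: List[KnowledgeTriple] = field(default_factory=list)

example = KGVCExample(
    clip_path="clips/match_0001_shot_07.mp4",
    commentary=["The striker peels off his marker and slots it into the far corner."],
    triples=[KnowledgeTriple("player_23", "position", "striker")],
)

# A KGVC model must condition generation on both the clip and the triples:
#   caption = model.generate(video=example.clip_path, knowledge=example.triples)
print(example.triples[0])
```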
Related papers
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
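
The unified output space described above can be illustrated by discretizing continuous targets into vocabulary tokens. The bin count and token naming below are assumptions chosen for the sketch, not OmniVid's actual implementation.

```python
from typing import List

NUM_BINS = 100  # assumed quantization granularity, for illustration only

def quantize(value: float, lo: float, hi: float) -> int:
    """Map a continuous value in [lo, hi] to one of NUM_BINS discrete bins."""
    frac = (value - lo) / (hi - lo)
    return min(NUM_BINS - 1, max(0, int(frac * NUM_BINS)))

def time_token(seconds: float, clip_len: float) -> str:
    """Encode a timestamp as a single vocabulary token."""
    return f"<time_{quantize(seconds, 0.0, clip_len)}>"

def box_tokens(x0: float, y0: float, x1: float, y1: float) -> List[str]:
    """Encode a normalized bounding box as four coordinate tokens."""
    return [f"<box_{quantize(c, 0.0, 1.0)}>" for c in (x0, y0, x1, y1)]

# A localization target then becomes an ordinary token sequence, e.g.
# "goal kick" <time_7> <time_12> <box_10> <box_42> <box_55> <box_90>
print(time_token(7.3, 60.0), box_tokens(0.1, 0.42, 0.55, 0.9))
```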
- Detours for Navigating Instructional Videos [58.1645668396789]
We propose VidDetours, a video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's.
We show our model's significant improvements over best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%.
arXiv Detail & Related papers (2024-01-03T16:38:56Z)
- Implicit and Explicit Commonsense for Multi-sentence Video Captioning [33.969215964292395]
We propose a novel video captioning Transformer-based model that takes into account both implicit (visuo-lingual and purely linguistic) and explicit (knowledge-base) commonsense knowledge.
We show that these forms of knowledge, in isolation and in combination, enhance the quality of produced captions.
arXiv Detail & Related papers (2023-03-14T00:19:11Z)
- Going for GOAL: A Resource for Grounded Football Commentaries [66.10040637644697]
We present GrOunded footbAlL commentaries (GOAL), a novel dataset of football (or 'soccer') highlights videos with transcribed live commentaries in English.
We provide state-of-the-art baselines for the following tasks: frame reordering, moment retrieval, live commentary retrieval and play-by-play live commentary generation.
Results show that SOTA models perform reasonably well in most tasks.
arXiv Detail & Related papers (2022-11-08T20:04:27Z)
- Learning Transferable Spatiotemporal Representations from Natural Script Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations.
These advantages enable our model to contextualize what is happening like a human being and to apply seamlessly to large-scale uncurated video data in the real world.
arXiv Detail & Related papers (2022-09-30T07:39:48Z)
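
A minimal sketch of the transcript-sorting pretext task described above: shuffle a clip's ASR sentences and ask the model to recover the original order. Only the sample construction is shown; in TVTS the predicted order is conditioned on learned video representations, which this toy code does not model.

```python
import random

def make_tvts_sample(transcript_sentences):
    """Build a (shuffled_scripts, target_order) pair for the sorting pretext task.

    target_order[i] is the original position of shuffled_scripts[i], so a model
    that predicts target_order has recovered the true temporal order.
    """
    order = list(range(len(transcript_sentences)))
    random.shuffle(order)
    shuffled = [transcript_sentences[i] for i in order]
    return shuffled, order

sentences = [
    "The keeper rolls it out to the left back.",
    "A quick one-two opens up the midfield.",
    "The cross comes in and the header goes just wide.",
]
shuffled, target = make_tvts_sample(sentences)
print(shuffled)
print(target)  # the permutation the model must predict, conditioned on the video
```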
- A Multi-stage deep architecture for summary generation of soccer videos [11.41978608521222]
We propose a method to generate the summary of a soccer match by exploiting both the audio and the event metadata.
The results show that our method can detect the actions of the match, identify which of these actions should belong to the summary and then propose multiple candidate summaries.
arXiv Detail & Related papers (2022-05-02T07:26:35Z)
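
The multi-stage idea above can be sketched as a detect-then-select pipeline over event metadata and an audio excitement signal. The event types, excitement scores, and threshold are assumptions made for illustration, not the paper's actual architecture.

```python
from typing import Dict, List

def detect_actions(event_metadata: List[Dict]) -> List[Dict]:
    """Stage 1 (stub): keep only events that look like match actions."""
    return [e for e in event_metadata if e["type"] in {"goal", "shot", "card"}]

def select_for_summary(actions: List[Dict],
                       audio_excitement: Dict[int, float],
                       threshold: float = 0.6) -> List[Dict]:
    """Stage 2: keep actions whose surrounding audio is 'excited' enough.

    audio_excitement maps an event time (s) to a 0..1 crowd/commentator score;
    both the score and the threshold are illustrative assumptions. A final
    stage would assemble the surviving actions into candidate summaries.
    """
    return [a for a in actions if audio_excitement.get(a["time"], 0.0) >= threshold]

events = [
    {"time": 131, "type": "goal"},
    {"time": 802, "type": "throw_in"},
    {"time": 1540, "type": "shot"},
]
excitement = {131: 0.95, 1540: 0.40}
print(select_for_summary(detect_actions(events), excitement))
# -> [{'time': 131, 'type': 'goal'}]
```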
- SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos [71.72665910128975]
SoccerNet-v2 is a novel large-scale corpus of manual annotations for the SoccerNet video dataset.
We release around 300k annotations within SoccerNet's 500 untrimmed broadcast soccer videos.
We extend current tasks in the realm of soccer to include action spotting and camera shot segmentation with boundary detection.
arXiv Detail & Related papers (2020-11-26T16:10:16Z)
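
Action spotting, as used above, asks a model to predict a single anchor timestamp per action instance. The record layout and the 5-second tolerance in this sketch are illustrative assumptions rather than SoccerNet-v2's exact annotation format or evaluation metric.

```python
def is_spotted(predicted_s: float, annotated_s: float,
               tolerance_s: float = 5.0) -> bool:
    """Count an action as spotted if the prediction falls within
    +/- tolerance_s of the annotated anchor time (tolerance value assumed)."""
    return abs(predicted_s - annotated_s) <= tolerance_s

annotation = {"label": "Goal", "half": 1, "time_s": 751.0}  # illustrative record
print(is_spotted(748.5, annotation["time_s"]))  # True within a 5 s tolerance
```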
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.