SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset
- URL: http://arxiv.org/abs/2405.07354v1
- Date: Sun, 12 May 2024 18:25:38 GMT
- Title: SoccerNet-Echoes: A Soccer Game Audio Commentary Dataset
- Authors: Sushant Gautam, Mehdi Houshmand Sarkhoosh, Jan Held, Cise Midoglu, Anthony Cioppa, Silvio Giancola, Vajira Thambawita, Michael A. Riegler, Pål Halvorsen, Mubarak Shah
- Abstract summary: This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts.
By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games.
- Score: 46.60191376520379
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The application of Automatic Speech Recognition (ASR) technology in soccer offers numerous opportunities for sports analytics. Specifically, extracting audio commentaries with ASR provides valuable insights into the events of the game, and opens the door to several downstream applications such as automatic highlight generation. This paper presents SoccerNet-Echoes, an augmentation of the SoccerNet dataset with automatically generated transcriptions of audio commentaries from soccer game broadcasts, enhancing video content with rich layers of textual information derived from the game audio using ASR. These textual commentaries, generated using the Whisper model and translated with Google Translate, extend the usefulness of the SoccerNet dataset in diverse applications such as enhanced action spotting, automatic caption generation, and game summarization. By incorporating textual data alongside visual and auditory content, SoccerNet-Echoes aims to serve as a comprehensive resource for the development of algorithms specialized in capturing the dynamics of soccer games. We detail the methods involved in the curation of this dataset and the integration of ASR. We also highlight the implications of a multimodal approach in sports analytics, and how the enriched dataset can support diverse applications, thus broadening the scope of research and development in the field of sports analytics.
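As a rough illustration of the transcription step described above, the following Python sketch runs Whisper on a broadcast audio track and prints time-stamped commentary segments that could then be aligned with SoccerNet's video timeline. This is a minimal sketch, not the authors' released pipeline: the model size ("large-v2") and the file name are placeholder assumptions, the openai-whisper package is assumed to be installed, and the subsequent Google Translate step mentioned in the abstract is omitted.

```python
# Minimal sketch of the transcription step, assuming the openai-whisper
# package. The model size and audio file name are illustrative, not
# taken from the paper.
import whisper

# Load a pretrained Whisper checkpoint (size is an assumption).
model = whisper.load_model("large-v2")

# Transcribe the game audio; Whisper also detects the spoken language,
# which would tell us whether a translation step is needed afterwards.
result = model.transcribe("match_audio.wav")

for segment in result["segments"]:
    # Each segment carries start/end timestamps and the transcribed text,
    # suitable for alignment with the broadcast video timeline.
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")
```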
Related papers
- A Toolchain for Comprehensive Audio/Video Analysis Using Deep Learning Based Multimodal Approach (A use case of riot or violent context detection) [40.20142677441689]
We present a toolchain for a comprehensive audio/video analysis by leveraging deep learning based multimodal approach.
By combining individual tasks and analyzing both audio & visual data extracted from input video, the toolchain offers various audio/video-based applications.
arXiv Detail & Related papers (2024-05-02T07:34:31Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To identify discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- Sports Video Analysis on Large-Scale Data [10.24207108909385]
This paper investigates automated machine description of sports videos.
We propose a novel large-scale NBA dataset for Sports Video Analysis (NSVA) with a focus on captioning.
arXiv Detail & Related papers (2022-08-09T16:59:24Z)
- Automated Audio Captioning: an Overview of Recent Progress and New Challenges [56.98522404673527]
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips.
We present a comprehensive review of the published contributions in automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets.
arXiv Detail & Related papers (2022-05-12T08:36:35Z)
- A Multi-stage deep architecture for summary generation of soccer videos [11.41978608521222]
We propose a method to generate the summary of a soccer match exploiting both the audio and the event metadata.
The results show that our method can detect the actions of the match, identify which of these actions should belong to the summary and then propose multiple candidate summaries.
arXiv Detail & Related papers (2022-05-02T07:26:35Z)
- MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration [46.19536568693307]
Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich.
We present MUGEN, a large-scale video-audio-text dataset collected using the open-source platform game CoinRun.
We sample 375K video clips (3.2s each) and collect text descriptions from human annotators.
arXiv Detail & Related papers (2022-04-17T17:59:09Z)
- SoccerNet-v2: A Dataset and Benchmarks for Holistic Understanding of Broadcast Soccer Videos [71.72665910128975]
SoccerNet-v2 is a novel large-scale corpus of manual annotations for the SoccerNet video dataset.
We release around 300k annotations within SoccerNet's 500 untrimmed broadcast soccer videos.
We extend current tasks in the realm of soccer to include action spotting and camera shot segmentation with boundary detection.
arXiv Detail & Related papers (2020-11-26T16:10:16Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)