A Boosted Model Ensembling Approach to Ball Action Spotting in Videos:
The Runner-Up Solution to CVPR'23 SoccerNet Challenge
- URL: http://arxiv.org/abs/2306.05772v2
- Date: Mon, 12 Jun 2023 05:18:34 GMT
- Title: A Boosted Model Ensembling Approach to Ball Action Spotting in Videos:
The Runner-Up Solution to CVPR'23 SoccerNet Challenge
- Authors: Luping Wang, Hao Guo, Bin Liu
- Abstract summary: This report presents our solution to Ball Action Spotting in videos.
Our method reached second place in the CVPR'23 SoccerNet Challenge.
- Score: 13.784332796429556
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This technical report presents our solution to the Ball Action Spotting
task in videos, which placed second in the CVPR'23 SoccerNet Challenge. Details of
the challenge can be found at
https://www.soccer-net.org/tasks/ball-action-spotting. Our approach builds on
E2E-Spot, the baseline model provided by the competition organizer. We first
generated several variants of the E2E-Spot model to form a candidate model set. We
then proposed a strategy for selecting members from this set and assigning a weight
to each selected model, with the aim of boosting the performance of the resulting
ensemble; we therefore call our approach Boosted Model Ensembling (BME). Our code is
available at https://github.com/ZJLAB-AMMI/E2E-Spot-MBS.
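The abstract describes BME only at a high level: build a candidate set of E2E-Spot variants, then pick members and weights so that the weighted ensemble scores best on held-out data. The sketch below shows one plausible greedy realization of that idea; the function names, the stand-in metric, the weight grid, and the stopping rule are illustrative assumptions and are not taken from the released E2E-Spot-MBS code.

```python
import numpy as np

def eval_metric(probs, labels):
    """Stand-in for the challenge metric (the real one is a tight average-mAP).
    Here: mean probability assigned to the ground-truth class of each frame."""
    return float(probs[np.arange(len(labels)), labels].mean())

def boosted_model_ensembling(cand_probs, labels,
                             weight_grid=(0.5, 1.0, 2.0), max_members=5):
    """Greedily add (model, weight) pairs that improve the validation metric.

    cand_probs: dict mapping model name -> [num_frames, num_classes] array of
                per-frame class probabilities on the validation split.
    labels:     integer ground-truth class index per frame on that split.
    """
    selected = []          # chosen (name, weight) pairs
    weighted_sum = None    # running weighted sum of probabilities
    total_weight = 0.0
    best_score = -np.inf

    for _ in range(max_members):
        best_step = None
        for name, probs in cand_probs.items():
            for w in weight_grid:
                if weighted_sum is None:
                    mixed = probs
                else:
                    mixed = (weighted_sum + w * probs) / (total_weight + w)
                score = eval_metric(mixed, labels)
                if score > best_score:
                    best_score, best_step = score, (name, w, probs)
        if best_step is None:   # no candidate improves the current ensemble
            break
        name, w, probs = best_step
        weighted_sum = w * probs if weighted_sum is None else weighted_sum + w * probs
        total_weight += w
        selected.append((name, w))

    return selected, best_score

if __name__ == "__main__":
    # Toy usage with random "model outputs"; real inputs would be the per-frame
    # probabilities produced by each E2E-Spot variant on the validation videos.
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 3, size=100)
    cands = {f"variant_{i}": rng.dirichlet(np.ones(3), size=100) for i in range(4)}
    members, score = boosted_model_ensembling(cands, labels)
    print(members, round(score, 4))
```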
Related papers
- First Place Solution to the Multiple-choice Video QA Track of The Second Perception Test Challenge [4.075139470537149]
We present our first-place solution to the Multiple-choice Video Question Answering track of The Second Perception Test Challenge.
This competition posed a complex video understanding task, requiring models to accurately comprehend and answer questions about video content.
arXiv Detail & Related papers (2024-09-20T14:31:13Z)
- A Foundation Model for Soccer [0.0]
We propose a foundation model for soccer, which is able to predict subsequent actions in a soccer match from a given input sequence of actions.
As a proof of concept, we train a transformer architecture on three seasons of data from a professional soccer league.
arXiv Detail & Related papers (2024-07-18T15:42:08Z)
- Second Place Solution of WSDM2023 Toloka Visual Question Answering Challenge [9.915564470970049]
We present our solution for the WSDM2023 Toloka Visual Question Answering Challenge.
Inspired by the application of multimodal pre-trained models, we designed a three-stage solution.
Our team achieved a score of 76.342 on the final leaderboard, ranking second.
arXiv Detail & Related papers (2024-07-05T04:56:05Z)
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, requiring no additional data or training, while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
The UnIVAL model goes beyond two modalities and unifies text, images, video, and audio in a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z)
- GroundNLQ @ Ego4D Natural Language Queries Challenge 2023 [73.12670280220992]
To accurately ground natural language queries in a video, an effective egocentric feature extractor and a powerful grounding model are required.
We leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations.
In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module.
arXiv Detail & Related papers (2023-06-27T07:27:52Z)
- Zero-shot Visual Question Answering with Language Model Feedback [83.65140324876536]
We propose a language model guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA).
Our approach uses captions generated by a captioning model as the context of an answer prediction model, which is a pre-trained language model (PLM).
arXiv Detail & Related papers (2023-05-26T15:04:20Z)
- Deep Model Assembling [31.88606253639418]
This paper studies a divide-and-conquer strategy to train large models.
It divides a large model into smaller modules, trains them independently, and reassembles the trained modules to obtain the target model.
We introduce a global, shared meta model to implicitly link all the modules together.
This enables us to train highly compatible modules that collaborate effectively when they are assembled together.
arXiv Detail & Related papers (2022-12-08T08:04:06Z)
- Model Extraction Attack against Self-supervised Speech Models [52.81330435990717]
Self-supervised learning (SSL) speech models generate meaningful representations of given clips.
Model extraction attack (MEA) often refers to an adversary stealing the functionality of the victim model with only query access.
We study the MEA problem against SSL speech models with a small number of queries.
arXiv Detail & Related papers (2022-11-29T09:28:05Z)
- REST: REtrieve & Self-Train for generative action recognition [54.90704746573636]
We propose to adapt a pre-trained generative Vision & Language (V&L) Foundation Model for video/action recognition.
We show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting.
We introduce REST, a training framework consisting of two key components.
arXiv Detail & Related papers (2022-09-29T17:57:01Z)