Dense-Captioning Events in Videos: SYSU Submission to ActivityNet
Challenge 2020
- URL: http://arxiv.org/abs/2006.11693v2
- Date: Wed, 12 Aug 2020 03:44:21 GMT
- Title: Dense-Captioning Events in Videos: SYSU Submission to ActivityNet
Challenge 2020
- Authors: Teng Wang, Huicheng Zheng, Mingjing Yu
- Abstract summary: This report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Our approach achieves a 9.28 METEOR score on the test set.
- Score: 8.462158729006715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report presents a brief description of our submission to the
dense video captioning task of ActivityNet Challenge 2020. Our approach follows
a two-stage pipeline: first, we extract a set of temporal event proposals; then
we propose a multi-event captioning model to capture the event-level temporal
relationships and effectively fuse the multi-modal information. Our approach
achieves a 9.28 METEOR score on the test set.
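For readers unfamiliar with the two-stage setup described above, the following is a minimal, purely illustrative Python sketch of the pipeline shape (temporal event proposals followed by multi-event captioning). All function names and the toy logic are placeholders for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EventProposal:
    start: float   # proposal start time in seconds
    end: float     # proposal end time in seconds
    score: float   # proposal confidence

def generate_proposals(video_duration: float, num_anchors: int = 4) -> List[EventProposal]:
    """Stage 1 (placeholder): emit evenly spaced anchor segments as event proposals.
    A real proposal module would score candidate segments from video features."""
    step = video_duration / num_anchors
    return [
        EventProposal(start=i * step, end=(i + 1) * step, score=round(1.0 - 0.1 * i, 2))
        for i in range(num_anchors)
    ]

def caption_events(proposals: List[EventProposal]) -> List[str]:
    """Stage 2 (placeholder): a real multi-event captioner would fuse visual and
    audio features and attend over the other events in the video; here each
    proposal is simply captioned with a note about its temporal context."""
    captions = []
    for i, p in enumerate(proposals):
        captions.append(
            f"[{p.start:.1f}s-{p.end:.1f}s] event {i + 1} of {len(proposals)} "
            f"(conf {p.score:.2f}): <caption generated from fused features>"
        )
    return captions

if __name__ == "__main__":
    proposals = generate_proposals(video_duration=120.0)
    for line in caption_events(proposals):
        print(line)
```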
Related papers
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z) - Perception Test 2023: A Summary of the First Challenge And Outcome [67.0525378209708]
The First Perception Test challenge was held as a half-day workshop alongside the IEEE/CVF International Conference on Computer Vision (ICCV) 2023.
The goal was to benchmark state-of-the-art video models on the recently proposed Perception Test benchmark.
We summarise in this report the task descriptions, metrics, baselines, and results.
arXiv Detail & Related papers (2023-12-20T15:12:27Z) - End-to-end Dense Video Captioning as Sequence Generation [83.90502354328679]
We show how to model the two subtasks of dense video captioning jointly as one sequence generation task.
Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks integrated into large-scale pre-trained models.
arXiv Detail & Related papers (2022-04-18T01:30:54Z) - Semantic-Aware Pretraining for Dense Video Captioning [54.61034574151816]
We present a semantic-aware pretraining method for dense video captioning, which empowers the learned features to recognize high-level semantic concepts.
Our final ensemble model achieves a 10.00 METEOR score on the test set.
arXiv Detail & Related papers (2022-04-13T06:57:23Z) - Joint Multimedia Event Extraction from Video and Article [51.159034070824056]
We propose the first approach to jointly extract events from video and text articles.
First, we propose the first self-supervised multimodal event coreference model.
Second, we introduce the first multimodal transformer which extracts structured event information jointly from both videos and text documents.
arXiv Detail & Related papers (2021-09-27T03:22:12Z) - Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring
Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with a 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z) - Temporal Fusion Network for Temporal Action Localization:Submission to
ActivityNet Challenge 2020 (Task E) [45.3218136336925]
This report analyzes the temporal action localization method we used in the HACS competition hosted within the ActivityNet Challenge 2020.
The goal of the task is to locate the start and end times of actions in untrimmed videos and to predict their action categories.
By fusing the results of multiple models (a toy score-fusion sketch follows this entry), our method obtains 40.55% mAP on the validation set and 40.53% on the test set, achieving Rank 1 in this challenge.
arXiv Detail & Related papers (2020-06-13T00:33:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.