Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring
Sequential Events Detection for Dense Video Captioning
- URL: http://arxiv.org/abs/2006.07896v1
- Date: Sun, 14 Jun 2020 13:21:37 GMT
- Title: Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring
Sequential Events Detection for Dense Video Captioning
- Authors: Yuqing Song, Shizhe Chen, Yida Zhao, Qin Jin
- Abstract summary: We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in videos task, with a METEOR score of 9.894 on the challenge test set.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Detecting meaningful events in an untrimmed video is essential for dense
video captioning. In this work, we propose a novel and simple model for event
sequence generation and explore temporal relationships of the event sequence in
the video. The proposed model omits inefficient two-stage proposal generation
and directly generates event boundaries conditioned on bi-directional temporal
dependency in one pass. Experimental results show that the proposed event
sequence generation model can generate more accurate and diverse events within
a small number of proposals. For event captioning, we follow our previous work and
incorporate the intra-event captioning models into our pipeline system. The overall
system achieves state-of-the-art performance on the dense-captioning events in videos
task, with a METEOR score of 9.894 on the challenge test set.
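The abstract describes generating event boundaries in a single pass, conditioned on bi-directional temporal context, rather than through two-stage proposal generation. The sketch below is only an illustration of that general idea: a bi-directional encoder summarizes clip features and a single decoder pass autoregressively emits (start, end) boundaries. All module choices, names, and dimensions are assumptions for demonstration and are not taken from the paper.

```python
# Minimal sketch (not the authors' code) of one-pass event-sequence generation.
import torch
import torch.nn as nn

class EventSequenceGenerator(nn.Module):
    def __init__(self, feat_dim=500, hidden=512, max_events=10):
        super().__init__()
        self.max_events = max_events
        # Bi-directional encoder provides temporal context in both directions.
        self.encoder = nn.LSTM(feat_dim, hidden // 2, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTMCell(hidden + 2, hidden)   # previous boundary feeds back in
        self.boundary_head = nn.Linear(hidden, 2)        # predicts (start, end) in [0, 1]

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim) clip-level video features
        enc, _ = self.encoder(clip_feats)
        ctx = enc.mean(dim=1)                            # global video context
        h = torch.zeros(clip_feats.size(0), ctx.size(1), device=clip_feats.device)
        c = torch.zeros_like(h)
        prev = torch.zeros(clip_feats.size(0), 2, device=clip_feats.device)
        boundaries = []
        for _ in range(self.max_events):                 # one pass, no proposal re-ranking
            h, c = self.decoder(torch.cat([ctx, prev], dim=-1), (h, c))
            prev = torch.sigmoid(self.boundary_head(h))  # normalized (start, end)
            boundaries.append(prev)
        return torch.stack(boundaries, dim=1)            # (batch, max_events, 2)

if __name__ == "__main__":
    model = EventSequenceGenerator()
    video = torch.randn(2, 120, 500)                     # 2 videos, 120 clips each
    print(model(video).shape)                            # torch.Size([2, 10, 2])
```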
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the TAD pipeline and achieves remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z)
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Improving Event Definition Following For Zero-Shot Event Detection [66.27883872707523]
Existing approaches to zero-shot event detection usually train models on datasets annotated with known event types.
We aim to improve zero-shot event detection by training models to better follow event definitions.
arXiv Detail & Related papers (2024-03-05T01:46:50Z)
- Unifying Event Detection and Captioning as Sequence Generation via Pre-Training [53.613265415703815]
We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning.
Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
arXiv Detail & Related papers (2022-07-18T14:18:13Z)
- End-to-end Dense Video Captioning as Sequence Generation [83.90502354328679]
We show how to model the two subtasks of dense video captioning jointly as one sequence generation task.
Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks integrated into large-scale pre-trained models.
arXiv Detail & Related papers (2022-04-18T01:30:54Z)
- PILED: An Identify-and-Localize Framework for Few-Shot Event Detection [79.66042333016478]
In our study, we employ cloze prompts to elicit event-related knowledge from pretrained language models.
We minimize the number of type-specific parameters, enabling our model to quickly adapt to event detection tasks for new types.
arXiv Detail & Related papers (2022-02-15T18:01:39Z)
- Dense-Captioning Events in Videos: SYSU Submission to ActivityNet Challenge 2020 [8.462158729006715]
This report presents a brief description of our submission to the dense video captioning task of ActivityNet Challenge 2020.
Our approach achieves a 9.28 METEOR score on the test set.
arXiv Detail & Related papers (2020-06-21T02:38:59Z)