Unifying Event Detection and Captioning as Sequence Generation via
Pre-Training
- URL: http://arxiv.org/abs/2207.08625v1
- Date: Mon, 18 Jul 2022 14:18:13 GMT
- Title: Unifying Event Detection and Captioning as Sequence Generation via
Pre-Training
- Authors: Qi Zhang and Yuqing Song and Qin Jin
- Abstract summary: We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning.
Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
- Score: 53.613265415703815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense video captioning aims to generate text descriptions for a
series of events in an untrimmed video; the task can be divided into two
sub-tasks: event detection and event captioning. Unlike previous works that
tackle the two sub-tasks separately, recent works have focused on enhancing the
inter-task association between the two sub-tasks. However, designing inter-task
interactions for event detection and captioning is not trivial due to the large
differences in their task-specific solutions. Moreover, previous event detection
methods normally ignore temporal dependencies between events, leading to event
redundancy or inconsistency problems. To tackle these two defects, in this
paper, we define event detection as a sequence generation task and propose a
unified pre-training and fine-tuning framework to naturally enhance the
inter-task association between event detection and captioning. Since the model
predicts each event with previous events as context, the inter-dependency
between events is fully exploited and thus our model can detect more diverse
and consistent events in the video. Experiments on the ActivityNet dataset show
that our model outperforms the state-of-the-art methods, and can be further
boosted when pre-trained on extra large-scale video-text data. Code is
available at https://github.com/QiQAng/UEDVC.
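
To make the sequence-generation formulation concrete, the sketch below shows one plausible realization: the video timeline is quantized into discrete bins, and a Transformer decoder emits each event's start and end tokens autoregressively, conditioned on the video features and on all previously generated events. The class names, dimensions, and tokenization scheme are illustrative assumptions, not the authors' implementation (see the repository above for the actual code).

```python
# Minimal sketch (assumed, not the authors' code): event detection as
# autoregressive sequence generation over quantized boundary tokens.
import torch
import torch.nn as nn

BINS = 100   # assumed temporal quantization of the video timeline
D = 512      # assumed hidden size

class EventSequenceDecoder(nn.Module):
    def __init__(self, num_layers: int = 2):
        super().__init__()
        self.vocab = BINS + 1  # boundary bins plus an end-of-sequence token
        self.embed = nn.Embedding(self.vocab, D)
        layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(D, self.vocab)

    def forward(self, video_feats, prev_tokens):
        # video_feats: (B, T, D) encoded clip features, used as decoder memory
        # prev_tokens: (B, L) boundary tokens generated so far
        x = self.embed(prev_tokens)
        n = x.size(1)
        causal = torch.triu(
            torch.full((n, n), float("-inf"), device=x.device), diagonal=1
        )
        h = self.decoder(x, video_feats, tgt_mask=causal)
        return self.head(h)  # (B, L, vocab) logits for the next boundary token

model = EventSequenceDecoder()
feats = torch.randn(2, 32, D)           # 2 videos, 32 clip features each
prev = torch.randint(0, BINS, (2, 4))   # two (start, end) pairs emitted so far
logits = model(feats, prev)             # scores for the next boundary token
```

Because decoding is autoregressive, every new boundary is scored in the context of the events already generated, which is what allows the model to suppress redundant or temporally inconsistent proposals.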
Related papers
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- Improving Event Definition Following For Zero-Shot Event Detection [66.27883872707523]
Existing approaches to zero-shot event detection usually train models on datasets annotated with known event types.
We aim to improve zero-shot event detection by training models to better follow event definitions.
arXiv Detail & Related papers (2024-03-05T01:46:50Z)
- Pretext Training Algorithms for Event Sequence Data [29.70078362944441]
This paper proposes a self-supervised pretext training framework tailored to event sequence data.
Our pretext tasks unlock foundational representations that are generalizable across different downstream tasks.
arXiv Detail & Related papers (2024-02-16T01:25:21Z)
- Semantic Pivoting Model for Effective Event Detection [19.205550116466604]
Event Detection aims to identify and classify mentions of event instances from unstructured articles.
Existing techniques for event detection only use homogeneous one-hot vectors to represent the event type classes, ignoring the fact that the semantic meaning of the types is important to the task.
We propose a Semantic Pivoting Model for Effective Event Detection (SPEED), which explicitly incorporates prior information during training and captures semantically meaningful correlations between input and events.
arXiv Detail & Related papers (2022-11-01T19:20:34Z)
- Learning Constraints and Descriptive Segmentation for Subevent Detection [74.48201657623218]
We propose an approach to learning and enforcing constraints that capture dependencies between subevent detection and EventSeg prediction.
We adopt Rectifier Networks for constraint learning and then convert the learned constraints to a regularization term in the loss function of the neural model; a minimal sketch of this constraint-to-regularizer idea appears after this list.
arXiv Detail & Related papers (2021-09-13T20:50:37Z)
- Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task, with a 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)
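
As a concrete illustration of the constraint-learning entry above (Learning Constraints and Descriptive Segmentation for Subevent Detection), the sketch below shows one way a learned linear constraint can be turned into a soft regularization term added to the task loss. The constraint weights, the hinge penalty, and all variable names are assumptions made for illustration, not the cited paper's implementation.

```python
# Assumed sketch: a learned linear constraint over two related predictions,
# converted into a hinge-style penalty added to the task loss.
import torch

def constraint_penalty(p_subevent, p_boundary, w, b):
    # A valid output is assumed to satisfy
    #   w[0] * p_subevent + w[1] * p_boundary + b >= 0;
    # violations are penalized with a ReLU hinge, so the constraint acts as
    # a soft regularizer rather than a hard filter.
    z = w[0] * p_subevent + w[1] * p_boundary + b
    return torch.relu(-z).mean()

task_loss = torch.tensor(0.7)    # placeholder for the model's task loss
p_sub = torch.rand(8)            # fake batch of subevent probabilities
p_bnd = torch.rand(8)            # fake batch of segment-boundary probabilities
w, b = torch.tensor([1.0, -1.0]), torch.tensor(0.2)  # e.g. learned beforehand
loss = task_loss + 0.1 * constraint_penalty(p_sub, p_bnd, w, b)
```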
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.