PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning
- URL: http://arxiv.org/abs/2506.16082v1
- Date: Thu, 19 Jun 2025 07:07:42 GMT
- Title: PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning
- Authors: Yizhe Li, Sanping Zhou, Zheng Qin, Le Wang,
- Abstract summary: We propose a novel dense video captioning framework, named PR-DETR, which injects the explicit position and relation prior into the detection transformer.<n>On the one hand, we first generate a set of position-anchored queries to provide the scene-specific position and semantic information about potential events as position prior.<n>On the other hand, we further design an event relation encoder to explicitly calculate the relationship between event boundaries as relation prior to guide the event interaction to improve the semantic coherence of the captions.
- Score: 23.4119982709261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense video captioning is a challenging task that aims to localize and caption multiple events in an untrimmed video. Recent studies mainly follow the transformer-based architecture to jointly perform the two sub-tasks, i.e., event localization and caption generation, in an end-to-end manner. Based on the general philosophy of detection transformer, these methods implicitly learn the event locations and event semantics, which requires a large amount of training data and limits the model's performance in practice. In this paper, we propose a novel dense video captioning framework, named PR-DETR, which injects the explicit position and relation prior into the detection transformer to improve the localization accuracy and caption quality, simultaneously. On the one hand, we first generate a set of position-anchored queries to provide the scene-specific position and semantic information about potential events as position prior, which serves as the initial event search regions to eliminate the implausible event proposals. On the other hand, we further design an event relation encoder to explicitly calculate the relationship between event boundaries as relation prior to guide the event interaction to improve the semantic coherence of the captions. Extensive ablation studies are conducted to verify the effectiveness of the position and relation prior. Experimental results also show the competitive performance of our method on ActivityNet Captions and YouCook2 datasets.
Related papers
- Learning Grounded Vision-Language Representation for Versatile
Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - Leveraging the Video-level Semantic Consistency of Event for
Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z) - Unifying Event Detection and Captioning as Sequence Generation via
Pre-Training [53.613265415703815]
We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning.
Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
arXiv Detail & Related papers (2022-07-18T14:18:13Z) - Learning Constraints and Descriptive Segmentation for Subevent Detection [74.48201657623218]
We propose an approach to learning and enforcing constraints that capture dependencies between subevent detection and EventSeg prediction.
We adopt Rectifier Networks for constraint learning and then convert the learned constraints to a regularization term in the loss function of the neural model.
arXiv Detail & Related papers (2021-09-13T20:50:37Z) - PcmNet: Position-Sensitive Context Modeling Network for Temporal Action
Localization [11.685362686431446]
We propose a temporal-position-sensitive context modeling approach to incorporate both positional and semantic information for more precise action localization.
We achieve state-of-the-art performance on both two challenging datasets, THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2021-03-09T07:34:01Z) - Fine-grained Iterative Attention Network for TemporalLanguage
Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video in-formation extraction.
We evaluate the proposed method on three challenging public benchmarks: Ac-tivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z) - Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring
Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.