Large-scale Pre-training for Grounded Video Caption Generation
- URL: http://arxiv.org/abs/2503.10781v3
- Date: Tue, 09 Sep 2025 12:36:30 GMT
- Title: Large-scale Pre-training for Grounded Video Caption Generation
- Authors: Evangelos Kazakos, Cordelia Schmid, Josef Sivic
- Abstract summary: We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We present a large-scale automatic annotation method that aggregates frame-level captions grounded with bounding boxes into temporally dense and consistent annotations. We demonstrate that our approach achieves state-of-the-art results on the proposed iGround dataset, as well as on the VidSTG, ActivityNet-Entities, GroundingYouTube, and YouCook-Interactions datasets.
- Score: 67.74116645708892
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We introduce the following contributions. First, we present a large-scale automatic annotation method that aggregates frame-level captions grounded with bounding boxes into temporally dense and consistent annotations. We apply this approach on the HowTo100M dataset to construct a large-scale pre-training dataset, named HowToGround1M. We also introduce a Grounded Video Caption Generation model, dubbed GROVE, and pre-train the model on HowToGround1M. Second, we introduce iGround--a dataset of 3513 videos with manually annotated captions and dense spatio-temporally grounded bounding boxes. This allows us to measure progress on this challenging problem, as well as to fine-tune our model on this small-scale but high-quality data. Third, we demonstrate that our approach achieves state-of-the-art results on the proposed iGround dataset, as well as on the VidSTG, ActivityNet-Entities, GroundingYouTube, and YouCook-Interactions datasets. Our ablations demonstrate the importance of pre-training on our automatically annotated HowToGround1M dataset followed by fine-tuning on the manually annotated iGround dataset and validate the key technical contributions of our model. The dataset and code are available at https://ekazakos.github.io/grounded_video_caption_generation/.
Related papers
- VoCap: Video Object Captioning and Segmentation from Any Prompt [78.90048335805047]
VoCap is a flexible model that consumes a video and a prompt of various modalities. It addresses promptable video object segmentation, referring expression segmentation, and object captioning. Our model yields state-of-the-art results on referring expression video object segmentation.
arXiv Detail & Related papers (2025-08-29T17:43:58Z)
- VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion [9.465414294387507]
We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models. We introduce a large-scale video referring matting dataset with 10,000 videos.
arXiv Detail & Related papers (2025-03-11T06:12:35Z)
- GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning [20.210972863275924]
We introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset.
To better model multi-grained data, we introduce an Iterative Approximation Module (IAM) which embeds multi-grained videos and texts into a unified, low-dimensional semantic space.
We evaluate our work on three categories of video tasks across seven benchmark datasets, showcasing state-of-the-art or comparable performance.
arXiv Detail & Related papers (2024-12-10T17:50:53Z)
- Grounded Video Caption Generation [74.23767687855279]
We propose a new task, dataset and model for grounded video caption generation.
This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes.
We introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset.
arXiv Detail & Related papers (2024-11-12T06:44:24Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion [110.84357383258818]
We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation.
The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects.
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets.
arXiv Detail & Related papers (2023-06-07T17:57:45Z)
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Attention-guided Temporal Coherent Video Object Matting [78.82835351423383]
We propose a novel deep learning-based object matting method that can achieve temporally coherent matting results.
Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength.
We show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network.
arXiv Detail & Related papers (2021-05-24T17:34:57Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.