Large-scale Pre-training for Grounded Video Caption Generation
- URL: http://arxiv.org/abs/2503.10781v1
- Date: Thu, 13 Mar 2025 18:21:07 GMT
- Title: Large-scale Pre-training for Grounded Video Caption Generation
- Authors: Evangelos Kazakos, Cordelia Schmid, Josef Sivic
- Abstract summary: We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We present a large-scale automatic annotation method that aggregates captions grounded with bounding boxes across individual frames into temporally dense and consistent bounding box annotations. We introduce a new dataset, called iGround, of 3500 videos with manually annotated captions and dense spatio-temporally grounded bounding boxes.
- Score: 74.23767687855279
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We introduce the following contributions. First, we present a large-scale automatic annotation method that aggregates captions grounded with bounding boxes across individual frames into temporally dense and consistent bounding box annotations. We apply this approach on the HowTo100M dataset to construct a large-scale pre-training dataset, named HowToGround1M. We also introduce a Grounded Video Caption Generation model, dubbed GROVE, and pre-train the model on HowToGround1M. Second, we introduce a new dataset, called iGround, of 3500 videos with manually annotated captions and dense spatio-temporally grounded bounding boxes. This allows us to measure progress on this challenging problem, as well as to fine-tune our model on this small-scale but high-quality data. Third, we demonstrate that our approach achieves state-of-the-art results on the proposed iGround dataset compared to a number of baselines, as well as on the VidSTG and ActivityNet-Entities datasets. We perform extensive ablations that demonstrate the importance of pre-training using our automatically annotated HowToGround1M dataset followed by fine-tuning on the manually annotated iGround dataset and validate the key technical contributions of our model.
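To make the idea of aggregating per-frame grounded boxes into temporally consistent annotations concrete, the snippet below is a minimal illustrative sketch, not the paper's actual annotation pipeline used to build HowToGround1M: it greedily links each caption phrase's per-frame boxes into a track using an IoU check, with the threshold and the drop-inconsistent-boxes rule being assumptions.

```python
# Illustrative sketch only (hypothetical helper names): greedily link per-frame
# grounded boxes into temporally consistent tracks via an IoU check.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def link_boxes(frames: List[Dict[str, Box]], iou_thr: float = 0.5) -> Dict[str, List[Tuple[int, Box]]]:
    """`frames[t]` maps each grounded caption phrase to its box in frame t.
    Returns, per phrase, a list of (frame_index, box) forming a consistent track."""
    tracks: Dict[str, List[Tuple[int, Box]]] = {}
    for t, frame in enumerate(frames):
        for phrase, box in frame.items():
            track = tracks.setdefault(phrase, [])
            # Accept the box if it starts a track or overlaps the previous box enough;
            # otherwise drop it as temporally inconsistent (a simplification).
            if not track or iou(track[-1][1], box) >= iou_thr:
                track.append((t, box))
    return tracks
```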
Related papers
- VRMDiff: Text-Guided Video Referring Matting Generation of Diffusion [9.465414294387507]
We propose a new task, video referring matting, which obtains the alpha matte of a specified instance by inputting a referring caption. We treat the dense prediction task of matting as video generation, leveraging the text-to-video alignment prior of video diffusion models. We introduce a large-scale video referring matting dataset with 10,000 videos.
arXiv Detail & Related papers (2025-03-11T06:12:35Z)
- GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning [20.210972863275924]
We introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset.
To better model multi-grained data, we introduce an Iterative Approximation Module (IAM) which embeds multi-grained videos and texts into a unified, low-dimensional semantic space.
We evaluate our work on three categories of video tasks across seven benchmark datasets, showcasing state-of-the-art or comparable performance.
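As a rough illustration of mapping multi-grained video and text features into one shared semantic space (the actual Iterative Approximation Module is more involved), the sketch below simply gives each granularity its own projection head into a common low-dimensional, L2-normalized space; the names and dimensions are assumptions.

```python
# Rough illustration only (not GEXIA's IAM): one projection head per granularity
# into a shared, L2-normalized low-dimensional semantic space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedSpaceProjection(nn.Module):
    def __init__(self, input_dims: dict, shared_dim: int = 256):
        super().__init__()
        # One linear head per granularity, e.g. clip-level video, phrase-level text.
        self.heads = nn.ModuleDict({name: nn.Linear(d, shared_dim) for name, d in input_dims.items()})

    def forward(self, name: str, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.heads[name](x), dim=-1)


# Usage: compare a clip-level video feature with a phrase-level text feature.
proj = SharedSpaceProjection({"video_clip": 1024, "text_phrase": 768})
video = proj("video_clip", torch.randn(4, 1024))
text = proj("text_phrase", torch.randn(4, 768))
similarity = video @ text.T  # (4, 4) cosine similarities in the shared space
```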
arXiv Detail & Related papers (2024-12-10T17:50:53Z)
- Grounded Video Caption Generation [74.23767687855279]
We propose a new task, dataset and model for grounded video caption generation.
This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes.
We introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset.
arXiv Detail & Related papers (2024-11-12T06:44:24Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion [110.84357383258818]
We propose a novel approach to lift 2D segments to 3D and fuse them by means of a neural field representation.
The core of our approach is a slow-fast clustering objective function, which is scalable and well-suited for scenes with a large number of objects.
Our approach outperforms the state-of-the-art on challenging scenes from the ScanNet, Hypersim, and Replica datasets.
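The sketch below illustrates the general flavour of a slow-fast contrastive objective, in which a fast embedding network is trained against targets from a slowly EMA-updated copy of itself, using 2D instance labels as weak supervision; the temperature, momentum, and sampling details are assumptions, and this is not the exact objective used in Contrastive Lift.

```python
# Hedged sketch of a generic slow-fast contrastive objective (not the exact loss
# in Contrastive Lift): pull together embeddings of points sharing a 2D instance
# label, push apart the rest, with targets from an EMA-updated "slow" network.
import copy
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(slow: torch.nn.Module, fast: torch.nn.Module, m: float = 0.99) -> None:
    """Move the slow network's parameters towards the fast network's."""
    for ps, pf in zip(slow.parameters(), fast.parameters()):
        ps.mul_(m).add_(pf, alpha=1.0 - m)


def slow_fast_contrastive(fast_emb: torch.Tensor, slow_emb: torch.Tensor,
                          labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """fast_emb, slow_emb: (N, D) embeddings of the same sampled points;
    labels: (N,) per-point 2D instance ids used as weak supervision."""
    fast_emb = F.normalize(fast_emb, dim=-1)
    slow_emb = F.normalize(slow_emb, dim=-1).detach()
    logits = fast_emb @ slow_emb.T / tau                      # (N, N) similarities
    positives = labels.unsqueeze(0) == labels.unsqueeze(1)    # same instance id
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(log_prob * positives).sum(1).div(positives.sum(1).clamp(min=1)).mean()


# Typical usage: keep a frozen, EMA-updated copy of the embedding network.
fast_net = torch.nn.Linear(32, 16)
slow_net = copy.deepcopy(fast_net)
points = torch.randn(64, 32)
loss = slow_fast_contrastive(fast_net(points), slow_net(points), torch.randint(0, 5, (64,)))
loss.backward()
ema_update(slow_net, fast_net)
```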
arXiv Detail & Related papers (2023-06-07T17:57:45Z)
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
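As a hedged illustration of combining a per-frame box prior with temporal consistency (not the cited paper's method), the sketch below penalizes predicted foreground outside the box and disagreement between a frame's mask and a neighbouring frame's mask warped into it; the warping step, thresholds, and weighting are assumptions.

```python
# Hedged illustration (not the cited method): weak box supervision plus a
# temporal-consistency term for training a mask predictor from box annotations.
# The warped neighbouring prediction is assumed to come from e.g. optical flow.
import torch
import torch.nn.functional as F


def box_to_mask(box, height: int, width: int) -> torch.Tensor:
    """Binary (H, W) prior that is 1 inside the (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = [int(v) for v in box]
    prior = torch.zeros(height, width)
    prior[y1:y2, x1:x2] = 1.0
    return prior


def weak_mask_loss(pred_t: torch.Tensor, pred_next_warped: torch.Tensor,
                   box_prior: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """pred_t, pred_next_warped: (H, W) mask probabilities for frame t and for
    the next frame warped into frame t; box_prior: binary box mask for frame t."""
    outside = 1.0 - box_prior
    # No predicted foreground outside the box; cover a reasonable part of its inside.
    box_term = (pred_t * outside).mean() + F.relu(0.5 - (pred_t * box_prior).sum() / box_prior.sum())
    # Predictions of neighbouring frames, once aligned, should agree.
    temporal_term = F.l1_loss(pred_t, pred_next_warped)
    return box_term + lam * temporal_term


# Example with dummy predictions on a 64x64 frame and a box at (10, 10, 40, 40).
prior = box_to_mask((10, 10, 40, 40), 64, 64)
loss = weak_mask_loss(torch.rand(64, 64), torch.rand(64, 64), prior)
```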
arXiv Detail & Related papers (2021-01-06T18:56:24Z)