VoCap: Video Object Captioning and Segmentation from Any Prompt
- URL: http://arxiv.org/abs/2508.21809v1
- Date: Fri, 29 Aug 2025 17:43:58 GMT
- Title: VoCap: Video Object Captioning and Segmentation from Any Prompt
- Authors: Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid,
- Abstract summary: VoCap is a flexible video model that consumes a video and a prompt of various modalities (text, box, or mask). It addresses promptable video object segmentation, referring expression segmentation, and object captioning. Our model yields state-of-the-art results on referring expression video object segmentation.
- Score: 78.90048335805047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box, or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding this to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at https://github.com/google-deepmind/vocap.
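The mask-highlighting step behind SAV-Caption lends itself to a short illustration. Below is a minimal sketch, assuming a generic VLM client exposing a `generate(images=..., prompt=...)` method; the helper names (`MaskletSample`, `highlight_object`, `pseudo_caption`), the dimming factor, and the prompt wording are illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch of pseudo-caption annotation: dim the background around the
# ground-truth mask so the object stands out, then ask a VLM to describe it.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class MaskletSample:
    frames: List[np.ndarray]  # (H, W, 3) uint8 RGB frames
    masks: List[np.ndarray]   # (H, W) bool ground-truth masks, one per frame


def highlight_object(frame: np.ndarray, mask: np.ndarray,
                     dim: float = 0.4) -> np.ndarray:
    """Dim the background so the masked object is visually emphasized."""
    out = (frame * dim).astype(np.uint8)  # darken everything
    out[mask] = frame[mask]               # restore the object's pixels
    return out


def pseudo_caption(sample: MaskletSample, vlm) -> str:
    """Query a (hypothetical) VLM client for an object-centric caption."""
    highlighted = [highlight_object(f, m)
                   for f, m in zip(sample.frames, sample.masks)]
    prompt = ("Describe the highlighted object in this video: its category, "
              "appearance, and what it does. Ignore the dimmed background.")
    return vlm.generate(images=highlighted, prompt=prompt)
```

Applied over a large corpus such as SAV, this kind of pipeline yields the pseudo object captions used for training, with manual annotations reserved for the validation set.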
Related papers
- MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation [126.77662882743168]
We introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio. We benchmark 15 existing methods across 4 tasks supported by MeViS. We propose LMPM++, an approach for RVOS/AVOS/RMOT that achieves new state-of-the-art results.
arXiv Detail & Related papers (2025-12-11T18:59:44Z)
- MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos [53.837485338819334]
MaskCaptioner is an end-to-end model capable of jointly detecting, segmenting, tracking, and captioning object trajectories. The datasets and code are available at https://www.gabriel.fiastre.fr/masker/.
arXiv Detail & Related papers (2025-10-16T17:20:22Z)
- ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation [14.534308478766476]
We introduce ViCaS, a new dataset containing thousands of challenging videos. Our benchmark evaluates models on holistic/high-level understanding and language-guided, pixel-precise segmentation.
arXiv Detail & Related papers (2024-12-12T23:10:54Z)
- Grounded Video Caption Generation [74.23767687855279]
We propose a new task, dataset and model for grounded video caption generation.
This task unifies captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally consistent bounding boxes.
We introduce a new grounded video caption generation model, called VideoGround, and train the model on the new automatically annotated HowToGround dataset.
arXiv Detail & Related papers (2024-11-12T06:44:24Z)
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Segmentation challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z)
- SOVC: Subject-Oriented Video Captioning [59.04029220586337]
We propose a new video captioning task, Subject-Oriented Video Captioning (SOVC), which allows users to specify the target to describe via a bounding box.
To support this task, we construct two subject-oriented video captioning datasets based on two widely used video captioning datasets.
arXiv Detail & Related papers (2023-12-20T17:44:32Z)
- MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z)
- Event and Entity Extraction from Generated Video Captions [4.987670632802288]
We propose a framework to extract semantic metadata from automatically generated video captions.
As metadata, we consider entities, the entities' properties, relations between entities, and the video category.
We employ two state-of-the-art dense video captioning models to generate captions for videos of the ActivityNet Captions dataset.
arXiv Detail & Related papers (2022-11-05T22:06:50Z)