OW-VISCap: Open-World Video Instance Segmentation and Captioning
- URL: http://arxiv.org/abs/2404.03657v1
- Date: Thu, 4 Apr 2024 17:59:58 GMT
- Title: OW-VISCap: Open-World Video Instance Segmentation and Captioning
- Authors: Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing
- Abstract summary: We propose an approach to jointly segment, track, and caption previously seen or unseen objects in a video.
We generate rich and descriptive object-centric captions for each detected object via a masked-attention-augmented LLM input.
Our approach matches or surpasses state-of-the-art on three tasks.
- Score: 95.6696714640357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-world video instance segmentation is an important video understanding task. Yet most methods either operate in a closed-world setting, require additional user input, or use classic region-based proposals to identify never before seen objects. Further, these methods only assign a one-word label to detected objects, and don't generate rich object-centric descriptions. They also often suffer from highly overlapping predictions. To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video. For this, we introduce open-world object queries to discover never before seen objects without additional user input. We generate rich and descriptive object-centric captions for each detected object via a masked attention augmented LLM input. We introduce an inter-query contrastive loss to ensure that the object queries differ from one another. Our generalized approach matches or surpasses state-of-the-art on three tasks: open-world video instance segmentation on the BURST dataset, dense video object captioning on the VidSTG dataset, and closed-world video instance segmentation on the OVIS dataset.
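The abstract names two concrete mechanisms: a masked-attention-augmented LLM input that keeps each caption focused on its object, and an inter-query contrastive loss that keeps the object queries distinct. The following is a minimal, hypothetical PyTorch sketch of how such components can be realized; the function names, tensor shapes, and the softplus repulsion form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def masked_attention_pool(queries, feats, masks):
    """Pool frame features into one object-centric token per query,
    attending only to pixels inside that object's predicted mask.

    queries: (B, Q, C)  object queries
    feats:   (B, P, C)  flattened frame features (P = H*W)
    masks:   (B, Q, P)  binary masks, 1 where a pixel belongs to the object
    """
    scale = queries.shape[-1] ** 0.5
    logits = torch.einsum("bqc,bpc->bqp", queries, feats) / scale
    # Masked attention: pixels outside the object's mask are ignored.
    logits = logits.masked_fill(masks < 0.5, float("-inf"))
    attn = torch.softmax(logits, dim=-1)
    attn = torch.nan_to_num(attn)  # queries with empty masks would otherwise produce NaNs
    return torch.einsum("bqp,bpc->bqc", attn, feats)  # (B, Q, C) object-centric tokens

def inter_query_repulsion_loss(queries):
    """Keep object queries distinct by penalizing pairwise cosine similarity
    between different queries (one simple realization of an inter-query
    contrastive objective; the paper's exact formulation may differ).

    queries: (B, Q, C)
    """
    q = F.normalize(queries, dim=-1)
    sim = torch.einsum("bqc,bkc->bqk", q, q)  # (B, Q, Q) cosine similarities
    off_diag = ~torch.eye(q.shape[1], dtype=torch.bool, device=q.device)
    return F.softplus(sim[:, off_diag]).mean()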
Related papers
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z) - 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Segmentation Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z) - ClickVOS: Click Video Object Segmentation [29.20434078000283]
The Video Object Segmentation (VOS) task aims to segment objects in videos.
To address the limitations of existing settings, we propose the setting named Click Video Object Segmentation (ClickVOS).
ClickVOS segments objects of interest across the whole video according to a single click per object in the first frame.
arXiv Detail & Related papers (2024-03-10T08:37:37Z) - MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments.
The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation [29.454785969084384]
Video Panoptic Segmentation (VPS) aims at assigning a class label to each pixel, uniquely segmenting and identifying all object instances consistently across all frames.
We present Slot-VPS, the first end-to-end framework for this task.
We encode all panoptic entities in a video, including instances and background semantics, with a unified representation called panoptic slots.
Coherent spatio-temporal object information is retrieved and encoded into the panoptic slots by the proposed Video Panoptic Retriever, enabling it to localize, segment, differentiate, and associate objects in a unified manner.
arXiv Detail & Related papers (2021-12-16T15:12:22Z) - Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation [29.81399150391822]
We present UVO, a new benchmark for open-world class-agnostic object segmentation in videos.
UVO provides approximately 8 times more videos compared with DAVIS, and 7 times more mask (instance) annotations per video compared with YouTube-VOS and YouTube-VIS.
UVO is also more challenging as it includes many videos with crowded scenes and complex background motions.
arXiv Detail & Related papers (2021-04-10T06:16:25Z) - VideoClick: Video Object Segmentation with a Single Click [93.7733828038616]
We propose a bottom-up approach where, given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video.
In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background.
Results on the new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.
arXiv Detail & Related papers (2021-01-16T23:07:48Z)
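The VideoClick entry above describes a correlation volume that assigns each target-frame pixel to one of the reference-frame objects or to the background. Below is a minimal, hypothetical PyTorch sketch of that idea, assuming one pooled embedding per clicked object and a learned background embedding; the names and shapes are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def correlation_volume(target_feats, object_embeds, background_embed):
    """Score every target-frame pixel against each reference object and the background.

    target_feats:     (C, H, W) features of the target frame
    object_embeds:    (N, C)    one embedding per reference-frame object (e.g. pooled at its click)
    background_embed: (C,)      embedding representing "none of the objects"
    returns:          (N + 1, H, W) per-pixel assignment scores
    """
    C, H, W = target_feats.shape
    pixels = F.normalize(target_feats.reshape(C, -1), dim=0)                          # (C, H*W)
    protos = F.normalize(torch.cat([object_embeds, background_embed[None]]), dim=1)   # (N+1, C)
    scores = protos @ pixels                                                          # cosine similarities
    return scores.reshape(-1, H, W)

# Per-pixel labels: indices 0..N-1 are the clicked objects, index N is background.
# labels = correlation_volume(target_feats, object_embeds, background_embed).argmax(dim=0)
```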