OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning
- URL: http://arxiv.org/abs/2404.03657v2
- Date: Mon, 09 Dec 2024 18:19:05 GMT
- Title: OW-VISCapTor: Abstractors for Open-World Video Instance Segmentation and Captioning
- Authors: Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing
- Abstract summary: We propose a new task, 'open-world video instance segmentation and captioning'.
It requires detecting, segmenting, tracking, and describing never-before-seen objects with rich captions.
We develop an object abstractor and an object-to-text abstractor.
- Score: 95.6696714640357
- License:
- Abstract: We propose the new task 'open-world video instance segmentation and captioning'. It requires detecting, segmenting, tracking, and describing never-before-seen objects with rich captions. This challenging task can be addressed by developing "abstractors" which connect a vision model and a language foundation model. Concretely, we connect a multi-scale visual feature extractor and a large language model (LLM) by developing an object abstractor and an object-to-text abstractor. The object abstractor, consisting of a prompt encoder and transformer blocks, introduces spatially-diverse open-world object queries to discover never-before-seen objects in videos. An inter-query contrastive loss further encourages the diversity of object queries. The object-to-text abstractor is augmented with masked cross-attention and acts as a bridge between the object queries and a frozen LLM to generate rich and descriptive object-centric captions for each detected object. Our generalized approach surpasses the baseline that jointly addresses the tasks of open-world video instance segmentation and dense video object captioning by 13% on never-before-seen objects, and by 10% on object-centric captions.
Related papers
- See It All: Contextualized Late Aggregation for 3D Dense Captioning [38.14179122810755]
3D dense captioning is the task of localizing objects in a 3D scene and generating a descriptive sentence for each object.
Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components.
We introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation.
arXiv Detail & Related papers (2024-08-14T16:19:18Z)
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet achieves strong performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS).
This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities.
We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z)
- OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding [21.64446104872021]
We introduce OpenObj, an innovative approach to building open-vocabulary object-level neural radiance fields with fine-grained understanding.
In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object level.
Results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic and retrieval tasks.
arXiv Detail & Related papers (2024-06-12T08:59:33Z)
- 1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
We trained our model on a large-scale video object segmentation dataset.
Our model achieves first place (84.45%) on the test set of the Complex Video Object Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating guidance from class semantic knowledge with supplementary visual structure cues from edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)