Related papers: OW-VISCap: Open-World Video Instance Segmentation and Captioning

URL: http://arxiv.org/abs/2404.03657v1
Date: Thu, 4 Apr 2024 17:59:58 GMT
Title: OW-VISCap: Open-World Video Instance Segmentation and Captioning
Authors: Anwesa Choudhuri, Girish Chowdhary, Alexander G. Schwing
Abstract summary: We propose an approach to jointly segment, track, and caption previously seen or unseen objects in a video. We generate rich descriptive and object-centric captions for each detected object via a masked attention augmented LLM input. Our approach matches or surpasses state-of-the-art on three tasks.
Score: 95.6696714640357
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Open-world video instance segmentation is an important video understanding task. Yet most methods either operate in a closed-world setting, require an additional user-input, or use classic region-based proposals to identify never before seen objects. Further, these methods only assign a one-word label to detected objects, and don't generate rich object-centric descriptions. They also often suffer from highly overlapping predictions. To address these issues, we propose Open-World Video Instance Segmentation and Captioning (OW-VISCap), an approach to jointly segment, track, and caption previously seen or unseen objects in a video. For this, we introduce open-world object queries to discover never before seen objects without additional user-input. We generate rich and descriptive object-centric captions for each detected object via a masked attention augmented LLM input. We introduce an inter-query contrastive loss to ensure that the object queries differ from one another. Our generalized approach matches or surpasses state-of-the-art on three tasks: open-world video instance segmentation on the BURST dataset, dense video object captioning on the VidSTG dataset, and closed-world video instance segmentation on the OVIS dataset.
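The abstract does not spell out the form of the inter-query contrastive loss. As a rough illustration of the stated goal (object queries that differ from one another), the following is a minimal sketch of one plausible repulsion term, not the authors' formulation. It assumes each query is a plain list of floats and penalizes pairwise cosine similarity above a margin; the names `cosine`, `inter_query_repulsion`, and `margin` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def inter_query_repulsion(queries, margin=0.0):
    """Mean hinge penalty on pairwise cosine similarity above `margin`.

    Hypothetical stand-in for an inter-query contrastive term: the value is
    0.0 when no pair of queries is more similar than `margin`, and grows as
    queries collapse onto each other.
    """
    n = len(queries)
    if n < 2:
        return 0.0
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += max(0.0, cosine(queries[i], queries[j]) - margin)
            pairs += 1
    return total / pairs
```

Under these assumptions, two orthogonal queries incur no penalty, while two identical queries incur the maximum penalty for that pair, which is the behavior one would want from a term meant to reduce highly overlapping predictions.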

Related papers

InterRVOS: Interaction-aware Referring Video Object Segmentation [37.53744746544299]
Referring video object segmentation aims to segment the object in a video corresponding to a given natural language expression. In comprehensive video understanding, an object's role is often defined by its interactions with other entities. We introduce Interaction-aware referring video object segmentation, a new task that requires segmenting both actor and target entities involved in an interaction.
arXiv Detail & Related papers (2025-06-03T01:16:13Z)
CTRL-O: Language-Controllable Object-Centric Visual Representation Learning [30.218743514199016]
Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files". Current object-centric models learn representations based on their preconceived understanding of objects, without allowing user input to guide which objects are represented. We propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions.
arXiv Detail & Related papers (2025-03-27T17:53:50Z)
See It All: Contextualized Late Aggregation for 3D Dense Captioning [38.14179122810755]
3D dense captioning is a task to localize objects in a 3D scene and generate descriptive sentences for each object. Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components. We introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation.
arXiv Detail & Related papers (2024-08-14T16:19:18Z)
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding. Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
VISA: Reasoning Video Object Segmentation via Large Language Models [64.33167989521357]
We introduce a new task, Reasoning Video Object Segmentation (ReasonVOS). This task aims to generate a sequence of segmentation masks in response to implicit text queries that require complex reasoning abilities. We introduce VISA (Video-based large language Instructed Assistant) to tackle ReasonVOS.
arXiv Detail & Related papers (2024-07-16T02:29:29Z)
OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields with Fine-Grained Understanding [21.64446104872021]
We introduce OpenObj, an innovative approach to build open-vocabulary object-level Neural Radiance Fields with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object level. The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot semantic and retrieval tasks.
arXiv Detail & Related papers (2024-06-12T08:59:33Z)
1st Place Solution for MOSE Track in CVPR 2024 PVUW Workshop: Complex Video Object Segmentation [72.54357831350762]
We propose a semantic embedding video object segmentation model and use the salient features of objects as query representations. We trained our model on a large-scale video object segmentation dataset. Our model achieves first place (84.45%) on the test set of the Complex Video Object Segmentation Challenge.
arXiv Detail & Related papers (2024-06-07T03:13:46Z)
ClickVOS: Click Video Object Segmentation [29.20434078000283]
The Video Object Segmentation (VOS) task aims to segment objects in videos. To address the limitations of existing settings, we propose a setting named Click Video Object Segmentation (ClickVOS). ClickVOS segments objects of interest across the whole video according to a single click per object in the first frame.
arXiv Detail & Related papers (2024-03-10T08:37:37Z)
Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS). We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes. By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions [93.35942025232943]
We propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments. The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms.
arXiv Detail & Related papers (2023-08-16T17:58:34Z)
Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence. We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach. Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has minimum run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
Slot-VPS: Object-centric Representation Learning for Video Panoptic Segmentation [29.454785969084384]
Video Panoptic Segmentation (VPS) aims at assigning a class label to each pixel, uniquely segmenting and identifying all object instances consistently across all frames. We present Slot-VPS, the first end-to-end framework for this task. We encode all panoptic entities in a video, including instances and background semantics, with a unified representation called panoptic slots. The coherent-temporal object's information is retrieved and encoded into the panoptic slots by the proposed Video Panoptic Retriever, enabling it to localize, segment, differentiate, and associate objects in a unified manner.
arXiv Detail & Related papers (2021-12-16T15:12:22Z)
Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference. Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice. In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)
Unidentified Video Objects: A Benchmark for Dense, Open-World Segmentation [29.81399150391822]
We present UVO, a new benchmark for open-world class-agnostic object segmentation in videos. UVO provides approximately 8 times more videos compared with DAVIS, and 7 times more mask (instance) annotations per video compared with YouTube-VOS and YouTube-VIS. UVO is also more challenging as it includes many videos with crowded scenes and complex background motions.
arXiv Detail & Related papers (2021-04-10T06:16:25Z)
VideoClick: Video Object Segmentation with a Single Click [93.7733828038616]
We propose a bottom-up approach where, given a single click for each object in a video, we obtain the segmentation masks of these objects in the full video. In particular, we construct a correlation volume that assigns each pixel in a target frame to either one of the objects in the reference frame or the background. Results on this new CityscapesVideo dataset show that our approach outperforms all the baselines in this challenging setting.
arXiv Detail & Related papers (2021-01-16T23:07:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.