Towards Real-Time Open-Vocabulary Video Instance Segmentation
- URL: http://arxiv.org/abs/2412.04434v1
- Date: Thu, 05 Dec 2024 18:53:13 GMT
- Title: Towards Real-Time Open-Vocabulary Video Instance Segmentation
- Authors: Bin Yan, Martin Sundermeyer, David Joseph Tan, Huchuan Lu, Federico Tombari
- Abstract summary: We propose a new method for performing open-vocabulary video instance segmentation (OV-VIS) in real-time.
TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks.
These results demonstrate TROY-VIS's potential for real-time applications in dynamic environments such as mobile robotics and augmented reality.
- Score: 88.04508795121681
- Abstract: In this paper, we address the challenge of performing open-vocabulary video instance segmentation (OV-VIS) in real-time. We analyze the computational bottlenecks of state-of-the-art foundation models that perform OV-VIS, and propose a new method, TROY-VIS, that significantly improves processing speed while maintaining high accuracy. We introduce three key techniques: (1) Decoupled Attention Feature Enhancer to speed up information interaction between different modalities and scales; (2) Flash Embedding Memory for obtaining fast text embeddings of object categories; and (3) Kernel Interpolation for exploiting the temporal continuity in videos. Our experiments demonstrate that TROY-VIS achieves the best trade-off between accuracy and speed on two large-scale OV-VIS benchmarks, BURST and LV-VIS, running 20x faster than GLEE-Lite (25 FPS vs. 1.25 FPS) with comparable or even better accuracy. These results demonstrate TROY-VIS's potential for real-time applications in dynamic environments such as mobile robotics and augmented reality. Code and model will be released at https://github.com/google-research/troyvis.
Related papers
- DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z) - DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing.
Existing methods often underperform on complex and long videos in the real world, primarily due to two factors.
We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z) - Efficient Video Action Detection with Token Dropout and Context
Refinement [67.10895416008911]
We propose an end-to-end framework for efficient video action detection based on vanilla ViTs.
In a video clip, we maintain tokens from its keyframe while preserving tokens relevant to actor motions from other frames.
Second, we refine scene context by leveraging remaining tokens for better recognizing actor identities.
arXiv Detail & Related papers (2023-04-17T17:21:21Z) - Temporally Efficient Vision Transformer for Video Instance Segmentation [40.32376033054237]
We propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS).
TeViT is nearly convolution-free: it contains a transformer backbone and a query-based video instance segmentation head.
On three widely adopted VIS benchmarks, TeViT obtains state-of-the-art results and maintains high inference speed.
arXiv Detail & Related papers (2022-04-18T17:09:20Z) - Deformable VisTR: Spatio temporal deformable attention for video instance segmentation [79.76273774737555]
The video instance segmentation (VIS) task requires segmenting, classifying, and tracking object instances over all frames in a clip.
Recently, VisTR has been proposed as an end-to-end transformer-based VIS framework that demonstrates state-of-the-art performance.
We propose Deformable VisTR, leveraging a spatio-temporal deformable attention module that attends only to a small fixed set of key spatio-temporal sampling points.
arXiv Detail & Related papers (2022-03-12T02:27:14Z) - Crossover Learning for Fast Online Video Instance Segmentation [53.5613957875507]
We present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.
To our knowledge, CrossVIS achieves state-of-the-art performance among all online VIS methods and shows a decent trade-off between latency and accuracy.
arXiv Detail & Related papers (2021-04-13T06:47:40Z) - MVFNet: Multi-View Fusion Network for Efficient Video Recognition [79.92736306354576]
We introduce a multi-view fusion (MVF) module to exploit video complexity using separable convolution for efficiency.
MVFNet can be thought of as a generalized video modeling framework.
arXiv Detail & Related papers (2020-12-13T06:34:18Z) - Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
Without complicated bells and whistles, we achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%, respectively.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.