Related papers: FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation

URL: http://arxiv.org/abs/2303.08594v2
Date: Sat, 1 Apr 2023 17:55:21 GMT
Title: FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation
Authors: Junjie He, Pengyu Li, Yifeng Geng, Xuansong Xie
Abstract summary: We present FastInst, a query-based framework for real-time instance segmentation. FastInst can execute at a real-time speed (i.e., 32.5 FPS) while yielding an AP of more than 40. Experiments show that FastInst outperforms most state-of-the-art real-time counterparts.
Score: 17.551277435319083
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent attention in instance segmentation has focused on query-based models. Despite being non-maximum suppression (NMS)-free and end-to-end, the superiority of these models on high-accuracy real-time benchmarks has not been well demonstrated. In this paper, we show the strong potential of query-based models on efficient instance segmentation algorithm designs. We present FastInst, a simple, effective query-based framework for real-time instance segmentation. FastInst can execute at a real-time speed (i.e., 32.5 FPS) while yielding an AP of more than 40 (i.e., 40.5 AP) on COCO test-dev without bells and whistles. Specifically, FastInst follows the meta-architecture of recently introduced Mask2Former. Its key designs include instance activation-guided queries, dual-path update strategy, and ground truth mask-guided learning, which enable us to use lighter pixel decoders, fewer Transformer decoder layers, while achieving better performance. The experiments show that FastInst outperforms most state-of-the-art real-time counterparts, including strong fully convolutional baselines, in both speed and accuracy. Code can be found at https://github.com/junjiehe96/FastInst .

Related papers

REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders [52.61034140869803]
Region Network (REN) is a fast and effective model for generating region-based image representations using point prompts. REN bypasses this bottleneck using a lightweight module that directly generates region tokens. It uses a few cross-attention blocks that take point prompts as queries and features from a patch-based image encoder as keys and values to produce region tokens.
arXiv Detail & Related papers (2025-05-23T17:59:33Z)
KernelBench: Can LLMs Write Efficient GPU Kernels? [36.4117525096377]
KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels. We introduce a new evaluation metric, fast_p, which measures the percentage of generated kernels that are functionally correct and offer a speedup greater than an adjustable threshold p over the baseline. Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
arXiv Detail & Related papers (2025-02-14T19:30:53Z)
Efficient Temporal Action Segmentation via Boundary-aware Query Voting [51.92693641176378]
BaFormer is a boundary-aware Transformer network that tokenizes each video segment as an instance token. BaFormer significantly reduces the computational costs, utilizing only 6% of the running time.
arXiv Detail & Related papers (2024-05-25T00:44:13Z)
Sparse Instance Activation for Real-Time Instance Segmentation [72.23597664935684]
We propose a conceptually novel, efficient, and fully convolutional framework for real-time instance segmentation. SparseInst has extremely fast inference speed and achieves 40 FPS and 37.9 AP on the COCO benchmark.
arXiv Detail & Related papers (2022-03-24T03:15:39Z)
FastSeq: Make Sequence Generation Faster [20.920579109726024]
We develop the FastSeq framework to accelerate sequence generation without accuracy loss. Benchmark results on a set of widely used and diverse models demonstrate a 4-9x inference speed gain. FastSeq is easy to use with a simple one-line code change.
arXiv Detail & Related papers (2021-06-08T22:25:28Z)
When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable. To achieve better accuracy, we propose two lightweight modules. DQInit dynamically initializes the queries of the decoder from the inputs, enabling the model to achieve accuracy as good as that of models with multiple decoder layers. QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
QueryInst: Parallelly Supervised Mask Query for Instance Segmentation [53.5613957875507]
We present QueryInst, a query-based instance segmentation method driven by parallel supervision on dynamic mask heads. We conduct extensive experiments on three challenging benchmarks, i.e., COCO, CityScapes, and YouTube-VIS. QueryInst achieves the best performance among all online VIS approaches and strikes a decent speed-accuracy trade-off.
arXiv Detail & Related papers (2021-05-05T08:38:25Z)
Finding Fast Transformers: One-Shot Neural Architecture Search by Component Composition [11.6409723227448]
Transformer-based models have achieved state-of-the-art results in many tasks in natural language processing. We develop an efficient algorithm to search for fast models while maintaining model quality.
arXiv Detail & Related papers (2020-08-15T23:12:25Z)
Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two layers in CNNs can be converted to temporal bilinear modules by adding auxiliary-branch sampling. Our models can outperform most state-of-the-art methods on the Something-Something v1 and v2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time. Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting. We propose a novel VOS architecture consisting of two network components.
arXiv Detail & Related papers (2020-02-27T21:58:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.