VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement
- URL: http://arxiv.org/abs/2312.04885v2
- Date: Fri, 8 Mar 2024 10:59:20 GMT
- Title: VISAGE: Video Instance Segmentation with Appearance-Guided Enhancement
- Authors: Hanjung Kim, Jaehyun Kang, Miran Heo, Sukjun Hwang, Seoung Wug Oh,
Seon Joo Kim
- Abstract summary: Online Video Instance Segmentation (VIS) methods have shown remarkable advancement with their powerful query-based detectors.
However, our observations demonstrate that these methods heavily rely on location information, which often causes incorrect associations between objects.
This paper shows that a key axis of object matching in trackers is appearance information, which becomes highly instructive when positional cues are insufficient for distinguishing identities.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, online Video Instance Segmentation (VIS) methods have shown
remarkable advancement with their powerful query-based detectors. Utilizing the
output queries of the detector at the frame-level, these methods achieve high
accuracy on challenging benchmarks. However, our observations demonstrate that
these methods heavily rely on location information, which often causes
incorrect associations between objects. This paper shows that a key axis of
object matching in trackers is appearance information, which becomes highly
instructive when positional cues are insufficient for distinguishing
identities. Therefore, we propose a simple yet powerful extension to object
decoders that explicitly extracts embeddings from backbone features and drives
queries to capture the appearances of objects, greatly enhancing instance
association accuracy. Furthermore, recognizing the
limitations of existing benchmarks in fully evaluating appearance awareness, we
have constructed a synthetic dataset to rigorously validate our method. By
effectively resolving the over-reliance on location information, we achieve
state-of-the-art results on YouTube-VIS 2019/2021 and Occluded VIS (OVIS). Code
is available at https://github.com/KimHanjung/VISAGE.
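Beyond the linked repository, a minimal sketch may help make the abstract's core idea concrete: per-instance appearance embeddings are pooled from backbone features under each predicted mask, and detections are then associated across frames by appearance similarity. This is an illustrative sketch under stated assumptions, not the authors' implementation; the helper names (pool_appearance, associate) and the Hungarian assignment step are hypothetical.

# Hedged sketch of appearance-guided instance association (not the official
# VISAGE code). Assumes PyTorch and SciPy; all names are illustrative.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def pool_appearance(feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Mask-average backbone features into one embedding per instance.

    feats: (C, H, W) backbone feature map for one frame.
    masks: (N, H, W) soft instance masks in [0, 1].
    Returns (N, C) L2-normalized appearance embeddings.
    """
    weights = masks.flatten(1)                           # (N, H*W)
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-6)
    emb = weights @ feats.flatten(1).t()                 # (N, C)
    return F.normalize(emb, dim=1)


def associate(prev_emb: torch.Tensor, curr_emb: torch.Tensor):
    """Match current detections to previous tracks by cosine similarity."""
    sim = curr_emb @ prev_emb.t()                        # (curr, prev) similarities
    cost = (-sim).detach().cpu().numpy()                 # negate to maximize
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))       # (curr_idx, prev_idx) pairs

In a complete online tracker, this appearance score would typically be fused with positional and query-similarity cues rather than replacing them; the paper's observation is that appearance becomes the decisive signal exactly when positional cues are ambiguous.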
Related papers
- Context-Aware Video Instance Segmentation [12.71520768233772]
We introduce Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association.
We propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding each instance with the core instance features to improve tracking accuracy.
We also introduce the Prototypical Cross-frame Contrastive (PCC) loss, which enforces consistency in object-level features across frames; a hedged sketch of such a loss appears after this list.
arXiv Detail & Related papers (2024-07-03T11:11:16Z)
- DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries [60.09774333024783]
We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries.
We also introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes DAQ's potential without any additional cost.
Experiments demonstrate that DVIS-DAQ achieves a new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
arXiv Detail & Related papers (2024-03-29T17:58:50Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Look Before You Match: Instance Understanding Matters in Video Object Segmentation [114.57723592870097]
In this paper, we argue that instance understanding matters in video object segmentation (VOS).
We present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
We employ well-learned object queries from the IS branch to inject instance-specific information into the query key, with which instance-augmented matching is further performed.
arXiv Detail & Related papers (2022-12-13T18:59:59Z)
- Learning to Detect Instance-level Salient Objects Using Complementary Image Labels [55.049347205603304]
We present the first weakly-supervised approach to the salient instance detection problem.
We propose a novel weakly-supervised network with three branches: a Saliency Detection Branch leveraging class consistency information to locate candidate objects; a Boundary Detection Branch exploiting class discrepancy information to delineate object boundaries; and a Centroid Detection Branch using subitizing information to detect salient instance centroids.
arXiv Detail & Related papers (2021-11-19T10:15:22Z)
- Weakly Supervised Video Salient Object Detection [79.51227350937721]
We present the first weakly supervised video salient object detection model based on relabeled "fixation guided scribble annotations".
An "Appearance-motion fusion module" and bidirectional ConvLSTM based framework are proposed to achieve effective multi-modal learning and long-term temporal context modeling.
arXiv Detail & Related papers (2021-04-06T09:48:38Z)
- Dense Relation Distillation with Context-aware Aggregation for Few-Shot Object Detection [18.04185751827619]
Few-shot object detection is challenging since fine-grained features of novel objects can easily be overlooked when only a few training examples are available.
We propose Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle the few-shot detection problem.
arXiv Detail & Related papers (2021-03-30T05:34:49Z)
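As referenced in the CAVIS entry above, a cross-frame contrastive objective can be sketched as follows. This is a generic InfoNCE-style formulation assumed for illustration; the actual PCC loss in CAVIS may differ in its prototype construction and details, and the function name and temperature are assumptions.

# Hedged sketch of a cross-frame contrastive loss in the spirit of PCC
# (generic InfoNCE; not the CAVIS formulation). Assumes PyTorch.
import torch
import torch.nn.functional as F


def cross_frame_contrastive(emb_t: torch.Tensor, emb_t1: torch.Tensor,
                            temperature: float = 0.1) -> torch.Tensor:
    """Pull matched instance embeddings together across frames, push others apart.

    emb_t, emb_t1: (N, D) instance embeddings from two frames, where row i in
    both tensors belongs to the same object identity.
    """
    emb_t = F.normalize(emb_t, dim=1)
    emb_t1 = F.normalize(emb_t1, dim=1)
    logits = emb_t @ emb_t1.t() / temperature            # (N, N) similarity logits
    targets = torch.arange(emb_t.size(0), device=emb_t.device)
    return F.cross_entropy(logits, targets)              # positives on the diagonal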