A2VIS: Amodal-Aware Approach to Video Instance Segmentation
- URL: http://arxiv.org/abs/2412.01147v2
- Date: Wed, 09 Apr 2025 21:26:06 GMT
- Title: A2VIS: Amodal-Aware Approach to Video Instance Segmentation
- Authors: Minh Tran, Thang Pham, Winston Bounsavy, Tri Nguyen, Ngan Le
- Abstract summary: We propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both the visible and occluded parts of objects in video.
- Score: 8.082593574401704
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Handling occlusion remains a significant challenge for video instance-level tasks like Multiple Object Tracking (MOT) and Video Instance Segmentation (VIS). In this paper, we propose a novel framework, Amodal-Aware Video Instance Segmentation (A2VIS), which incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in a video. The key intuition is that awareness of amodal segmentation across the spatiotemporal dimension enables a stable stream of object information. In scenarios where objects are partially or completely hidden from view, amodal segmentation offers more consistency and less dramatic changes along the temporal axis than visible segmentation. Hence, both amodal and visible information from all clips can be integrated into one global instance prototype. To effectively address the challenge of video amodal segmentation, we introduce the spatiotemporal-prior Amodal Mask Head, which leverages visible information within each clip (intra-clip) while extracting amodal characteristics across clips (inter-clip). Through extensive experiments and ablation studies, we show that A2VIS excels in both MOT and VIS tasks, identifying and tracking object instances with a keen understanding of their full shape.
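To make the mechanism concrete, below is a minimal PyTorch sketch of the two ideas the abstract names: fusing per-clip amodal queries into one global instance prototype, and an amodal mask head that attends to visible features within a clip and to amodal queries across clips. Every module, tensor shape, and name here is our illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AmodalMaskHead(nn.Module):
    """Sketch of a spatiotemporal amodal mask head (illustrative only)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # intra-clip: amodal queries read the clip's visible features
        self.intra_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # inter-clip: each instance's queries exchange information across clips
        self.inter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, amodal_q, visible_feats, pixel_feats):
        # amodal_q:      (clips, instances, dim) per-clip amodal queries
        # visible_feats: (clips, tokens, dim)    visible features per clip
        # pixel_feats:   (clips, dim, H, W)      decoder pixel features
        q, _ = self.intra_attn(amodal_q, visible_feats, visible_feats)
        q = q.transpose(0, 1)                  # (instances, clips, dim)
        q, _ = self.inter_attn(q, q, q)        # attend across clips
        proto = q.mean(dim=1)                  # one global prototype per instance
        emb = self.mask_embed(proto)
        # dot-product mask logits for every clip from the shared prototype
        return torch.einsum("nd,cdhw->cnhw", emb, pixel_feats), proto

head = AmodalMaskHead()
masks, proto = head(torch.randn(4, 10, 256),
                    torch.randn(4, 900, 256),
                    torch.randn(4, 256, 32, 32))
print(masks.shape)  # torch.Size([4, 10, 32, 32])
```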
Related papers
- Segment Anything, Even Occluded [35.150696061791805]
SAMEO is a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder.
We introduce Amodal-LVIS, a large-scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets.
Our results demonstrate that our approach, when trained on the newly extended dataset, achieves remarkable zero-shot performance on both COCOA-cls and D2SA benchmarks.
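A hedged sketch of the decoder-reuse idea in this summary: a SAM-style promptable mask decoder sits behind an arbitrary frontend detector and predicts amodal masks from its box prompts. The classes below are schematic stand-ins and do not use the real segment-anything API.

```python
import torch
import torch.nn as nn

class SamStyleAmodalDecoder(nn.Module):
    """Stand-in for a promptable mask decoder fine-tuned to emit amodal masks."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.box_embed = nn.Linear(4, dim)        # box prompts from any detector
        self.attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, image_feats, boxes):
        # image_feats: (B, tokens, dim); boxes: (B, instances, 4)
        q = self.box_embed(boxes)                 # prompt tokens
        q, _ = self.attn(q, image_feats, image_feats)
        emb = self.mask_embed(q)                  # (B, instances, dim)
        # masks would follow by dotting emb with pixel features, as in SAM
        return emb

decoder = SamStyleAmodalDecoder()
emb = decoder(torch.randn(2, 900, 256), torch.rand(2, 5, 4))
print(emb.shape)  # torch.Size([2, 5, 256])
```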
arXiv Detail & Related papers (2025-03-08T16:14:57Z) - Learning Motion and Temporal Cues for Unsupervised Video Object Segmentation [49.113131249753714]
We propose an efficient algorithm, termed MTNet, which concurrently exploits motion and temporal cues.
MTNet is devised by effectively merging appearance and motion features during the feature extraction process within encoders.
We employ a cascade of decoders across all feature levels to optimally exploit the derived features.
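The two design points in this summary lend themselves to a short sketch: level-wise fusion of appearance and motion features in the encoder, and a cascaded decoder that consumes every feature level. Channel sizes and layer choices are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseBlock(nn.Module):
    """Merge appearance and motion (flow) features at one encoder level."""
    def __init__(self, ch):
        super().__init__()
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, app, mot):
        return torch.relu(self.fuse(torch.cat([app, mot], dim=1)))

class CascadedDecoder(nn.Module):
    """Refine a mask logit map coarse-to-fine using all fused levels."""
    def __init__(self, chs=(256, 128, 64)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Conv2d(c, 1, 3, padding=1) for c in chs)

    def forward(self, feats):  # feats: deepest level first
        logit = self.heads[0](feats[0])
        for head, f in zip(self.heads[1:], feats[1:]):
            logit = F.interpolate(logit, size=f.shape[-2:],
                                  mode="bilinear", align_corners=False)
            logit = logit + head(f)   # each stage adds finer detail
        return logit

fused = FuseBlock(64)(torch.randn(1, 64, 64, 64), torch.randn(1, 64, 64, 64))
mask = CascadedDecoder()([torch.randn(1, 256, 16, 16),
                          torch.randn(1, 128, 32, 32), fused])
print(mask.shape)  # torch.Size([1, 1, 64, 64])
```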
arXiv Detail & Related papers (2025-01-14T03:15:46Z) - SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation [4.166500345728911]
Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip.
We build upon the Segment-Anything 2 (SAM2) model, that provides robust segmentation and tracking capabilities.
We introduce a novel adapter module that injects temporal information and multi-modal cues in the feature extraction process.
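A minimal sketch of what such an adapter could look like: a small module that mixes temporal context across frames and injects text cues into a frozen backbone's features. The interface and dimensions are illustrative assumptions, not SAMWISE's actual design.

```python
import torch
import torch.nn as nn

class TemporalTextAdapter(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, text):
        # feats: (frames, tokens, dim) visual tokens from a frozen backbone
        # text:  (1, words, dim) encoded referring expression
        frames = feats.shape[0]
        x = feats.transpose(0, 1)               # (tokens, frames, dim)
        x = x + self.temporal(x, x, x)[0]       # each token attends across frames
        x = x.transpose(0, 1)
        txt = text.expand(frames, -1, -1)
        x = x + self.text_attn(x, txt, txt)[0]  # inject multi-modal (text) cues
        return self.norm(x)

adapter = TemporalTextAdapter()
out = adapter(torch.randn(8, 196, 256), torch.randn(1, 12, 256))
print(out.shape)  # torch.Size([8, 196, 256])
```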
arXiv Detail & Related papers (2024-11-26T18:10:54Z) - Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT).
We jointly adopt sparse points and boxes tracking, filtering out unstable points and capturing object-wise information.
Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
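The stability filtering can be illustrated with a short, training-free snippet. The forward-backward consistency test used here is a common choice for identifying unstable tracks and is our assumption, not necessarily the paper's exact criterion.

```python
import torch

def filter_unstable_points(fwd_pts, bwd_pts, src_pts, thresh=2.0):
    """Keep points whose forward-then-backward track returns near the source.

    fwd_pts: (N, 2) points tracked to the next frame
    bwd_pts: (N, 2) those points tracked back to the current frame
    src_pts: (N, 2) original point prompts
    """
    err = (bwd_pts - src_pts).norm(dim=-1)
    keep = err < thresh
    return fwd_pts[keep], keep

def box_from_points(pts):
    """Object-wise box prompt from the surviving sparse points."""
    (x0, y0), (x1, y1) = pts.min(0).values, pts.max(0).values
    return torch.stack([x0, y0, x1, y1])

src = torch.rand(16, 2) * 100
fwd = src + torch.randn(16, 2)        # simulated forward tracks
bwd = src + torch.randn(16, 2) * 2    # simulated round-trip drift
stable, keep = filter_unstable_points(fwd, bwd, src)
print(int(keep.sum()), box_from_points(stable))
```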
arXiv Detail & Related papers (2024-06-08T14:25:57Z) - ShapeFormer: Shape Prior Visible-to-Amodal Transformer-based Amodal Instance Segmentation [11.51684042494713]
We introduce ShapeFormer, a Transformer-based model with a visible-to-amodal transition.
It facilitates the explicit relationship between output segmentations and avoids the need for amodal-to-visible transitions.
ShapeFormer comprises three key modules: (i) Visible-Occluding Mask Head for predicting visible segmentation with occlusion awareness, (ii) Shape-Prior Amodal Mask Head for predicting amodal and occluded masks, and (iii) Category-Specific Shape Prior Retriever to provide shape prior knowledge.
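A schematic sketch of how these three modules might compose: the visible head produces an occlusion-aware embedding, the retriever looks up the closest category-specific shape prior, and the amodal head conditions on both. The interfaces and the nearest-neighbor retrieval are assumptions on our part.

```python
import torch
import torch.nn as nn

class ShapePriorRetriever(nn.Module):
    """Return the stored shape prior closest to the visible mask embedding."""
    def __init__(self, num_categories=80, priors_per_cat=16, dim=64):
        super().__init__()
        self.bank = nn.Parameter(torch.randn(num_categories, priors_per_cat, dim),
                                 requires_grad=False)

    def forward(self, emb, category):
        priors = self.bank[category]          # (priors, dim) for this category
        idx = (priors @ emb).argmax()         # best-matching prior
        return priors[idx]

class VisibleToAmodal(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.visible_head = nn.Linear(dim, dim)   # occlusion-aware visible mask
        self.retriever = ShapePriorRetriever(dim=dim)
        self.amodal_head = nn.Linear(2 * dim, dim)

    def forward(self, roi_emb, category):
        vis = torch.relu(self.visible_head(roi_emb))
        prior = self.retriever(vis, category)
        # amodal prediction sees both the visible evidence and the shape prior
        return self.amodal_head(torch.cat([vis, prior], dim=-1))

model = VisibleToAmodal()
amodal = model(torch.randn(64), torch.tensor(3))
print(amodal.shape)  # torch.Size([64])
```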
arXiv Detail & Related papers (2024-03-18T00:03:48Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
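A small sketch of the sequence-level selection mechanism: score each flow-predicted mask by its temporal consistency with neighboring frames and keep the top scorers as exemplars. The IoU-based consistency score is our assumption, not necessarily the paper's criterion.

```python
import torch

def mask_iou(a, b):
    inter = (a & b).sum(dim=(-2, -1)).float()
    union = (a | b).sum(dim=(-2, -1)).float().clamp(min=1)
    return inter / union

def select_exemplars(masks, k=2):
    """masks: (T, H, W) boolean flow-based proposals for one object."""
    score = torch.zeros(len(masks))
    for t in range(len(masks) - 1):
        iou = mask_iou(masks[t], masks[t + 1])
        score[t] += iou
        score[t + 1] += iou        # frames consistent with neighbors score high
    return score.topk(k).indices   # indices of exemplar frames

masks = torch.rand(10, 32, 32) > 0.5
print(select_exemplars(masks))
```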
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Audio-Visual Instance Segmentation [14.10809424760213]
We propose a new multi-modal task, termed audio-visual instance segmentation (AVIS).
AVIS aims to simultaneously identify, segment and track individual sounding object instances in audible videos.
We introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos.
arXiv Detail & Related papers (2023-10-28T13:37:52Z) - Coarse-to-Fine Amodal Segmentation with Shape Prior [52.38348188589834]
Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object.
We propose a novel coarse-to-fine approach, C2F-Seg, which addresses this problem by progressively modeling the amodal segmentation.
arXiv Detail & Related papers (2023-08-31T15:56:29Z) - Self-supervised Amodal Video Object Segmentation [57.929357732733926]
Amodal perception requires inferring the full shape of an object that is partially occluded.
This paper develops a new framework for self-supervised amodal video object segmentation (SaVos).
arXiv Detail & Related papers (2022-10-23T14:09:35Z) - Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
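As a rough illustration of the two-branch design, the sketch below runs motion-guided and appearance-guided relation reasoning over object features and merges the branches into a grounding score. All dimensions and the fusion scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RelationBranch(nn.Module):
    """Self-attention over object features guided by one modality."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, objs, guide):
        # objs: (B, N, dim) object features; guide: (B, N, dim) motion or
        # appearance cues added as a guidance signal before relation reasoning
        x = objs + guide
        return self.attn(x, x, x)[0]

class MARNSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.motion = RelationBranch(dim)
        self.appearance = RelationBranch(dim)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, objs, motion_cues, appearance_cues):
        m = self.motion(objs, motion_cues)
        a = self.appearance(objs, appearance_cues)
        return self.score(torch.cat([m, a], dim=-1))  # per-object grounding score

model = MARNSketch()
s = model(torch.randn(2, 6, 256), torch.randn(2, 6, 256), torch.randn(2, 6, 256))
print(s.shape)  # torch.Size([2, 6, 1])
```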
arXiv Detail & Related papers (2022-01-03T02:44:18Z)