Related papers: IDPro: Flexible Interactive Video Object Segmentation by ID-queried Concurrent Propagation

URL: http://arxiv.org/abs/2401.12480v3
Date: Fri, 07 Feb 2025 15:57:40 GMT
Title: IDPro: Flexible Interactive Video Object Segmentation by ID-queried Concurrent Propagation
Authors: Kexin Li, Tao Jiang, Zongxin Yang, Yi Yang, Yueting Zhuang, Jun Xiao
Abstract summary: We propose a framework that can accept multiple frames simultaneously and explore synergistic interaction across frames (SIAF). Our SwinB-SIAF achieves new state-of-the-art performance on DAVIS 2017 (89.6%, J&F@60). Our R50-SIAF is more than 3× faster than the state-of-the-art competitor under challenging multi-object scenarios.
Score: 66.94214242968967
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interactive Video Object Segmentation (iVOS) is a challenging task that requires real-time human-computer interaction. To improve the user experience, it is important to consider the user's input habits, segmentation quality, running time and memory consumption. However, existing methods compromise user experience with a single input mode and slow running speed. Specifically, these methods only allow the user to interact with one single frame, which limits the expression of the user's intent. To overcome these limitations and better align with people's usage habits, we propose a framework that can accept multiple frames simultaneously and explore synergistic interaction across frames (SIAF). Concretely, we design the Across-Frame Interaction (AFI) Module, which enables users to annotate different objects freely on multiple frames. The AFI module migrates scribble information among multiple interactive frames and generates multi-frame masks. Additionally, we employ the id-queried mechanism to process multiple objects in batches. Furthermore, for more efficient propagation and a lightweight model, we design a truncated re-propagation strategy to replace the previous multi-round fusion module, which employs an across-round memory that stores important interaction information. Our SwinB-SIAF achieves new state-of-the-art performance on DAVIS 2017 (89.6%, J&F@60). Moreover, our R50-SIAF is more than 3× faster than the state-of-the-art competitor under challenging multi-object scenarios.

Related papers

MAIS: Memory-Attention for Interactive Segmentation [0.8678845273264675]
Vision Transformer (ViT)-based models achieve state-of-the-art performance using user clicks and prior masks as prompts. Existing methods treat interactions as independent events, leading to redundant corrections and limited refinement gains. We address this by introducing a Memory-Attention mechanism for Interactive Segmentation (MAIS) that stores past user inputs and segmentation states, enabling temporal context integration.
arXiv Detail & Related papers (2025-05-12T12:48:27Z)
Framer: Interactive Frame Interpolation [73.06734414930227]
Framer targets producing smoothly transitioning frames between two images as per user creativity. Our approach supports customizing the transition process by tailoring the trajectory of some selected keypoints. It is noteworthy that our system also offers an "autopilot" mode, where we introduce a module to estimate the keypoints and the trajectory automatically.
arXiv Detail & Related papers (2024-10-24T17:59:51Z)
Training-Free Robust Interactive Video Object Segmentation [82.05906654403684]
We propose a training-free prompt tracking framework for interactive video object segmentation (I-PT). We jointly adopt sparse point and box tracking, filtering out unstable points and capturing object-wise information. Our framework has demonstrated robust zero-shot video segmentation results on popular VOS datasets.
arXiv Detail & Related papers (2024-06-08T14:25:57Z)
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer [58.95404214273222]
Most state-of-the-art instance segmentation methods rely on large amounts of pixel-precise ground-truth for training. We introduce a more efficient approach, called DynaMITe, in which we represent user interactions as spatio-temporal queries. Our architecture also alleviates any need to re-compute image features during refinement, and requires fewer interactions for segmenting multiple instances in a single image.
arXiv Detail & Related papers (2023-04-13T16:57:02Z)
Revisiting Click-based Interactive Video Object Segmentation [24.114405100879278]
CiVOS builds on decoupled modules reflecting user interaction and mask propagation. The approach is extensively evaluated on the popular interactive DAVIS dataset. The presented CiVOS pipeline achieves competitive results while requiring a lower user workload.
arXiv Detail & Related papers (2022-03-03T15:55:14Z)
Modular Interactive Video Object Segmentation: Interaction-to-Mask, Propagation and Difference-Aware Fusion [68.45737688496654]
We present a modular interactive VOS framework which decouples interaction-to-mask and mask propagation. We show that our method outperforms current state-of-the-art algorithms while requiring fewer frame interactions.
arXiv Detail & Related papers (2021-03-14T14:39:08Z)
Multi-Stage Fusion for One-Click Segmentation [20.00726292545008]
We propose a new multi-stage guidance framework for interactive segmentation. Our proposed framework has a negligible increase in parameter count compared to early-fusion frameworks.
arXiv Detail & Related papers (2020-10-19T17:07:40Z)
Memory Aggregation Networks for Efficient Interactive Video Object Segmentation [75.35173388837852]
Interactive video object segmentation (iVOS) aims at efficiently harvesting high-quality segmentation masks of the target object in a video with user interactions. Most previous state-of-the-art methods tackle iVOS with two independent networks for conducting user interaction and temporal propagation, respectively. We propose a unified framework, named Memory Aggregation Networks (MA-Net), to address the challenging iVOS task in a more efficient way.
arXiv Detail & Related papers (2020-03-30T07:25:26Z)
