SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2
- URL: http://arxiv.org/abs/2601.11301v2
- Date: Mon, 19 Jan 2026 14:39:54 GMT
- Title: SAMannot: A Memory-Efficient, Local, Open-source Framework for Interactive Video Instance Segmentation based on SAM2
- Authors: Gergely Dinya, András Gelencsér, Krisztina Kupán, Clemens Küpper, Kristóf Karacs, Anna Gelencsér-Horváth
- Abstract summary: We present SAMannot, an open-source, local framework that integrates the Segment Anything Model 2 (SAM2) into a human-in-the-loop workflow. Key features include persistent instance identity management, an automated "lock-and-refine" workflow with barrier frames, and a mask-skeletonization-based auto-prompting mechanism.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current workflows for precise video segmentation force a compromise between labor-intensive manual curation, costly commercial platforms, and privacy-compromising cloud-based services: the demand for high-fidelity video instance segmentation in research is bottlenecked by manual annotation and by the privacy concerns of cloud-based tools. We present SAMannot, an open-source, local framework that integrates the Segment Anything Model 2 (SAM2) into a human-in-the-loop workflow. To address the high resource requirements of foundation models, we modified the SAM2 dependency and implemented a processing layer that minimizes computational overhead and maximizes throughput, ensuring a highly responsive user interface. Key features include persistent instance identity management, an automated "lock-and-refine" workflow with barrier frames, and a mask-skeletonization-based auto-prompting mechanism. SAMannot generates research-ready datasets in YOLO and PNG formats alongside structured interaction logs. Verified on animal-behavior-tracking use cases and on subsets of the LVOS and DAVIS benchmarks, the tool provides a scalable, private, and cost-effective alternative to commercial platforms for complex video annotation tasks.
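The mask-skeletonization-based auto-prompting is the most concrete mechanism named above, and it lends itself to a short illustration. The sketch below shows one plausible way to derive SAM2-style positive point prompts from a binary instance mask via its morphological skeleton; the function name, the even-spacing heuristic, and the use of scikit-image are assumptions for illustration, not SAMannot's actual implementation.

```python
# Hypothetical sketch of skeleton-based auto-prompting (not SAMannot's real API).
import numpy as np
from skimage.morphology import skeletonize  # assumes scikit-image is available

def sample_prompts_from_mask(mask: np.ndarray, n_points: int = 3) -> np.ndarray:
    """Sample up to n_points (x, y) positive click prompts from the skeleton
    of a binary instance mask."""
    skeleton = skeletonize(mask.astype(bool))
    ys, xs = np.nonzero(skeleton)
    if len(xs) == 0:  # degenerate mask: fall back to raw foreground pixels
        ys, xs = np.nonzero(mask)
        if len(xs) == 0:
            return np.empty((0, 2), dtype=int)
    # Spread the samples evenly along the list of skeleton pixels.
    idx = np.linspace(0, len(xs) - 1, num=min(n_points, len(xs)), dtype=int)
    return np.stack([xs[idx], ys[idx]], axis=1)  # (x, y) point prompts
```

Sampling from the skeleton rather than taking the mask centroid keeps prompts on the medial axis, so they remain inside thin, curved, or concave objects; prompts produced this way could then re-seed a locked instance when propagation resumes past a barrier frame.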
Related papers
- OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent [58.07447442040785]
We introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation. Results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales.
arXiv Detail & Related papers (2026-01-12T17:55:51Z) - SAM2Auto: Auto Annotation Using FLASH [13.638155035372835]
Vision-Language Models (VLMs) lag behind Large Language Models due to the scarcity of annotated datasets. We introduce SAM2Auto, the first fully automated annotation pipeline for video datasets, requiring no human intervention or dataset-specific training. Our system employs statistical approaches to minimize detection errors while ensuring consistent object tracking throughout entire video sequences.
arXiv Detail & Related papers (2025-06-09T15:15:15Z) - DPO Learning with LLMs-Judge Signal for Computer Use Agents [9.454381108993832]
Computer use agents (CUA) are systems that automatically interact with graphical user interfaces (GUIs) to complete tasks. We develop a lightweight vision-language model that runs entirely on local machines.
arXiv Detail & Related papers (2025-06-03T17:27:04Z) - SAMWISE: Infusing Wisdom in SAM2 for Text-Driven Video Segmentation [4.166500345728911]
Referring Video Object Segmentation (RVOS) relies on natural language expressions to segment an object in a video clip. We build upon the Segment-Anything 2 (SAM2) model, which provides robust segmentation and tracking capabilities. We introduce a novel adapter module that injects temporal information and multi-modal cues into the feature extraction process.
arXiv Detail & Related papers (2024-11-26T18:10:54Z) - Adapting Segment Anything Model for Unseen Object Instance Segmentation [70.60171342436092]
Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments.
We propose UOIS-SAM, a data-efficient solution for the UOIS task.
UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder.
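As a generic illustration of the heatmap-to-prompt idea (a sketch under assumed interfaces, not UOIS-SAM's actual HPG), class-agnostic point prompts can be read off a foreground heatmap by keeping only local maxima above a confidence threshold:

```python
# Generic heatmap-peak prompt extraction (illustrative; not the paper's HPG).
import torch
import torch.nn.functional as F

def heatmap_to_point_prompts(heatmap: torch.Tensor, thresh: float = 0.5,
                             window: int = 5) -> torch.Tensor:
    """Return (x, y) coordinates of local maxima in an (H, W) heatmap.
    A pixel is kept if it equals the maximum of its window x window
    neighborhood (max-pooling non-maximum suppression) and exceeds thresh."""
    h = heatmap[None, None]                 # -> (1, 1, H, W) for pooling
    pooled = F.max_pool2d(h, window, stride=1, padding=window // 2)
    peaks = (h == pooled) & (h > thresh)
    ys, xs = torch.nonzero(peaks[0, 0], as_tuple=True)
    return torch.stack([xs, ys], dim=1)     # one point prompt per peak
```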
arXiv Detail & Related papers (2024-09-23T19:05:50Z) - FocSAM: Delving Deeply into Focused Objects in Segmenting Anything [58.042354516491024]
The Segment Anything Model (SAM) marks a notable milestone in segmentation models.
We propose FocSAM with a pipeline redesigned on two pivotal aspects.
First, we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object.
Second, we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks.
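Dynamic ReLU in general replaces a fixed activation with an input-conditioned piecewise-linear function y = max_k(a_k(x) * x + b_k(x)); a pixel-wise variant predicts those coefficients per spatial location. The sketch below is a generic pixel-wise dynamic ReLU conditioned on a second feature map (standing in for click-derived interactive features); it is an illustrative assumption, not FocSAM's exact P-DyReLU.

```python
# Generic pixel-wise dynamic ReLU (illustrative; not FocSAM's exact module).
import torch
import torch.nn as nn

class PixelwiseDynamicReLU(nn.Module):
    """y = max_k (a_k * x + b_k), with per-pixel, per-channel coefficients
    a_k and b_k predicted from a conditioning feature map by a 1x1 conv."""
    def __init__(self, channels: int, cond_channels: int, k: int = 2):
        super().__init__()
        self.k = k
        self.coef = nn.Conv2d(cond_channels, 2 * k * channels, kernel_size=1)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        coefs = self.coef(cond).view(b, 2 * self.k, c, h, w)
        a = 1.0 + coefs[:, :self.k]   # slopes, initialized near identity
        bias = coefs[:, self.k:]     # intercepts
        # Max over the k candidate linear pieces yields the activation.
        return (a * x.unsqueeze(1) + bias).max(dim=1).values
```

If the coefficient convolution is zero-initialized, the module starts as the identity map, a common safe initialization for conditioned activations.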
arXiv Detail & Related papers (2024-05-29T02:34:13Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - MVSA-Net: Multi-View State-Action Recognition for Robust and Deployable Trajectory Generation [6.032808648673282]
The learn-from-observation (LfO) paradigm is a human-inspired mode for a robot to learn to perform a task simply by watching it being performed.
We present multi-view SA-Net, which generalizes the SA-Net model to allow the perception of multiple viewpoints of the task activity.
arXiv Detail & Related papers (2023-11-14T18:53:28Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage.
We show that FSNet outperforms other state-of-the-art methods on both VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z)