Propagating Semantic Labels in Video Data
- URL: http://arxiv.org/abs/2310.00783v1
- Date: Sun, 1 Oct 2023 20:32:26 GMT
- Title: Propagating Semantic Labels in Video Data
- Authors: David Balaban, Justin Medich, Pranay Gosar, Justin Hart
- Abstract summary: This work presents a method for performing segmentation for objects in video.
Once an object has been found in a frame of video, the segment can then be propagated to future frames.
The method works by combining SAM with Structure from Motion.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Semantic Segmentation combines two sub-tasks: the identification of
pixel-level image masks and the application of semantic labels to those masks.
Recently, so-called Foundation Models have been introduced; general models
trained on very large datasets which can be specialized and applied to more
specific tasks. One such model, the Segment Anything Model (SAM), performs
image segmentation. Semantic segmentation systems such as CLIPSeg and MaskRCNN
are trained on datasets of paired segments and semantic labels. Manual labeling
of custom data, however, is time-consuming. This work presents a method for
performing segmentation for objects in video. Once an object has been found in
a frame of video, the segment can then be propagated to future frames; thus
reducing manual annotation effort. The method works by combining SAM with
Structure from Motion (SfM). The video input to the system is first
reconstructed into 3D geometry using SfM. A frame of video is then segmented
using SAM. Segments identified by SAM are then projected onto the
reconstructed 3D geometry. In subsequent video frames, the labeled 3D geometry
is reprojected into the new perspective, allowing SAM to be invoked fewer
times. System performance is evaluated, including the contributions of the SAM
and SfM components. Performance is evaluated over three main metrics:
computation time, mask IOU with manual labels, and the number of tracking
losses. Results demonstrate that the system offers substantial computation-time
improvements over manual labeling when tracking objects across video frames, but
falls short on the accuracy metrics.
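To make the pipeline concrete, the sketch below illustrates the projection/reprojection step described above: SAM segments one keyframe, the masks are lifted onto the SfM point cloud, and the labeled points are reprojected into later frames. The `sam` and `sfm` interfaces here are illustrative stand-ins assumed for this sketch, not the authors' implementation.

```python
import numpy as np

def propagate_labels(frames, sfm, sam, key_index=0):
    """Segment one keyframe with SAM, then carry the labels forward using
    the SfM reconstruction (illustrative sketch, not the paper's code)."""
    # 1. Run SAM once on the keyframe (hypothetical API returning boolean masks).
    key_masks = sam.segment(frames[key_index])

    # 2. Lift each mask to 3D: keep the reconstructed points whose keyframe
    #    projections fall inside the mask.
    labeled_points = []
    obs = sfm.observations(key_index)          # hypothetical: .xyz (Nx3), .uv (Nx2 pixels)
    for label, mask in enumerate(key_masks):
        u, v = obs.uv[:, 0].astype(int), obs.uv[:, 1].astype(int)
        inside = mask[v, u]
        labeled_points.append((obs.xyz[inside], label))

    # 3. Reproject the labeled points into each later frame with that frame's
    #    camera, producing a sparse label image without calling SAM again.
    propagated = {}
    for i in range(key_index + 1, len(frames)):
        K, R, t = sfm.camera(i)                # intrinsics, rotation, translation (hypothetical)
        h, w = frames[i].shape[:2]
        label_img = np.full((h, w), -1, dtype=int)
        for xyz, label in labeled_points:
            cam = R @ xyz.T + t[:, None]       # world -> camera coordinates (3xN)
            uv = (K @ cam)[:2] / cam[2]        # perspective projection to pixels
            u, v = uv[0].round().astype(int), uv[1].round().astype(int)
            keep = (cam[2] > 0) & (0 <= u) & (u < w) & (0 <= v) & (v < h)
            label_img[v[keep], u[keep]] = label
        propagated[i] = label_img
    return key_masks, propagated
```

The reprojected labels are sparse points rather than dense masks; one plausible way to densify them is to use them as prompts for occasional SAM calls, consistent with the abstract's note that SAM is invoked fewer times rather than never.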
Related papers
- SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree [79.26409013413003]
We introduce SAM2Long, an improved training-free video object segmentation strategy.
It considers the segmentation uncertainty within each frame and chooses the video-level optimal results from multiple segmentation pathways.
SAM2Long achieves an average improvement of 3.0 points across all 24 head-to-head comparisons.
arXiv Detail & Related papers (2024-10-21T17:59:19Z) - Segment Any Mesh: Zero-shot Mesh Part Segmentation via Lifting Segment Anything 2 to 3D [1.6427658855248815]
We propose Segment Any Mesh (SAMesh), a novel zero-shot method for mesh part segmentation.
SAMesh operates in two phases: multimodal rendering and 2D-to-3D lifting.
We compare our method with a robust, well-evaluated shape analysis method, ShapeDiam, and show our method is comparable to or exceeds its performance.
arXiv Detail & Related papers (2024-08-24T22:05:04Z) - Moving Object Segmentation: All You Need Is SAM (and Flow) [82.78026782967959]
We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects.
In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt.
These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks.
arXiv Detail & Related papers (2024-04-18T17:59:53Z) - SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in
Videos by Prompt Denoising [37.216493829454706]
We explore the potential of applying the Segment Anything Model to track and segment objects in videos.
Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame.
To enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy. (A minimal sketch of this box-prompt loop appears after this list.)
arXiv Detail & Related papers (2024-03-07T03:52:59Z) - SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach.
Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations.
Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z) - A One Stop 3D Target Reconstruction and multilevel Segmentation Method [0.0]
We propose an open-source one-stop 3D target reconstruction and multilevel segmentation framework (OSTRA).
OSTRA performs segmentation on 2D images, tracks multiple instances with segmentation labels in the image sequence, and then reconstructs labelled 3D objects or multiple parts with Multi-View Stereo (MVS) or RGBD-based 3D reconstruction methods.
Our method opens up a new avenue for reconstructing 3D targets embedded with rich multi-scale segmentation information in complex scenes.
arXiv Detail & Related papers (2023-08-14T07:12:31Z) - Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at the pixel-level granularity, instead of the typical region-proposals-based approach.
Our method provides competitive results on YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Prototypical Cross-Attention Networks for Multiple Object Tracking and
Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on YouTube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z) - Revisiting Sequence-to-Sequence Video Object Segmentation with
Multi-Task Loss and Skip-Memory [4.343892430915579]
Video Object Segmentation (VOS) is an active research area in the visual domain.
Current approaches lose objects in longer sequences, especially when the object is small or briefly occluded.
We build upon a sequence-to-sequence approach that employs an encoder-decoder architecture together with a memory module for exploiting the sequential data.
arXiv Detail & Related papers (2020-04-25T15:38:09Z)
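As a concrete illustration of the box-prompt propagation loop summarized in the SAM-PD entry above, the sketch below feeds the bounding box of each object's mask in the previous frame back into a SAM-style predictor as the prompt for the next frame. The `predictor` interface is assumed for illustration and is not taken from that paper.

```python
import numpy as np

def mask_to_box(mask):
    """Axis-aligned bounding box (x0, y0, x1, y1) of a boolean mask."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

def track_with_box_prompts(frames, init_masks, predictor):
    """Propagate each object's mask frame to frame by re-prompting a
    SAM-style predictor with the previous frame's box (illustrative only)."""
    tracks = {obj_id: [mask] for obj_id, mask in init_masks.items()}
    for frame in frames[1:]:
        predictor.set_image(frame)                    # assumed predictor API
        for obj_id, masks in tracks.items():
            box = mask_to_box(masks[-1])              # previous mask's box as the prompt
            masks.append(predictor.predict(box=box))  # assumed to return an HxW boolean mask
    return tracks
```

The multi-prompt strategy mentioned in the summary would presumably replace the single box with several perturbed boxes and keep the best-scoring mask; that refinement is omitted here.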
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.