XMem++: Production-level Video Segmentation From Few Annotated Frames
- URL: http://arxiv.org/abs/2307.15958v2
- Date: Tue, 15 Aug 2023 11:26:36 GMT
- Title: XMem++: Production-level Video Segmentation From Few Annotated Frames
- Authors: Maksym Bekuzarov, Ariana Bermudez, Joon-Young Lee, Hao Li
- Abstract summary: We introduce a novel semi-supervised video object segmentation (SSVOS) model, XMem++, that improves existing memory-based models.
Our method can extract highly consistent results while keeping the required number of frame annotations low.
We demonstrate SOTA performance on challenging (partial and multi-class) segmentation scenarios as well as long videos.
- Score: 32.68978079571079
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite advancements in user-guided video segmentation, extracting complex
objects consistently for highly complex scenes is still a labor-intensive task,
especially for production. It is not uncommon that a majority of frames need to
be annotated. We introduce a novel semi-supervised video object segmentation
(SSVOS) model, XMem++, that improves existing memory-based models, with a
permanent memory module. Most existing methods focus on single frame
annotations, while our approach can effectively handle multiple user-selected
frames with varying appearances of the same object or region. Our method can
extract highly consistent results while keeping the required number of frame
annotations low. We further introduce an iterative and attention-based frame
suggestion mechanism, which computes the next best frame for annotation. Our
method is real-time and does not require retraining after each user input. We
also introduce a new dataset, PUMaVOS, which covers new challenging use cases
not found in previous benchmarks. We demonstrate SOTA performance on
challenging (partial and multi-class) segmentation scenarios as well as long
videos, while ensuring significantly fewer frame annotations than any existing
method. Project page: https://max810.github.io/xmem2-project-page/
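The iterative, attention-based frame suggestion described in the abstract can be pictured with a small sketch. The snippet below is an illustrative assumption, not the authors' implementation: it scores each unannotated frame by how poorly it is reconstructed from attention over the already-annotated ("permanent memory") frames and proposes the most novel frame for the next annotation. The function names, feature shapes, and scoring rule are all hypothetical.

```python
# Minimal sketch (not the XMem++ implementation) of attention-based
# next-frame suggestion: pick the candidate frame whose features are least
# well explained by the frames the user has already annotated.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def suggest_next_frame(candidate_feats, memory_feats):
    """candidate_feats: (N, D) features of unannotated frames.
    memory_feats:      (M, D) features of user-annotated frames.
    Returns the index of the candidate frame to annotate next."""
    d = memory_feats.shape[1]
    # Attention of each candidate over the annotated (memory) frames.
    attn = softmax(candidate_feats @ memory_feats.T / np.sqrt(d), axis=1)
    # Reconstruct each candidate from memory; poorly reconstructed frames
    # are least covered by the existing annotations.
    recon = attn @ memory_feats                       # (N, D)
    novelty = np.linalg.norm(candidate_feats - recon, axis=1)
    return int(np.argmax(novelty))

# Toy usage: 10 candidate frames, 2 annotated frames, 64-dim features.
rng = np.random.default_rng(0)
cands, mem = rng.normal(size=(10, 64)), rng.normal(size=(2, 64))
print("next frame to annotate:", suggest_next_frame(cands, mem))
```

In an interactive setting such a step would be interleaved with mask propagation: the user annotates the suggested frame, it is added to the permanent memory, and the novelty scores are recomputed for the remaining frames.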
Related papers
- Enhancing Long Video Question Answering with Scene-Localized Frame Grouping [19.83545369186771]
Current Multimodal Large Language Models (MLLMs) often perform poorly in long video understanding. We propose a new scenario under the video question-answering task, SceneQA. We introduce a novel method called SLFG to combine individual frames into semantically coherent scene frames.
arXiv Detail & Related papers (2025-08-05T02:28:58Z) - Frame-Level Captions for Long Video Generation with Complex Multi Scenes [52.12699618126831]
We propose a novel way to annotate datasets at the frame level. This detailed guidance works with a Frame-Level Attention Mechanism to make sure text and video match precisely. Our training uses Diffusion Forcing to provide the model with the ability to handle time flexibly.
arXiv Detail & Related papers (2025-05-27T07:39:43Z) - ThinkVideo: High-Quality Reasoning Video Segmentation with Chain of Thoughts [64.93416171745693]
Reasoning Video Object Segmentation is a challenging task that generates a mask sequence from an input video and an implicit, complex text query. Existing works probe the problem by fine-tuning Multimodal Large Language Models (MLLMs) for segmentation-based output, yet still fall short in difficult cases on videos with temporally-sensitive queries. We propose ThinkVideo, a novel framework that leverages the zero-shot Chain-of-Thought (CoT) capability of MLLMs to address these challenges.
arXiv Detail & Related papers (2025-05-24T07:01:31Z) - VidCtx: Context-aware Video Question Answering with Image Models [15.1350316858766]
We introduce VidCtx, a novel training-free VideoQA framework which integrates both visual information from input frames and textual descriptions of other frames.
Experiments show that VidCtx achieves competitive performance among approaches that rely on open models.
arXiv Detail & Related papers (2024-12-23T09:26:38Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders
Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, which bottlenecks the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Multi-entity Video Transformers for Fine-Grained Video Representation
Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z) - Learning the What and How of Annotation in Video Object Segmentation [11.012995995497029]
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
The traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects in each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
arXiv Detail & Related papers (2023-11-08T00:56:31Z) - Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object.
We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS.
Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z) - Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely-used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z) - Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask propagation.
We propose a new method tailored for the per-clip inference.
arXiv Detail & Related papers (2022-08-03T09:02:29Z) - Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z) - Video Instance Segmentation using Inter-Frame Communication Transformers [28.539742250704695]
Recently, the per-clip pipeline shows superior performance over per-frame methods.
Previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communications.
We propose Inter-frame Communication Transformers (IFC), which significantly reduces the overhead for information-passing between frames.
arXiv Detail & Related papers (2021-06-07T02:08:39Z) - Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in
Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)