CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation
- URL: http://arxiv.org/abs/2601.12076v1
- Date: Sat, 17 Jan 2026 14:52:46 GMT
- Title: CroBIM-V: Memory-Quality Controlled Remote Sensing Referring Video Object Segmentation
- Authors: H. Jiang, Y. Sun, Z. Dong, T. Liu, Y. Gu
- Abstract summary: This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM).
- Score: 0.3099118620919279
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remote sensing video referring object segmentation (RS-RVOS) is challenged by weak target saliency and severe visual information truncation in dynamic scenes, making it extremely difficult to maintain discriminative target representations during segmentation. Moreover, progress in this field is hindered by the absence of large-scale dedicated benchmarks, while existing models are often affected by biased initial memory construction that impairs accurate instance localization in complex scenarios, as well as indiscriminate memory accumulation that encodes noise from occlusions or misclassifications, leading to persistent error propagation. This paper advances RS-RVOS research through dual contributions in data and methodology. First, we construct RS-RVOS Bench, the first large-scale benchmark comprising 111 video sequences, about 25,000 frames, and 213,000 temporal referring annotations. Unlike common RVOS benchmarks where many expressions are written with access to the full video context, our dataset adopts a strict causality-aware annotation strategy in which linguistic references are generated solely from the target state in the initial frame. Second, we propose a memory-quality-aware online referring segmentation framework, termed Memory Quality Control with Segment Anything Model (MQC-SAM). MQC-SAM introduces a temporal motion consistency module for initial memory calibration, leveraging short-term motion trajectory priors to correct structural deviations and establish accurate memory anchoring. Furthermore, it incorporates a decoupled attention-based memory integration mechanism with dynamic quality assessment, selectively updating high-confidence semantic features while filtering unreliable information, thereby effectively preventing error accumulation and propagation. Extensive experiments on RS-RVOS Bench demonstrate that MQC-SAM achieves state-of-the-art performance.
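The paper's code is not reproduced here, but the memory-quality control idea at the heart of MQC-SAM can be illustrated with a minimal sketch: a memory bank that admits a frame only when a scalar quality score clears a threshold. The class name, the confidence-based scoring proxy, the threshold value, and the FIFO eviction below are all assumptions made for illustration, not details taken from the paper.

```python
import torch

class QualityGatedMemory:
    """Minimal sketch of a quality-controlled memory bank (illustrative).

    Frames are admitted only when a scalar quality score (here: mean
    foreground mask confidence) clears a threshold, so unreliable frames
    never contaminate the memory. All names and values are assumptions.
    """

    def __init__(self, capacity: int = 8, tau: float = 0.7):
        self.capacity = capacity
        self.tau = tau            # admission threshold (assumed value)
        self.features: list[torch.Tensor] = []

    def quality_score(self, mask_logits: torch.Tensor) -> float:
        # Proxy for segmentation quality: mean foreground confidence.
        probs = torch.sigmoid(mask_logits)
        fg = probs[probs > 0.5]
        return fg.mean().item() if fg.numel() > 0 else 0.0

    def update(self, frame_feat: torch.Tensor, mask_logits: torch.Tensor) -> bool:
        if self.quality_score(mask_logits) < self.tau:
            return False          # filtered: unreliable frame is discarded
        self.features.append(frame_feat)
        if len(self.features) > self.capacity:
            self.features.pop(0)  # drop the oldest entry (FIFO eviction)
        return True

    def read(self) -> torch.Tensor:
        # Stack stored features for cross-attention by a downstream decoder.
        return torch.stack(self.features)

# Toy usage: a confident frame is admitted, a noisy one is rejected.
mem = QualityGatedMemory()
feat = torch.randn(256)
confident = torch.full((64, 64), 4.0)   # logits, sigmoid ~ 0.98
noisy = torch.zeros(64, 64)             # logits, sigmoid = 0.5
print(mem.update(feat, confident))      # True
print(mem.update(feat, noisy))          # False
```

Gating the write path rather than the read path is what prevents error accumulation: a bad frame that never enters memory can never propagate.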
Related papers
- Fast SAM2 with Text-Driven Token Pruning [52.8350457627401]
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation. SAM2 pipelines propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object. We introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation.
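As a rough sketch of what text-driven token pruning can look like, the function below scores each visual token against a pooled text embedding and keeps only the top-scoring fraction before temporal propagation. The cosine-similarity criterion, the shapes, and the keep ratio are assumptions, not the paper's method.

```python
import torch

def prune_tokens(visual_tokens: torch.Tensor, text_emb: torch.Tensor, keep_ratio: float = 0.25):
    """Keep only the visual tokens most similar to the text embedding.

    visual_tokens: (N, D) image-encoder tokens
    text_emb:      (D,)   pooled referring-expression embedding
    """
    v = torch.nn.functional.normalize(visual_tokens, dim=-1)
    t = torch.nn.functional.normalize(text_emb, dim=-1)
    scores = v @ t                           # (N,) text relevance per token
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    idx = scores.topk(k).indices             # indices of the k most relevant tokens
    return visual_tokens[idx], idx

tokens = torch.randn(4096, 256)              # e.g. 64x64 patch tokens
text = torch.randn(256)
kept, idx = prune_tokens(tokens, text)
print(kept.shape)                            # torch.Size([1024, 256])
```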
arXiv Detail & Related papers (2025-12-24T18:59:05Z)
- SAMITE: Position Prompted SAM2 with Calibrated Memory for Visual Object Tracking [58.35852822355312]
Visual Object Tracking (VOT) is widely used in applications like autonomous driving to continuously track targets in videos. To address these issues, some methods propose to adapt the video foundation model SAM2 for VOT, where the tracking results of each frame are encoded as memory for conditioning the remaining frames in an autoregressive manner. We present SAMITE, a model built upon SAM2 with additional modules, to tackle these challenges.
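The autoregressive memory conditioning described above can be sketched as a simple loop; `encode`, `condition`, and `memory_encode` are hypothetical stand-ins for the image encoder, memory-conditioned mask decoder, and memory encoder, not SAM2's actual API.

```python
import torch

def track(frames, first_mask, encode, condition, memory_encode):
    """Skeleton of autoregressive memory conditioning (SAM2-style sketch)."""
    memory = [memory_encode(encode(frames[0]), first_mask)]
    masks = [first_mask]
    for frame in frames[1:]:
        feat = encode(frame)
        mask = condition(feat, torch.stack(memory))   # decode against memory
        memory.append(memory_encode(feat, mask))      # tracking result becomes memory
        masks.append(mask)
    return masks

# Toy stubs so the skeleton runs end to end.
enc = lambda f: f.mean(0)
cond = lambda feat, mem: (feat + mem.mean(0) > 0).float()
mem_enc = lambda feat, mask: feat * mask.mean()
frames = [torch.randn(3, 8, 8) for _ in range(4)]
out = track(frames, torch.ones(8, 8), enc, cond, mem_enc)
print(len(out))  # 4
```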
arXiv Detail & Related papers (2025-07-29T12:11:56Z)
- HQ-SMem: Video Segmentation and Tracking Using Memory Efficient Object Embedding With Selective Update and Self-Supervised Distillation Feedback [0.0]
We introduce HQ-SMem for High-Quality video segmentation and tracking using Smart Memory. Our approach incorporates three key innovations: (i) leveraging SAM with High-Quality masks (SAM-HQ) alongside appearance-based candidate selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) implementing a dynamic smart memory mechanism that selectively stores relevant key frames while discarding redundant ones; and (iii) dynamically updating the appearance model to effectively handle complex topological object variations and reduce drift throughout the video.
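A minimal sketch of innovation (ii), redundancy-aware key-frame storage: an embedding is admitted only if it is not too similar to anything already in memory. The cosine criterion and threshold are illustrative assumptions, not the paper's design.

```python
import torch

def maybe_store(memory: list, feat: torch.Tensor, sim_thresh: float = 0.9) -> bool:
    """Store a frame embedding only if it is sufficiently novel."""
    f = torch.nn.functional.normalize(feat, dim=-1)
    for m in memory:
        if torch.dot(f, torch.nn.functional.normalize(m, dim=-1)) > sim_thresh:
            return False          # redundant with an existing key frame
    memory.append(feat)
    return True

mem = []
a = torch.randn(128)
print(maybe_store(mem, a))            # True: first frame is always stored
print(maybe_store(mem, a * 1.01))     # False: near-duplicate is discarded
```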
arXiv Detail & Related papers (2025-07-25T03:28:05Z)
- MoSAM: Motion-Guided Segment Anything Model with Spatial-Temporal Memory Selection [21.22536962888316]
We present MoSAM, incorporating two key strategies to integrate object motion cues into the model and establish more reliable feature memory. MoSAM achieves state-of-the-art results compared with competing methods.
arXiv Detail & Related papers (2025-04-30T02:19:31Z)
- TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking [6.91631684487121]
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences.
We propose a novel memory-based approach that selectively stores critical features based on object motion and overlap awareness.
Our approach significantly improves over MOTRv2 on the DanceTrack test set, with gains of 2.0% in AssA and 2.1% in IDF1.
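A minimal sketch of a motion- and overlap-aware storage rule: keep a feature when the object has moved far enough, or when it looks unlike what memory already holds. The center-distance and IoU thresholds are assumed values for illustration only.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def should_store(box, last_box, stored_boxes, motion_thresh=20.0, iou_thresh=0.6):
    """Store when motion is large OR the state is not already covered."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    lx, ly = (last_box[0] + last_box[2]) / 2, (last_box[1] + last_box[3]) / 2
    moved = ((cx - lx) ** 2 + (cy - ly) ** 2) ** 0.5 > motion_thresh
    redundant = any(iou(box, s) > iou_thresh for s in stored_boxes)
    return moved or not redundant

print(should_store((0, 0, 10, 10), (50, 50, 60, 60), [(0, 0, 10, 10)]))  # True: large motion
print(should_store((0, 0, 10, 10), (1, 1, 11, 11), [(0, 0, 10, 10)]))    # False: static and redundant
```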
arXiv Detail & Related papers (2024-07-05T07:55:19Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
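The propagation idea can be sketched as follows: pick the frames whose automatically generated masks score highest, and use their features as references when segmenting every frame. The interface (`segment_fn`, the number of references, the quality scores) is an assumption for illustration.

```python
import torch

def segment_with_references(frame_feats, mask_quality, segment_fn, num_refs=2):
    """Propagate from the highest-quality reference frames to all frames.

    frame_feats: (T, D) per-frame features; mask_quality: (T,) scores;
    segment_fn(feat, refs) -> mask for one frame.
    """
    ref_idx = mask_quality.topk(num_refs).indices
    refs = frame_feats[ref_idx]                     # high-quality reference features
    return [segment_fn(f, refs) for f in frame_feats]

# Toy stub: "segmentation" is just similarity to the reference features.
feats = torch.randn(6, 32)
quality = torch.tensor([0.2, 0.9, 0.4, 0.8, 0.1, 0.3])
masks = segment_with_references(feats, quality, lambda f, r: (r @ f).sigmoid())
print(len(masks), masks[0].shape)                   # 6 torch.Size([2])
```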
arXiv Detail & Related papers (2024-03-28T13:32:49Z)
- Learning Quality-aware Dynamic Memory for Video Object Segmentation [32.06309833058726]
We propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame.
Our QDMN achieves new state-of-the-art performance on both DAVIS and YouTube-VOS benchmarks.
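One way to "evaluate the segmentation quality of each frame" is a small learned head that pools mask-weighted features and regresses a scalar score; the architecture below is an assumed design sketched for illustration, not QDMN's actual module.

```python
import torch
from torch import nn

class QualityHead(nn.Module):
    """Sketch of a mask-quality regressor: pool features over the
    predicted foreground and map them to a score in [0, 1] that
    decides whether the frame enters memory."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, feat: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); mask: (B, H, W) soft foreground probabilities.
        w = mask.flatten(-2).unsqueeze(-1)                              # (B, HW, 1)
        pooled = (feat.flatten(2).transpose(1, 2) * w).sum(1) / w.sum(1).clamp(min=1e-6)
        return self.mlp(pooled).sigmoid().squeeze(-1)                   # (B,) quality score

head = QualityHead()
score = head(torch.randn(2, 256, 16, 16), torch.rand(2, 16, 16))
print(score.shape)                                                      # torch.Size([2])
```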
arXiv Detail & Related papers (2022-07-16T12:18:04Z)
- Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes the spatio-temporal aggregation module (SAM) more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
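The constant-size property can be illustrated with a recurrent merge: instead of appending frames, the bank is folded together with each new embedding, so memory cost stays flat for arbitrarily long videos. The EMA merge below is a simple stand-in for the paper's learned update.

```python
import torch

def recurrent_update(memory: torch.Tensor, frame_feat: torch.Tensor, gamma: float = 0.8):
    """Constant-size memory update: merge, never append."""
    return gamma * memory + (1.0 - gamma) * frame_feat

memory = torch.zeros(8, 256)                  # fixed-size bank: 8 slots
for _ in range(1000):                         # 1000 frames, no memory growth
    memory = recurrent_update(memory, torch.randn(8, 256))
print(memory.shape)                           # torch.Size([8, 256]) - unchanged
```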
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
- Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
- Video Object Segmentation with Episodic Graph Memory Networks [198.74780033475724]
A graph memory network is developed to address the novel idea of "learning to update the segmentation model".
We exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges.
The proposed graph memory network yields a neat yet principled framework, which generalizes well to both one-shot and zero-shot video object segmentation tasks.
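A minimal sketch of reading from a fully connected graph memory: one round of message passing lets every node (stored frame) aggregate from all others via softmax feature affinities, and the query frame then attends over the updated nodes. This formulation is an assumed simplification, not the paper's exact update rule.

```python
import torch

def graph_memory_readout(nodes: torch.Tensor, query: torch.Tensor):
    """nodes: (N, D) per-frame embeddings; query: (D,) current frame."""
    d = nodes.shape[-1] ** 0.5
    affinity = torch.softmax(nodes @ nodes.T / d, dim=-1)   # edges: cross-frame correlations
    nodes = nodes + affinity @ nodes                        # one message-passing step
    attn = torch.softmax(nodes @ query / d, dim=0)          # query attends over memory
    return (attn.unsqueeze(-1) * nodes).sum(0)              # memory summary for the query

mem = torch.randn(5, 64)     # five stored frames as graph nodes
q = torch.randn(64)          # current-frame embedding
print(graph_memory_readout(mem, q).shape)                   # torch.Size([64])
```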
arXiv Detail & Related papers (2020-07-14T13:19:19Z)