The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
- URL: http://arxiv.org/abs/2509.16972v2
- Date: Mon, 20 Oct 2025 04:36:14 GMT
- Title: The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
- Authors: Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji
- Abstract summary: Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. SaSaSa2VA achieves a $\mathcal{J\&F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points.
- Score: 57.26038712231443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $\mathcal{J\&F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in Sa2VA repository: https://github.com/bytedance/Sa2VA.
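The abstract reports performance as $\mathcal{J\&F}$, the standard RVOS metric that averages region similarity $\mathcal{J}$ (mask IoU) with boundary accuracy $\mathcal{F}$ (a boundary F-measure), and attributes part of the gain to test-time ensembling. The sketch below, assuming NumPy/SciPy and per-frame binary masks, illustrates how these quantities are commonly computed and adds a simple per-pixel vote as an illustrative stand-in for mask-level ensembling; the function names, boundary tolerance, and voting rule are assumptions, not the official benchmark code or the paper's exact selective-averaging procedure.

```python
# Minimal sketch (not the official DAVIS/LSVOS evaluation code): approximate
# per-frame J (region IoU), F (boundary F-measure), their mean J&F, and a
# simple per-pixel vote as an illustrative stand-in for test-time ensembling.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(np.logical_and(pred, gt).sum() / union)


def boundary_f_measure(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """F: boundary precision/recall F-score with a small pixel tolerance (assumed)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pb = pred & ~binary_erosion(pred)             # boundary pixels of the prediction
    gb = gt & ~binary_erosion(gt)                 # boundary pixels of the ground truth
    pb_tol = binary_dilation(pb, iterations=tol)  # tolerance band around each boundary
    gb_tol = binary_dilation(gb, iterations=tol)
    precision = 1.0 if pb.sum() == 0 else (pb & gb_tol).sum() / pb.sum()
    recall = 1.0 if gb.sum() == 0 else (gb & pb_tol).sum() / gb.sum()
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom


def j_and_f(preds, gts) -> float:
    """Mean of J and F, averaged over all frames of one video."""
    j = np.mean([region_similarity(p, g) for p, g in zip(preds, gts)])
    f = np.mean([boundary_f_measure(p, g) for p, g in zip(preds, gts)])
    return float(j + f) / 2.0


def ensemble_masks(mask_runs, thresh: float = 0.5) -> np.ndarray:
    """Per-pixel vote over masks from several inference runs.

    `mask_runs` is a list of runs, each a (num_frames, H, W) array of binary
    masks. This majority vote only illustrates mask-level ensembling and is
    not the paper's selective-averaging rule.
    """
    stacked = np.stack([np.asarray(r, dtype=np.float32) for r in mask_runs])
    return stacked.mean(axis=0) >= thresh
```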
Related papers
- Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z) - 3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference [59.989927043461364]
We find that Sa2VA does not reach its full potential on referring video object segmentation tasks. We propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark.
arXiv Detail & Related papers (2025-09-23T14:38:25Z) - Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track [11.068687286561177]
Referring video object segmentation (RVOS) aims to segment all objects in a video that match a given natural language description. We present a training-free framework that substantially improves Sa2VA's performance on the RVOS task.
arXiv Detail & Related papers (2025-09-19T03:01:27Z) - Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching [57.4215496482743]
Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy. Experiments show FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
arXiv Detail & Related papers (2025-04-18T14:19:07Z) - 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z) - EdgeTAM: On-Device Track Anything Model [65.10032957471824]
Segment Anything Model (SAM) 2 further extends its capability from image to video inputs through a memory bank mechanism. We aim to make SAM 2 much more efficient so that it runs even on mobile devices while maintaining comparable performance. We propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost.
arXiv Detail & Related papers (2025-01-13T12:11:07Z) - Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos [110.3379755761583]
Sa2VA is a unified model for grounded understanding of both images and videos. It supports a wide range of image and video tasks, including referring segmentation and conversation. We show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly in referring video object segmentation.
arXiv Detail & Related papers (2025-01-07T18:58:54Z) - Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track [28.52754012142431]
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos.
SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th place for 6th LSVOS Challenge VOS Track.
arXiv Detail & Related papers (2024-08-19T16:13:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.