The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
- URL: http://arxiv.org/abs/2509.16972v2
- Date: Mon, 20 Oct 2025 04:36:14 GMT
- Title: The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
- Authors: Quanzhu Niu, Dengxian Gong, Shihao Chen, Tao Zhang, Yikang Zhou, Haobo Yuan, Lu Qi, Xiangtai Li, Shunping Ji
- Abstract summary: Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. SaSaSa2VA achieves a $\mathcal{J\&F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points.
- Score: 57.26038712231443
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a $\mathcal{J\&F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in Sa2VA repository: https://github.com/bytedance/Sa2VA.
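The abstract reports performance as $\mathcal{J\&F}$, the standard RVOS metric that averages region similarity $\mathcal{J}$ (mask IoU) with boundary accuracy $\mathcal{F}$ (a boundary F-measure), and attributes part of the gain to test-time ensembling. The sketch below, assuming NumPy/SciPy and per-frame binary masks, illustrates how these quantities are commonly computed and adds a simple per-pixel vote as an illustrative stand-in for mask-level ensembling; the function names, boundary tolerance, and voting rule are assumptions, not the official benchmark code or the paper's exact selective-averaging procedure.

```python
# Minimal sketch (not the official DAVIS/LSVOS evaluation code): approximate
# per-frame J (region IoU), F (boundary F-measure), their mean J&F, and a
# simple per-pixel vote as an illustrative stand-in for test-time ensembling.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion


def region_similarity(pred: np.ndarray, gt: np.ndarray) -> float:
    """J: intersection-over-union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(np.logical_and(pred, gt).sum() / union)


def boundary_f_measure(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """F: boundary precision/recall F-score with a small pixel tolerance (assumed)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    pb = pred & ~binary_erosion(pred)             # boundary pixels of the prediction
    gb = gt & ~binary_erosion(gt)                 # boundary pixels of the ground truth
    pb_tol = binary_dilation(pb, iterations=tol)  # tolerance band around each boundary
    gb_tol = binary_dilation(gb, iterations=tol)
    precision = 1.0 if pb.sum() == 0 else (pb & gb_tol).sum() / pb.sum()
    recall = 1.0 if gb.sum() == 0 else (gb & pb_tol).sum() / gb.sum()
    denom = precision + recall
    return 0.0 if denom == 0 else 2 * precision * recall / denom


def j_and_f(preds, gts) -> float:
    """Mean of J and F, averaged over all frames of one video."""
    j = np.mean([region_similarity(p, g) for p, g in zip(preds, gts)])
    f = np.mean([boundary_f_measure(p, g) for p, g in zip(preds, gts)])
    return float(j + f) / 2.0


def ensemble_masks(mask_runs, thresh: float = 0.5) -> np.ndarray:
    """Per-pixel vote over masks from several inference runs.

    `mask_runs` is a list of runs, each a (num_frames, H, W) array of binary
    masks. This majority vote only illustrates mask-level ensembling and is
    not the paper's selective-averaging rule.
    """
    stacked = np.stack([np.asarray(r, dtype=np.float32) for r in mask_runs])
    return stacked.mean(axis=0) >= thresh
```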
Related papers
- Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z) - 3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference [59.989927043461364]
We find that Sa2VA does not reach its full potential on referring video object segmentation tasks. We propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark.
arXiv Detail & Related papers (2025-09-23T14:38:25Z) - Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track [11.068687286561177]
Referring video object segmentation (RVOS) aims to segment all objects in a video that match a given natural language description. We present a training-free framework that substantially improves Sa2VA's performance on the RVOS task.
arXiv Detail & Related papers (2025-09-19T03:01:27Z) - Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching [57.4215496482743]
Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy. Experiments show FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
arXiv Detail & Related papers (2025-04-18T14:19:07Z) - 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z) - EdgeTAM: On-Device Track Anything Model [65.10032957471824]
Segment Anything Model (SAM) 2 further extends its capability from image to video inputs through a memory bank mechanism. We aim to make SAM 2 much more efficient so that it runs even on mobile devices while maintaining comparable performance. We propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost.
arXiv Detail & Related papers (2025-01-13T12:11:07Z) - Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos [110.3379755761583]
Sa2VA is a unified model for grounded understanding of both images and videos. It supports a wide range of image and video tasks, including referring segmentation and conversation. We show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly in referring video object segmentation.
arXiv Detail & Related papers (2025-01-07T18:58:54Z) - Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track [28.52754012142431]
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos.
SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Without fine-tuning on the training set, SAM 2 achieved 75.79 J&F on the test set and ranked 4th place for 6th LSVOS Challenge VOS Track.
arXiv Detail & Related papers (2024-08-19T16:13:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.