3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference
- URL: http://arxiv.org/abs/2509.19082v1
- Date: Tue, 23 Sep 2025 14:38:25 GMT
- Title: 3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference
- Authors: Alexey Nekrasov, Ali Athar, Daan de Geus, Alexander Hermans, Bastian Leibe
- Abstract summary: We find that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark.
- Score: 59.989927043461364
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate these issues, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies them and improves the results. In fact, Sa2VA-i sets a new state of the art on multiple video benchmarks and achieves improvements of up to +11.6 J&F on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at https://github.com/kumuji/sa2va-i.
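The abstract attributes the gap to inconsistencies between training and inference but does not spell out what they are. As a purely illustrative sketch (not the actual Sa2VA-i implementation), the snippet below contrasts an inference pipeline that always feeds the model the first few frames of a clip with one that samples frames the same way a uniform training loader would; the function name, the 5-frame budget, and the uniform-sampling assumption are hypothetical.

```python
def select_frames(num_video_frames: int, num_model_frames: int = 5,
                  match_training: bool = True) -> list[int]:
    """Pick which frame indices to feed the grounding model at inference.

    Hypothetical illustration: if the training loader sampled frames
    uniformly over the whole clip, inference should do the same rather
    than always taking the first `num_model_frames` frames.
    """
    if num_video_frames <= num_model_frames:
        return list(range(num_video_frames))
    if match_training:
        # Uniform temporal sampling over the full clip, mirroring a
        # (hypothetical) uniform sampler used during training.
        step = (num_video_frames - 1) / (num_model_frames - 1)
        return [round(i * step) for i in range(num_model_frames)]
    # Mismatched alternative: only the clip's first frames, ignoring the
    # later parts of the video that the expression may refer to.
    return list(range(num_model_frames))


# Example: a 60-frame clip with a 5-frame budget.
print(select_frames(60))                        # [0, 15, 30, 44, 59]
print(select_frames(60, match_training=False))  # [0, 1, 2, 3, 4]
```

Whatever the concrete mismatch in Sa2VA turns out to be, the abstract indicates that the reported gains come from aligning such inference-time choices with the training setup, using the unchanged released checkpoints.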
Related papers
- The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA [57.26038712231443]
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions. Building on Sa2VA, we identify two key bottlenecks that limit segmentation performance. We propose Segmentation Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues.
arXiv Detail & Related papers (2025-09-21T08:08:17Z)
- Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track [11.068687286561177]
Referential Video Object Segmentation (RVOS) aims to segment all objects in a video that match a given natural language description. We present a training-free framework that substantially improves Sa2VA's performance on the RVOS task.
arXiv Detail & Related papers (2025-09-19T03:01:27Z)
- 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z)
- EdgeTAM: On-Device Track Anything Model [65.10032957471824]
Segment Anything Model (SAM) 2 further extends its capability from image to video inputs through a memory bank mechanism. We aim to make SAM 2 much more efficient so that it even runs on mobile devices while maintaining comparable performance. We propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost.
arXiv Detail & Related papers (2025-01-13T12:11:07Z)
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos [110.3379755761583]
Sa2VA is a unified model for grounded understanding of both images and videos. It supports a wide range of image and video tasks, including referring segmentation and conversation. We show that Sa2VA achieves state-of-the-art results across multiple tasks, particularly in referring video object segmentation.
arXiv Detail & Related papers (2025-01-07T18:58:54Z)
- TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models [32.6243916760583]
We present a framework for measuring two core capabilities of video comprehension: appearance and motion understanding.
We introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos.
Our model shows a 4.6%p improvement compared to V-JEPA (ViT-L) and a 7.7%p improvement compared to UMT (ViT-L).
arXiv Detail & Related papers (2024-08-21T03:56:27Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA).
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.