Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
- URL: http://arxiv.org/abs/2501.04001v3
- Date: Mon, 03 Nov 2025 17:35:29 GMT
- Title: Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
- Authors: Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang
- Abstract summary: Sa2VA is a comprehensive, unified model for dense grounded understanding of both images and videos. It supports a wide range of image and video tasks, including referring segmentation and conversation. Sa2VA can be easily extended to various VLMs, including Qwen-VL and Intern-VL.
- Score: 126.02606196101259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with an advanced vision-language MLLM, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model performance. We also manually validate 2k video objects in the Ref-SAV dataset to benchmark referring video object segmentation in complex environments. Experiments show that Sa2VA achieves strong performance across multiple tasks, particularly in referring video object segmentation, highlighting its potential for complex real-world applications. In addition, Sa2VA can be easily extended to various VLMs, including Qwen-VL and Intern-VL, allowing it to keep pace with the rapid progress of current open-source VLMs. Code and models have been provided to the community.
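As context for the mechanism the abstract describes (the LLM emitting instruction tokens that prompt SAM-2's decoder), the sketch below assumes a LISA-style design in which a special [SEG] token's hidden state is projected into SAM-2's prompt space. All module names, sizes, and stand-in components are illustrative assumptions, not the released Sa2VA implementation.

```python
import torch
import torch.nn as nn

HIDDEN, PROMPT, SEG_ID, VOCAB = 64, 32, 7, 100  # toy sizes, not Sa2VA's

class ToyMLLM(nn.Module):
    """Stand-in VLM: embeds frame features and text tokens into one token space."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.vis_proj = nn.Linear(16, HIDDEN)
    def forward(self, frame_feats, tokens):
        vis = self.vis_proj(frame_feats)           # (B, T, HIDDEN) visual tokens
        txt = self.embed(tokens)                   # (B, L, HIDDEN) text tokens
        return torch.cat([vis, txt], dim=1)        # shared LLM token space

class ToySAM2Decoder(nn.Module):
    """Stand-in for SAM-2's mask decoder: prompt embedding -> mask logits."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(PROMPT, 8 * 8)       # tiny 8x8 "mask"
    def forward(self, prompt):
        return self.head(prompt).view(-1, 8, 8)

mllm, decoder = ToyMLLM(), ToySAM2Decoder()
seg_projector = nn.Linear(HIDDEN, PROMPT)          # [SEG] hidden state -> prompt

frame_feats = torch.randn(1, 4, 16)                # 4 pseudo frame tokens
tokens = torch.tensor([[5, 9, SEG_ID]])            # generated text ending in [SEG]
hidden = mllm(frame_feats, tokens)                 # (1, 4 + 3, HIDDEN)
seg_hidden = hidden[:, 4 + 2]                      # hidden state at the [SEG] slot
mask_logits = decoder(seg_projector(seg_hidden))   # (1, 8, 8) mask for the query
print(mask_logits.shape)
```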
Related papers
- LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding [23.207637210563504]
LiViBench is an omnimodal benchmark for interactive livestream videos. It features a diverse set of 24 tasks, highlighting the perceptual, reasoning, and livestream-specific challenges. We develop LiVi-LLM-7B, an MLLM with enhanced knowledge of interactive livestreams.
arXiv Detail & Related papers (2026-01-21T14:14:20Z) - TRANSPORTER: Transferring Visual Semantics from VLM Manifolds [56.749972238005604]
This paper introduces a logits-to-video (L2V) task alongside a model-independent approach, TRANSPORTER, to generate videos. TRANSPORTER learns an optimal transport coupling to VLM's high-semantic embedding spaces. In turn, logit scores define embedding directions for conditional video generation.
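For readers unfamiliar with the "optimal transport coupling" mentioned above: entropic OT between two point sets is typically solved with Sinkhorn iterations. The sketch below is a generic numpy Sinkhorn solver on synthetic data, not TRANSPORTER's actual algorithm; the source/target arrays merely stand in for logit-derived directions and VLM embedding anchors.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, iters=200):
    """Entropic OT: coupling P with row marginals a and column marginals b."""
    K = np.exp(-cost / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(iters):                  # alternating marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))               # stand-in: logit-derived directions
tgt = rng.normal(size=(7, 8))               # stand-in: VLM embedding anchors
cost = ((src[:, None] - tgt[None]) ** 2).sum(-1)
cost /= cost.max()                          # normalize for numerical stability
P = sinkhorn(cost, np.full(5, 1 / 5), np.full(7, 1 / 7))
print(P.sum(axis=1), P.sum(axis=0))         # both marginals come out ≈ uniform
```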
arXiv Detail & Related papers (2025-11-23T09:12:48Z) - The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA [57.26038712231443]
Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions. We propose Augmented and Selective Averaged Sa2VA (SaSaSa2VA) to address these issues. SaSaSa2VA achieves a $\mathcal{J}\&\mathcal{F}$ of 67.45, ranking first and surpassing the runner-up by 2.80 points.
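The $\mathcal{J}\&\mathcal{F}$ score reported here (and in several entries below) averages region similarity $\mathcal{J}$ (mask IoU) and contour accuracy $\mathcal{F}$ (a boundary F-measure). A minimal sketch follows; it uses a crude zero-tolerance boundary overlap for $\mathcal{F}$, whereas the official DAVIS protocol matches boundaries within a tolerance radius.

```python
import numpy as np

def region_j(pred, gt):
    """Region similarity J: intersection-over-union of binary masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def boundary(mask):
    """Mask pixels touching the background (4-neighbourhood)."""
    m = np.pad(mask, 1)
    interior = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                & m[1:-1, :-2] & m[1:-1, 2:])
    return mask & ~interior

def contour_f(pred, gt):
    """Boundary F-measure with zero matching tolerance (simplified)."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 or bg.sum() == 0:
        return float(bp.sum() == bg.sum())
    p = (bp & bg).sum() / bp.sum()           # boundary precision
    r = (bp & bg).sum() / bg.sum()           # boundary recall
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

pred = np.zeros((16, 16), bool); pred[4:12, 4:12] = True
gt = np.zeros((16, 16), bool);   gt[5:12, 4:12] = True
j, f = region_j(pred, gt), contour_f(pred, gt)
print(f"J={j:.3f} F={f:.3f} J&F={(j + f) / 2:.3f}")
```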
arXiv Detail & Related papers (2025-09-21T08:08:17Z) - Enhancing Sa2VA for Referent Video Object Segmentation: 2nd Solution for 7th LSVOS RVOS Track [11.068687286561177]
Referring video object segmentation (RVOS) aims to segment all objects in a video that match a given natural-language description. We present a training-free framework that substantially improves Sa2VA's performance on the RVOS task.
arXiv Detail & Related papers (2025-09-19T03:01:27Z) - Correspondence as Video: Test-Time Adaption on SAM2 for Reference Segmentation in the Wild [38.94246183524246]
We propose a novel approach that represents the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM).
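The pseudo-video idea above can be pictured as stacking the reference and target images into a two-frame clip, seeding frame 0 with the reference mask, and letting SAM2's video propagation carry it to the target frame. The sketch below illustrates only that idea, not CAV-SAM's test-time adaptation; the file names and the mask file are hypothetical, and the API names follow the public facebookresearch/sam2 repository and should be verified against your installed version.

```python
import os, shutil
import numpy as np
from sam2.build_sam import build_sam2_video_predictor  # facebookresearch/sam2

# Build a two-frame "pseudo video": frame 0 = reference, frame 1 = target.
# File names, config/checkpoint paths, and the mask file are hypothetical.
frames_dir = "pseudo_video"
os.makedirs(frames_dir, exist_ok=True)
shutil.copy("reference.jpg", os.path.join(frames_dir, "00000.jpg"))
shutil.copy("target.jpg", os.path.join(frames_dir, "00001.jpg"))

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")
state = predictor.init_state(video_path=frames_dir)

# Seed frame 0 with the known reference mask, then propagate forward.
ref_mask = np.load("reference_mask.npy")            # (H, W) bool object mask
predictor.add_new_mask(inference_state=state, frame_idx=0, obj_id=1, mask=ref_mask)

for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    if frame_idx == 1:                              # the "next frame" = target image
        target_mask = (mask_logits[0] > 0).cpu().numpy()
```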
arXiv Detail & Related papers (2025-08-11T08:42:49Z) - SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes [30.870903750545004]
We introduce a novel framework, termed SAM2-LOVE, which integrates textual, audio, and visual representations into a learnable token. Technically, our approach includes a multimodal fusion module aimed at improving the multimodal understanding of SAM2. We conduct extensive experiments demonstrating that SAM2-LOVE outperforms the SOTA by 8.5% in $\mathcal{J}\&\mathcal{F}$ on the Ref-AVS benchmark.
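The "learnable token" fusion described above can be pictured as a single learnable query cross-attending over concatenated text, audio, and visual features. A generic sketch with illustrative dimensions follows; it is not SAM2-LOVE's actual fusion module.

```python
import torch
import torch.nn as nn

class FusionToken(nn.Module):
    """One learnable query token cross-attends over concatenated modalities."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.token = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, text, audio, visual):
        ctx = torch.cat([text, audio, visual], dim=1)   # (B, Lt+La+Lv, dim)
        q = self.token.expand(ctx.size(0), -1, -1)
        fused, _ = self.attn(q, ctx, ctx)               # (B, 1, dim)
        return fused.squeeze(1)                         # prompt-like fused vector

fuser = FusionToken()
out = fuser(torch.randn(2, 5, 64), torch.randn(2, 3, 64), torch.randn(2, 9, 64))
print(out.shape)  # torch.Size([2, 64])
```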
arXiv Detail & Related papers (2025-06-02T11:36:25Z) - DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency [91.30252180093333]
We propose the Dual Consistency SAM (DC-SAM) method, based on prompt tuning, to adapt SAM and SAM2 for in-context segmentation.
Our key insight is to enhance the features of SAM's prompt encoder for segmentation by providing high-quality visual prompts.
Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2.
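As a rough picture of prompt tuning for in-context segmentation: a visual prompt can be derived from a support image-mask pair, e.g. by mask-pooling support features and refining them with a small learnable head. The sketch below shows only that generic recipe, not DC-SAM's dual-consistency design; all sizes are illustrative.

```python
import torch
import torch.nn as nn

def mask_pool(feats, mask):
    """feats: (B, C, H, W); mask: (B, 1, H, W) in {0,1} -> (B, C) prompt vector."""
    return (feats * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1.0)

# Small learnable head refines the pooled prompt (the "prompt tuning" part).
prompt_head = nn.Sequential(nn.Linear(32, 32), nn.GELU(), nn.Linear(32, 32))

support_feats = torch.randn(1, 32, 16, 16)          # frozen-encoder features
support_mask = torch.zeros(1, 1, 16, 16)
support_mask[..., 4:10, 4:10] = 1                   # support object region
visual_prompt = prompt_head(mask_pool(support_feats, support_mask))
print(visual_prompt.shape)                          # torch.Size([1, 32])
```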
arXiv Detail & Related papers (2025-04-16T13:41:59Z) - 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that a simple modification to the test-time inference method, applied to stronger MLLMs, leads to stronger results on MeViS.
In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z) - MPG-SAM 2: Adapting SAM 2 with Mask Priors and Global Context for Referring Video Object Segmentation [21.43947114468122]
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions.
The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks.
We propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges.
arXiv Detail & Related papers (2025-01-23T13:53:33Z) - Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM [28.64108439552772]
We introduce a large-scale synthetic dataset created from proprietary models. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed model achieves state-of-the-art results across various video tasks and shows impressive generalization.
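One simple way to picture dynamic visual token compression is pooling each frame's token grid down so a fixed total token budget is met regardless of clip length. The sketch below shows that generic idea with illustrative sizes; Dynamic-VLM's actual architecture differs.

```python
import torch
import torch.nn.functional as F

def compress_tokens(frame_tokens, budget):
    """frame_tokens: (T, H, W, C); keep roughly `budget` tokens in total."""
    T, H, W, C = frame_tokens.shape
    per_frame = max(1, budget // T)             # token allotment per frame
    side = max(1, int(per_frame ** 0.5))        # square pooled grid per frame
    x = frame_tokens.permute(0, 3, 1, 2)        # (T, C, H, W)
    x = F.adaptive_avg_pool2d(x, side)          # (T, C, side, side)
    return x.flatten(2).permute(0, 2, 1).reshape(T * side * side, C)

tokens = torch.randn(32, 24, 24, 64)            # 32 frames, 24x24 tokens each
print(compress_tokens(tokens, budget=2048).shape)  # torch.Size([2048, 64])
```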
arXiv Detail & Related papers (2024-12-12T18:20:41Z) - TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal content.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
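The thumbnail-and-sampling construction named in the title can be pictured as a coarse mosaic summarizing the whole clip plus a few full-resolution sampled frames. The sketch below illustrates that generic idea; the grid size and sampling counts are assumptions, not TS-LLaVA's exact recipe.

```python
import torch
import torch.nn.functional as F

def thumbnail_and_sample(frames, grid=2, n_sampled=2):
    """frames: (T, C, H, W) -> (thumbnail mosaic, sampled full-res frames)."""
    T, C, H, W = frames.shape
    idx = torch.linspace(0, T - 1, grid * grid).long()
    tiles = F.interpolate(frames[idx], size=(H // grid, W // grid))
    rows = [torch.cat(list(tiles[r * grid:(r + 1) * grid]), dim=2)
            for r in range(grid)]
    thumbnail = torch.cat(rows, dim=1)          # (C, H, W) mosaic of the clip
    sampled = frames[torch.linspace(0, T - 1, n_sampled).long()]
    return thumbnail, sampled

frames = torch.randn(16, 3, 224, 224)
thumb, samp = thumbnail_and_sample(frames)
print(thumb.shape, samp.shape)                  # (3, 224, 224) (2, 3, 224, 224)
```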
arXiv Detail & Related papers (2024-11-17T13:08:29Z) - VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos [58.765796160750504]
VideoGLaMM is a new model for fine-grained pixel-level grounding in videos based on user-provided textual inputs.
The architecture is trained to synchronize both spatial and temporal elements of video content with textual instructions.
Experimental results show that our model consistently outperforms existing approaches across all three tasks.
arXiv Detail & Related papers (2024-11-07T17:59:27Z) - Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
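Bidirectional semantic guidance between two images can be pictured as each image's tokens cross-attending to the other's during token extraction. The sketch below shows that generic pattern only, with illustrative sizes; it is not this paper's actual module.

```python
import torch
import torch.nn as nn

class BidirectionalGuidance(nn.Module):
    """Each image's tokens cross-attend to the other image's tokens."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.a2b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b2a = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, tok_a, tok_b):
        a_ctx, _ = self.a2b(tok_a, tok_b, tok_b)    # A queries B
        b_ctx, _ = self.b2a(tok_b, tok_a, tok_a)    # B queries A
        return tok_a + a_ctx, tok_b + b_ctx         # residual-guided tokens

guide = BidirectionalGuidance()
a, b = guide(torch.randn(1, 49, 64), torch.randn(1, 49, 64))
print(a.shape, b.shape)
```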
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - Video Object Segmentation via SAM 2: The 4th Solution for LSVOS Challenge VOS Track [28.52754012142431]
Segment Anything Model 2 (SAM 2) is a foundation model towards solving promptable visual segmentation in images and videos.
SAM 2 builds a data engine, which improves model and data via user interaction, to collect the largest video segmentation dataset to date.
Without fine-tuning on the training set, SAM 2 achieved 75.79 $\mathcal{J}\&\mathcal{F}$ on the test set and ranked 4th place in the 6th LSVOS Challenge VOS Track.
arXiv Detail & Related papers (2024-08-19T16:13:14Z) - VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large language model (MLLM). It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z) - Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [27.04277811443469]
Video-LLaVA learns from a mixed dataset of images and videos, mutually enhancing each other.
Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits.
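The title's "alignment before projection" means the image and video encoders already share an embedding space (Video-LLaVA uses LanguageBind encoders), so a single shared projector can map both modalities into the LLM token space. A minimal sketch with random stand-in encoders follows.

```python
import torch
import torch.nn as nn

DIM_VIS, DIM_LLM = 64, 128
image_encoder = nn.Linear(3 * 16 * 16, DIM_VIS)  # stand-in, pre-aligned space
video_encoder = nn.Linear(3 * 16 * 16, DIM_VIS)  # stand-in, same aligned space
shared_projector = nn.Linear(DIM_VIS, DIM_LLM)   # one projector for both

image = torch.randn(1, 3 * 16 * 16)
video_frames = torch.randn(8, 3 * 16 * 16)
img_tokens = shared_projector(image_encoder(image))         # (1, DIM_LLM)
vid_tokens = shared_projector(video_encoder(video_frames))  # (8, DIM_LLM)
print(img_tokens.shape, vid_tokens.shape)
```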
arXiv Detail & Related papers (2023-11-16T10:59:44Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
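A lightweight cross-modal MLP of this kind can be pictured as a small MLP projecting a sentence embedding of the referring expression into a few SAM-style sparse prompt tokens. The sketch below is a generic illustration with assumed dimensions, not RefSAM's released code.

```python
import torch
import torch.nn as nn

class CrossModalMLP(nn.Module):
    """Project a text embedding into a few SAM-style sparse prompt tokens."""
    def __init__(self, text_dim=512, prompt_dim=256, n_prompts=2):
        super().__init__()
        self.n_prompts, self.prompt_dim = n_prompts, prompt_dim
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, prompt_dim * n_prompts),
            nn.GELU(),
            nn.Linear(prompt_dim * n_prompts, prompt_dim * n_prompts),
        )
    def forward(self, text_emb):                    # (B, text_dim)
        out = self.mlp(text_emb)
        return out.view(-1, self.n_prompts, self.prompt_dim)

proj = CrossModalMLP()
print(proj(torch.randn(2, 512)).shape)              # torch.Size([2, 2, 256])
```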
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.