The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation
- URL: http://arxiv.org/abs/2504.05178v1
- Date: Mon, 07 Apr 2025 15:24:54 GMT
- Title: The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation
- Authors: Hao Fang, Runmin Cong, Xiankai Lu, Zhiyang Chen, Wei Zhang
- Abstract summary: We propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation. Our solution achieved 61.98% J&F on the MeViS test set and ranked 1st place in the 4th PVUW Challenge MeViS Track at CVPR 2025.
- Score: 31.44879457190659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motion expression video segmentation is designed to segment objects in accordance with the input motion expressions. In contrast to the conventional Referring Video Object Segmentation (RVOS), it places emphasis on motion as well as multi-object expressions, making it more arduous. Recently, Large Multimodal Models (LMMs) have begun to shine in RVOS due to their powerful vision-language perception capabilities. In this work, we propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation. Firstly, we use Sa2VA as our baseline, which is a unified LMM for dense grounded understanding of both images and videos. Secondly, we uniformly sample the video frames during the inference process to enhance the model's understanding of the entire video. Finally, we integrate the results of multiple expert models to mitigate the erroneous predictions of a single model. Our solution achieved 61.98% J&F on the MeViS test set and ranked 1st place in the 4th PVUW Challenge MeViS Track at CVPR 2025.
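The two inference-time ideas described in the abstract, uniform frame sampling over the whole video and fusing the predictions of multiple expert models, can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' released code: the helper names `uniform_sample_indices` and `fuse_masks`, and the simple pixel-wise soft-voting fusion, are assumptions made for clarity.

```python
import numpy as np

def uniform_sample_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick `num_samples` frame indices spread evenly across the whole video."""
    if num_frames <= num_samples:
        return list(range(num_frames))
    # np.linspace gives evenly spaced positions over [0, num_frames - 1].
    return np.linspace(0, num_frames - 1, num_samples, dtype=int).tolist()

def fuse_masks(mask_sets: list[np.ndarray], threshold: float = 0.5) -> np.ndarray:
    """Fuse binary masks from several expert models by pixel-wise soft voting.

    mask_sets: one array per expert model, each shaped (T, H, W) with values
               in {0, 1} and aligned on the same frames.
    Returns a fused (T, H, W) binary mask.
    """
    stacked = np.stack(mask_sets, axis=0).astype(np.float32)  # (M, T, H, W)
    mean_mask = stacked.mean(axis=0)                          # average vote per pixel
    return (mean_mask >= threshold).astype(np.uint8)          # re-binarize
```

For example, a 600-frame clip could be reduced to 32 uniformly spaced frames before prompting the LMM, and the per-frame masks produced by several models could then be passed to `fuse_masks` so that an error made by a single model is outvoted by the others.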
Related papers
- IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs [36.76252153495239]
IV-Bench is the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning.
IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks.
arXiv Detail & Related papers (2025-04-21T19:53:44Z) - 4th PVUW MeViS 3rd Place Report: Sa2VA [105.88675577642204]
We show that a simple modification to the test-time inference method on stronger MLLMs leads to stronger results on MeViS. In particular, we adopt the recent method Sa2VA, a unified model for dense grounded understanding of both images and videos.
arXiv Detail & Related papers (2025-04-01T07:06:47Z) - BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding [51.49345400300556]
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. Traditional approaches, such as uniform frame sampling, inevitably allocate resources to irrelevant content. We introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
arXiv Detail & Related papers (2025-03-27T13:18:40Z) - ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We present ViLLa: Video reasoning segmentation with Large Language Model. ViLLa tackles these challenges through multiple core innovations. To enable efficient processing of long videos, ViLLa incorporates a key segment sampler that adaptively partitions long videos into shorter but semantically dense segments, reducing redundancy.
arXiv Detail & Related papers (2024-07-18T17:59:17Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - DVOS: Self-Supervised Dense-Pattern Video Object Segmentation [6.092973123903838]
In Dense Video Object Segmentation (DVOS) scenarios, each video frame encompasses hundreds of small, dense, and partially occluded objects.
We propose a semi-self-supervised spatiotemporal approach for DVOS utilizing a diffusion-based method through multi-task learning.
To demonstrate the utility and efficacy of the proposed approach, we developed DVOS models for wheat head segmentation of handheld and drone-captured videos.
arXiv Detail & Related papers (2024-06-07T17:58:36Z) - Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward [118.65089648651308]
This paper introduces a novel framework that utilizes detailed video captions as a proxy of video content.
We show that applying this tailored reward through DPO significantly improves the performance of video LMMs on video Question Answering (QA) tasks.
arXiv Detail & Related papers (2024-04-01T17:28:16Z) - Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [27.04277811443469]
Video-LLaVA learns from a mixed dataset of images and videos, mutually enhancing each other.
Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks, spanning 5 image question-answering datasets and 4 image benchmark toolkits.
arXiv Detail & Related papers (2023-11-16T10:59:44Z) - An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling [152.75131627307567]
Masked visual modeling (MVM) has been recently proven effective for visual pre-training.
We systematically examine the potential of MVM in the context of VidL learning.
We show VIOLETv2 pre-trained with MVM achieves notable improvements on 13 VidL benchmarks.
arXiv Detail & Related papers (2022-09-04T06:30:32Z) - VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)