MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
- URL: http://arxiv.org/abs/2510.09274v1
- Date: Fri, 10 Oct 2025 11:18:21 GMT
- Title: MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
- Authors: Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
- Abstract summary: Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions. We propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability.
- Score: 40.37010049965347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated \texttt{[FIND]} token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the starting point for high-quality mask initialization and dynamically updates it at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
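To make the sampling idea concrete, here is a minimal sketch of Moment-Centric Sampling as described in the abstract: frame indices inside the grounded moment are sampled densely, the remainder sparsely. The function name, frame budget, and dense-to-sparse ratio are illustrative assumptions, not details from the paper.

```python
import numpy as np

def moment_centric_sampling(num_frames, moment, budget=32, dense_ratio=4):
    """Hypothetical sketch of Moment-Centric Sampling (MCS).

    Densely samples frame indices inside the grounded moment
    [start, end] and sparsely samples everything else, so the frame
    budget favors motion detail near the key moment while keeping
    global context. Parameter values are illustrative.
    """
    start, end = moment
    inside = np.arange(start, end + 1)
    outside = np.concatenate([np.arange(0, start), np.arange(end + 1, num_frames)])

    # Split the budget dense_ratio-to-1 in favor of the moment.
    n_dense = min(len(inside), budget * dense_ratio // (dense_ratio + 1))
    n_sparse = min(budget - n_dense, len(outside))

    # Uniformly subsample each region at its own rate.
    dense = inside[np.linspace(0, len(inside) - 1, n_dense).round().astype(int)]
    sparse = outside[np.linspace(0, len(outside) - 1, n_sparse).round().astype(int)]
    return np.unique(np.concatenate([dense, sparse]))

# Example: 300-frame video, [FIND] grounds the moment to frames 120-180.
print(moment_centric_sampling(300, (120, 180)))
```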
Related papers
- SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM [36.28285195488772]
Large language models (LLMs) have demonstrated exceptional capabilities in text understanding. Vid-LLMs, however, struggle to retain high-quality frame-level semantic information, a limitation that hinders their advancement towards fine-grained video understanding.
arXiv Detail & Related papers (2026-02-03T14:39:16Z)
- Temporal Prompting Matters: Rethinking Referring Video Object Segmentation [64.82333675385802]
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations. We propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors.
arXiv Detail & Related papers (2025-10-08T17:59:57Z)
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset to fine-tune the model's responses, encouraging it to focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
- DATE: Dynamic Absolute Time Enhancement for Long Video Understanding [8.720269393713451]
Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs). We propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs. We introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage.
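The "two-stage" description above suggests a simple pattern: select frames for semantic relevance first, then pad for temporal coverage. A hedged sketch under that reading (the function name and the 50/50 budget split are my assumptions, not the paper's):

```python
import numpy as np

def two_stage_sampling(query_sims, budget=32):
    """Hypothetical two-stage frame selection in the spirit of DATE.

    Stage 1 keeps the frames most similar to the query (semantic
    relevance); stage 2 tops the set up with uniformly spaced frames
    (temporal coverage). The split and names are assumptions.
    """
    num_frames = len(query_sims)
    # Stage 1: top half of the budget by per-frame query similarity.
    selected = set(np.argsort(query_sims)[-(budget // 2):].tolist())
    # Stage 2: fill the remaining budget uniformly across the timeline.
    for idx in np.linspace(0, num_frames - 1, budget).round().astype(int):
        if len(selected) >= budget:
            break
        selected.add(int(idx))
    return sorted(selected)
```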
arXiv Detail & Related papers (2025-09-11T08:49:22Z)
- TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding [12.57589714067771]
Video Temporal Grounding aims to extract relevant video segments based on a given natural language query. Previous approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. We propose a simple yet effective Temporal-Aware approach for zero-shot video temporal grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment.
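As a rough illustration of the pipeline named above, the sketch below applies temporal pooling (window smoothing) and then merges contiguous above-threshold frames into segments; the clustering and similarity-adjustment steps are reduced to a single threshold here, so treat this as a loose approximation rather than TAG itself.

```python
import numpy as np

def pool_and_segment(frame_sims, window=5, thresh=0.3):
    """Loose sketch of temporal pooling plus segment merging.

    Smooths per-frame query similarities over a local window, then
    merges contiguous above-threshold frames into (start, end)
    segments so semantically continuous frames are not fragmented.
    """
    smoothed = np.convolve(frame_sims, np.ones(window) / window, mode="same")
    mask = smoothed > thresh

    segments, start = [], None
    for i, hit in enumerate(mask):
        if hit and start is None:
            start = i                        # a segment opens
        elif not hit and start is not None:
            segments.append((start, i - 1))  # a segment closes
            start = None
    if start is not None:
        segments.append((start, len(mask) - 1))
    return segments
```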
arXiv Detail & Related papers (2025-08-11T12:38:46Z)
- Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding [18.027290155746112]
Temporal Search is a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. It is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. It refines the focus of the model by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos.
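The iterative refinement lends itself to a short sketch: score each candidate interval with the model's confidence, keep the best one, and split it again. Here score_fn is a placeholder for the MLLM's generation confidence on a clip; the split factor and depth are assumptions.

```python
def iterative_zoom_in(video_len, score_fn, depth=3, splits=4):
    """Hypothetical confidence-guided temporal zoom-in.

    score_fn(start, end) stands in for the model's generation
    confidence on the clip [start, end); each level keeps the
    highest-confidence sub-interval and splits it further.
    """
    start, end = 0, video_len
    for _ in range(depth):
        step = max((end - start) // splits, 1)
        candidates = [(s, min(s + step, end)) for s in range(start, end, step)]
        start, end = max(candidates, key=lambda iv: score_fn(*iv))
    return start, end
```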
arXiv Detail & Related papers (2025-06-28T15:24:05Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- Moment Quantization for Video Temporal Grounding [29.081100914208974]
We propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG). MQVTG quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
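Quantizing moment features onto a discrete codebook is a standard operation; below is a minimal sketch of that core step only (codebook size and feature dimension are arbitrary, and MQVTG's training procedure is omitted).

```python
import torch

def quantize_moments(features, codebook):
    """Minimal sketch of moment quantization.

    Maps each per-moment feature (T, D) to its nearest codebook
    vector (K, D); moments sharing semantics collapse onto the same
    discrete code, the discrimination effect described above.
    """
    dists = torch.cdist(features, codebook)  # pairwise distances (T, K)
    codes = dists.argmin(dim=1)              # discrete code index per moment
    return codebook[codes], codes

features = torch.randn(16, 256)  # 16 moment features (illustrative sizes)
codebook = torch.randn(64, 256)  # 64-entry codebook
quantized, ids = quantize_moments(features, codebook)
```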
arXiv Detail & Related papers (2025-04-03T05:21:14Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks $1^{st}$ on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)