MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
- URL: http://arxiv.org/abs/2510.09274v1
- Date: Fri, 10 Oct 2025 11:18:21 GMT
- Title: MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
- Authors: Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
- Abstract summary: Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions. We propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability.
- Score: 40.37010049965347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated \texttt{[FIND]} token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as the starting point for high-quality mask initialization and dynamically updates it at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
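To make the sampling idea concrete, here is a minimal sketch of Moment-Centric Sampling as described in the abstract: frame indices inside the grounded moment are sampled densely, the remainder sparsely. The function name, frame budget, and dense-to-sparse ratio are illustrative assumptions, not details from the paper.

```python
import numpy as np

def moment_centric_sampling(num_frames, moment, budget=32, dense_ratio=4):
    """Hypothetical sketch of Moment-Centric Sampling (MCS).

    Densely samples frame indices inside the grounded moment
    [start, end] and sparsely samples everything else, so the frame
    budget favors motion detail near the key moment while keeping
    global context. Parameter values are illustrative.
    """
    start, end = moment
    inside = np.arange(start, end + 1)
    outside = np.concatenate([np.arange(0, start), np.arange(end + 1, num_frames)])

    # Split the budget dense_ratio-to-1 in favor of the moment.
    n_dense = min(len(inside), budget * dense_ratio // (dense_ratio + 1))
    n_sparse = min(budget - n_dense, len(outside))

    # Uniformly subsample each region at its own rate.
    dense = inside[np.linspace(0, len(inside) - 1, n_dense).round().astype(int)]
    sparse = outside[np.linspace(0, len(outside) - 1, n_sparse).round().astype(int)]
    return np.unique(np.concatenate([dense, sparse]))

# Example: 300-frame video, [FIND] grounds the moment to frames 120-180.
print(moment_centric_sampling(300, (120, 180)))
```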
Related papers
- SlowFocus: Enhancing Fine-grained Temporal Understanding in Video LLM [36.28285195488772]
Large language models (LLMs) have demonstrated exceptional capabilities in text understanding. Vid-LLMs, however, struggle to retain high-quality frame-level semantic information, a limitation that hinders their advancement towards fine-grained video understanding.
arXiv Detail & Related papers (2026-02-03T14:39:16Z)
- Temporal Prompting Matters: Rethinking Referring Video Object Segmentation [64.82333675385802]
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations. We propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors.
arXiv Detail & Related papers (2025-10-08T17:59:57Z)
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset to fine-tune the model's responses, encouraging it to focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
- DATE: Dynamic Absolute Time Enhancement for Long Video Understanding [8.720269393713451]
Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs). We propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs. We introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage.
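The "two-stage" description above suggests a simple pattern: select frames for semantic relevance first, then pad for temporal coverage. A hedged sketch under that reading (the function name and the 50/50 budget split are my assumptions, not the paper's):

```python
import numpy as np

def two_stage_sampling(query_sims, budget=32):
    """Hypothetical two-stage frame selection in the spirit of DATE.

    Stage 1 keeps the frames most similar to the query (semantic
    relevance); stage 2 tops the set up with uniformly spaced frames
    (temporal coverage). The split and names are assumptions.
    """
    num_frames = len(query_sims)
    # Stage 1: top half of the budget by per-frame query similarity.
    selected = set(np.argsort(query_sims)[-(budget // 2):].tolist())
    # Stage 2: fill the remaining budget uniformly across the timeline.
    for idx in np.linspace(0, num_frames - 1, budget).round().astype(int):
        if len(selected) >= budget:
            break
        selected.add(int(idx))
    return sorted(selected)
```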
arXiv Detail & Related papers (2025-09-11T08:49:22Z)
- TAG: A Simple Yet Effective Temporal-Aware Approach for Zero-Shot Video Temporal Grounding [12.57589714067771]
Video Temporal Grounding aims to extract relevant video segments based on a given natural language query. Previous approaches suffer from semantic fragmentation, where temporally continuous frames sharing the same semantics are split across multiple segments. We propose a simple yet effective Temporal-Aware approach for zero-shot video temporal grounding, which incorporates temporal pooling, temporal coherence clustering, and similarity adjustment.
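As a rough illustration of the pipeline named above, the sketch below applies temporal pooling (window smoothing) and then merges contiguous above-threshold frames into segments; the clustering and similarity-adjustment steps are reduced to a single threshold here, so treat this as a loose approximation rather than TAG itself.

```python
import numpy as np

def pool_and_segment(frame_sims, window=5, thresh=0.3):
    """Loose sketch of temporal pooling plus segment merging.

    Smooths per-frame query similarities over a local window, then
    merges contiguous above-threshold frames into (start, end)
    segments so semantically continuous frames are not fragmented.
    """
    smoothed = np.convolve(frame_sims, np.ones(window) / window, mode="same")
    mask = smoothed > thresh

    segments, start = [], None
    for i, hit in enumerate(mask):
        if hit and start is None:
            start = i                        # a segment opens
        elif not hit and start is not None:
            segments.append((start, i - 1))  # a segment closes
            start = None
    if start is not None:
        segments.append((start, len(mask) - 1))
    return segments
```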
arXiv Detail & Related papers (2025-08-11T12:38:46Z)
- Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding [18.027290155746112]
Temporal Search is a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. It is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. It refines the focus of the model by iteratively shifting attention to more fine-grained temporal intervals, improving its understanding of long videos.
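The iterative refinement lends itself to a short sketch: score each candidate interval with the model's confidence, keep the best one, and split it again. Here score_fn is a placeholder for the MLLM's generation confidence on a clip; the split factor and depth are assumptions.

```python
def iterative_zoom_in(video_len, score_fn, depth=3, splits=4):
    """Hypothetical confidence-guided temporal zoom-in.

    score_fn(start, end) stands in for the model's generation
    confidence on the clip [start, end); each level keeps the
    highest-confidence sub-interval and splits it further.
    """
    start, end = 0, video_len
    for _ in range(depth):
        step = max((end - start) // splits, 1)
        candidates = [(s, min(s + step, end)) for s in range(start, end, step)]
        start, end = max(candidates, key=lambda iv: score_fn(*iv))
    return start, end
```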
arXiv Detail & Related papers (2025-06-28T15:24:05Z)
- VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z)
- Moment Quantization for Video Temporal Grounding [29.081100914208974]
We propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG). MQVTG quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
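Quantizing moment features onto a discrete codebook is a standard operation; below is a minimal sketch of that core step only (codebook size and feature dimension are arbitrary, and MQVTG's training procedure is omitted).

```python
import torch

def quantize_moments(features, codebook):
    """Minimal sketch of moment quantization.

    Maps each per-moment feature (T, D) to its nearest codebook
    vector (K, D); moments sharing semantics collapse onto the same
    discrete code, the discrimination effect described above.
    """
    dists = torch.cdist(features, codebook)  # pairwise distances (T, K)
    codes = dists.argmin(dim=1)              # discrete code index per moment
    return codebook[codes], codes

features = torch.randn(16, 256)  # 16 moment features (illustrative sizes)
codebook = torch.randn(64, 256)  # 64-entry codebook
quantized, ids = quantize_moments(features, codebook)
```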
arXiv Detail & Related papers (2025-04-03T05:21:14Z)
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks $1^{st}$ on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z)