THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation
- URL: http://arxiv.org/abs/2506.06748v1
- Date: Sat, 07 Jun 2025 10:33:16 GMT
- Title: THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation
- Authors: Mingqi Gao, Haoran Duan, Tianlu Zhang, Jungong Han
- Abstract summary: Our method combines large-scale visual pretraining from SAM2 with depth-based geometric cues to handle complex scenes and long-term tracking. On the VISOR test set, our method reaches a J&F score of 90.1%.
- Score: 49.54727231738117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we describe our approach to egocentric video object segmentation. Our method combines large-scale visual pretraining from SAM2 with depth-based geometric cues to handle complex scenes and long-term tracking. By integrating these signals in a unified framework, we achieve strong segmentation performance. On the VISOR test set, our method reaches a J&F score of 90.1%.
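The J&F score reported above is the standard video object segmentation metric: the mean of region similarity J (mask intersection-over-union) and boundary accuracy F (an F-measure over mask boundaries). A minimal sketch of the idea follows; it is simplified, since the official benchmark matches boundaries with a dilation tolerance, which is omitted here:

```python
import numpy as np

def jaccard(pred, gt):
    # J: region similarity, intersection-over-union of binary masks
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def boundary_f(pred, gt):
    # F: crude boundary F-measure; the official evaluation dilates
    # boundaries by a tolerance radius before matching (skipped here)
    def boundary(m):
        pad = np.pad(m, 1)
        core = (pad[1:-1, 1:-1] & pad[:-2, 1:-1] & pad[2:, 1:-1]
                & pad[1:-1, :-2] & pad[1:-1, 2:])
        return m & ~core  # mask pixels with a background 4-neighbour
    pb, gb = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    tp = np.logical_and(pb, gb).sum()
    precision = tp / pb.sum() if pb.sum() else 1.0
    recall = tp / gb.sum() if gb.sum() else 1.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Two overlapping square masks as a toy example
pred = np.zeros((20, 20), bool); pred[5:15, 5:15] = True
gt = np.zeros((20, 20), bool); gt[6:16, 6:16] = True
j, f = jaccard(pred, gt), boundary_f(pred, gt)
print((j + f) / 2)  # J&F: mean of region and boundary scores
```

Without the boundary tolerance, F is much stricter than in the official toolkit, but the structure of the metric is the same.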
Related papers
- DC-SAM: In-Context Segment Anything in Images and Videos via Dual Consistency [91.30252180093333]
We propose the Dual Consistency SAM (DC-SAM) method based on prompt tuning to adapt SAM and SAM2 for in-context segmentation. Our key insight is to enhance the features of SAM's prompt encoder in segmentation by providing high-quality visual prompts. Although the proposed DC-SAM is primarily designed for images, it can be seamlessly extended to the video domain with the support of SAM2.
arXiv Detail & Related papers (2025-04-16T13:41:59Z)
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
- 3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation [63.199793919573295]
Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames.
Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance.
arXiv Detail & Related papers (2024-06-06T00:56:25Z)
- ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation [62.98078087018469]
We introduce MSDeAOT, a variant of the AOT framework that incorporates transformers at multiple feature scales.
MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16.
We also employ GPM in a more refined feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects.
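The stride values above refer to feature-map resolution: a stride-16 feature map has 1/16 of the input resolution in each dimension, so small objects can nearly vanish when masks are propagated there, while stride 8 retains more spatial detail. A toy illustration of this resolution effect (not the AOT/MSDeAOT implementation):

```python
import numpy as np

def downsample_mask(mask, stride):
    # A feature cell is "on" if any pixel in its stride x stride patch
    # is foreground (a max-pool-style reduction).
    h, w = mask.shape
    return mask.reshape(h // stride, stride, w // stride, stride).any(axis=(1, 3))

def upsample_mask(mask_lr, stride):
    # Expand each feature cell back to a stride x stride pixel block.
    return mask_lr.repeat(stride, axis=0).repeat(stride, axis=1)

def iou(a, b):
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

# A small 6x6-pixel object in a 64x64 frame.
frame = np.zeros((64, 64), bool)
frame[30:36, 30:36] = True

for stride in (16, 8):
    recovered = upsample_mask(downsample_mask(frame, stride), stride)
    print(stride, round(iou(frame, recovered), 3))
```

Round-tripping the small mask through a stride-16 grid recovers a much coarser blob than the stride-8 grid does, which is why a finer feature scale helps with small-object tracking.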
arXiv Detail & Related papers (2023-07-05T03:43:15Z)
- Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.