SPOC: Spatially-Progressing Object State Change Segmentation in Video
- URL: http://arxiv.org/abs/2503.11953v1
- Date: Sat, 15 Mar 2025 01:48:54 GMT
- Title: SPOC: Spatially-Progressing Object State Change Segmentation in Video
- Authors: Priyanka Mandikal, Tushar Nagarajan, Alex Stoken, Zihui Xue, Kristen Grauman
- Abstract summary: We introduce the spatially-progressing object state change segmentation task. The goal is to segment, at the pixel level, those regions of an object that are actionable and those that are transformed. We demonstrate useful implications for tracking activity progress to benefit robotic agents.
- Score: 52.65373395382122
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object state changes in video reveal critical information about human and agent activity. However, existing methods are limited to temporal localization of when the object is in its initial state (e.g., the unchopped avocado) versus when it has completed a state change (e.g., the chopped avocado), which limits applicability for any task requiring detailed information about the progress of the action and its spatial localization. We propose to deepen the problem by introducing the spatially-progressing object state change segmentation task. The goal is to segment, at the pixel level, those regions of an object that are actionable and those that are transformed. We introduce the first model to address this task, designing a VLM-based pseudo-labeling approach, state-change dynamics constraints, and a novel WhereToChange benchmark built on in-the-wild Internet videos. Experiments on two datasets validate both the challenge of the new task and the promise of our model for localizing exactly where and how fast objects are changing in video. We further demonstrate useful implications for tracking activity progress to benefit robotic agents. Project page: https://vision.cs.utexas.edu/projects/spoc-spatially-progressing-osc
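The task definition above suggests a simple downstream readout: given per-frame binary masks for the still-actionable and already-transformed regions, the fraction of transformed pixels tracks how far the state change has progressed. The sketch below is a minimal illustration of that readout, not the paper's formulation; the mask inputs and the progress ratio are assumptions.

```python
import numpy as np

def state_change_progress(actionable_mask: np.ndarray,
                          transformed_mask: np.ndarray) -> float:
    """Fraction of the visible object that has completed its state change.

    Both inputs are HxW boolean masks for one frame: `actionable_mask`
    marks still-unchanged object pixels, `transformed_mask` marks
    already-changed ones (hypothetical model outputs).
    """
    transformed = float(transformed_mask.sum())
    total = float(actionable_mask.sum()) + transformed
    if total == 0.0:  # object not visible in this frame
        return float("nan")
    return transformed / total

# Tracking progress over a clip, given per-frame mask pairs:
# progress = [state_change_progress(a, t) for a, t in per_frame_masks]
# The slope of `progress` indicates how fast the object is changing.
```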
Related papers
- Segment Any Motion in Videos [80.72424676419755]
We propose a novel approach for moving object segmentation that combines long-range trajectory motion cues with DINO-based semantic features.
Our model employs Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding to prioritize motion while integrating semantic support.
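As a loose illustration of attending over point trajectories while integrating semantic support, the sketch below runs standard multi-head self-attention over trajectory tokens whose embeddings concatenate a motion descriptor with a DINO-style semantic feature. All dimensions, the concatenation scheme, and the classification head are assumptions for illustration; the paper's Spatio-Temporal Trajectory Attention and Motion-Semantic Decoupled Embedding are more involved.

```python
import torch
import torch.nn as nn

class TrajectoryAttention(nn.Module):
    """Self-attention over per-trajectory tokens (illustrative sketch)."""

    def __init__(self, motion_dim=128, semantic_dim=384,
                 embed_dim=256, num_heads=4):
        super().__init__()
        # Project concatenated motion + semantic features to a shared space.
        self.proj = nn.Linear(motion_dim + semantic_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Linear(embed_dim, 1)  # per-trajectory moving/static logit

    def forward(self, motion_feats, semantic_feats):
        # motion_feats:   (B, N, motion_dim) long-range trajectory descriptors
        # semantic_feats: (B, N, semantic_dim), e.g. DINO features sampled
        #                 along each trajectory
        tokens = self.proj(torch.cat([motion_feats, semantic_feats], dim=-1))
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.head(attended).squeeze(-1)  # (B, N) logits

# Dummy usage: 500 trajectories, batch of 2.
logits = TrajectoryAttention()(torch.randn(2, 500, 128), torch.randn(2, 500, 384))
```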
arXiv Detail & Related papers (2025-03-28T09:34:11Z)
- ActionVOS: Actions as Prompts for Video Object Segmentation [22.922260726461477]
ActionVOS aims at segmenting only active objects in egocentric videos using human actions as a key language prompt.
We develop an action-aware labeling module with an efficient action-guided focal loss.
Experiments show that ActionVOS significantly reduces the mis-segmentation of inactive objects.
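The abstract does not spell out the loss, but a focal loss modulated by per-pixel action relevance might look roughly like the sketch below. The `action_weight` input and the hyperparameter values are assumptions, not ActionVOS's published formulation.

```python
import torch
import torch.nn.functional as F

def action_guided_focal_loss(logits, targets, action_weight,
                             alpha=0.25, gamma=2.0):
    """Binary focal loss scaled by a per-pixel action-relevance weight.

    logits, targets: (B, H, W) segmentation logits and float {0, 1} labels.
    action_weight:   (B, H, W) weights, e.g. larger where the prompted
                     action suggests an object is active (assumed).
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal = alpha_t * (1 - p_t) ** gamma * ce                # standard focal term
    return (action_weight * focal).mean()
```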
arXiv Detail & Related papers (2024-07-10T06:57:04Z)
- Anticipating Object State Changes in Long Procedural Videos [0.8428703116072809]
The proposed framework predicts object state changes that will occur in the near future due to yet unseen human actions.
It integrates learned visual features that represent recent visual information with natural language processing (NLP) features that represent past object state changes and actions.
The proposed approach also underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems.
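One way to read the fusion described above is as a late-fusion predictor over pooled visual features and embedded text describing past actions and state changes. The sketch below is a hedged guess at that shape; all dimensions, the concatenation, and the MLP head are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class StateChangeAnticipator(nn.Module):
    """Late fusion of visual and language features (illustrative sketch)."""

    def __init__(self, visual_dim=768, text_dim=512,
                 hidden=256, num_state_changes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(visual_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_state_changes),  # upcoming state-change logits
        )

    def forward(self, visual_feat, text_feat):
        # visual_feat: (B, visual_dim) pooled features of recent frames
        # text_feat:   (B, text_dim) embedding of past actions / state changes
        return self.mlp(torch.cat([visual_feat, text_feat], dim=-1))

logits = StateChangeAnticipator()(torch.randn(4, 768), torch.randn(4, 512))
```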
arXiv Detail & Related papers (2024-05-21T13:40:30Z)
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Breaking the "Object" in Video Object Segmentation [36.20167854011788]
We present a dataset for Video Object Segmentation under Transformations (VOST).
It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks.
A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent.
We show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues.
arXiv Detail & Related papers (2022-12-12T19:22:17Z)
- Is an Object-Centric Video Representation Beneficial for Transfer? [86.40870804449737]
We introduce a new object-centric video recognition model built on a transformer architecture.
We show that the object-centric model outperforms prior video representations.
arXiv Detail & Related papers (2022-07-20T17:59:44Z)
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
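As a toy illustration of measuring statistical dependence between a segmentation and a motion field, the sketch below quantizes flow directions and computes the mutual information between segment labels and flow bins from their joint histogram. This is only one ingredient such a partitioning objective could score, and the direction-binning scheme is an assumption; DyStaB's actual criterion over the segments' motion fields is formulated differently.

```python
import numpy as np

def segment_flow_mutual_information(segments, flow, n_bins=8):
    """MI between segment labels and quantized optical-flow directions.

    segments: (H, W) non-negative integer segment labels.
    flow:     (H, W, 2) per-pixel optical flow (dx, dy).
    """
    angles = np.arctan2(flow[..., 1], flow[..., 0])            # in [-pi, pi]
    bins = ((angles + np.pi) / (2 * np.pi) * n_bins).astype(int)
    bins = np.minimum(bins, n_bins - 1).ravel()
    labels = segments.ravel()
    joint = np.zeros((labels.max() + 1, n_bins))
    np.add.at(joint, (labels, bins), 1)                        # joint histogram
    p = joint / joint.sum()
    p_seg = p.sum(axis=1, keepdims=True)                       # marginals
    p_bin = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (p_seg @ p_bin)[nz])).sum())
```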
arXiv Detail & Related papers (2020-08-16T22:05:13Z)