Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress
- URL: http://arxiv.org/abs/2509.24129v1
- Date: Sun, 28 Sep 2025 23:56:07 GMT
- Title: Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress
- Authors: Priyanka Mandikal, Jiaheng Hu, Shivin Dass, Sagnik Majumder, Roberto Martín-Martín, Kristen Grauman
- Abstract summary: We present SPARTA, the first unified framework for the family of object state change manipulation tasks. SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions, and dense rewards that capture incremental progress over time. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects.
- Score: 53.723881111373736
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most robot manipulation focuses on changing the kinematic state of objects: picking, placing, opening, or rotating them. However, a wide range of real-world manipulation tasks involve a different class of object state change--such as mashing, spreading, or slicing--where the object's physical and visual state evolve progressively without necessarily changing its position. We present SPARTA, the first unified framework for the family of object state change manipulation tasks. Our key insight is that these tasks share a common structural pattern: they involve spatially-progressing, object-centric changes that can be represented as regions transitioning from an actionable to a transformed state. Building on this insight, SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions for specific object state change tasks, to generate a) structured policy observations that strip away appearance variability, and b) dense rewards that capture incremental progress over time. These are leveraged in two SPARTA policy variants: reinforcement learning for fine-grained control without demonstrations or simulation; and greedy control for fast, lightweight deployment. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects, achieving significant improvements in training time and accuracy over sparse rewards and visual goal-conditioned baselines. Our results highlight progress-aware visual representations as a versatile foundation for the broader family of object state manipulation tasks. Project website: https://vision.cs.utexas.edu/projects/sparta-robot
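To make the reward and observation structure concrete, here is a minimal sketch (not the authors' implementation) of how a dense, progress-aware reward and an appearance-stripped observation could be derived from a per-pixel actionable-vs-transformed segmentation map. The label convention, the two-channel observation, and the pixel-fraction progress measure are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Illustrative sketch (assumptions, not the authors' code): given a per-pixel
# segmentation of the object into actionable vs. transformed regions, derive
# (a) an appearance-stripped observation and (b) a dense reward from the
# increase in transformed area between consecutive frames.

ACTIONABLE, TRANSFORMED = 1, 2  # hypothetical label convention; 0 = background

def structured_observation(seg_map: np.ndarray) -> np.ndarray:
    """Stack binary actionable/transformed masks as a 2-channel observation."""
    actionable = (seg_map == ACTIONABLE).astype(np.float32)
    transformed = (seg_map == TRANSFORMED).astype(np.float32)
    return np.stack([actionable, transformed], axis=0)

def progress(seg_map: np.ndarray) -> float:
    """Fraction of object pixels already transformed (0 = untouched, 1 = done)."""
    object_pixels = (seg_map == ACTIONABLE) | (seg_map == TRANSFORMED)
    total = object_pixels.sum()
    return float((seg_map == TRANSFORMED).sum() / total) if total else 0.0

def dense_reward(prev_seg: np.ndarray, curr_seg: np.ndarray) -> float:
    """Reward the incremental progress made between consecutive observations."""
    return progress(curr_seg) - progress(prev_seg)
```

Under these assumptions, a reinforcement learning policy would consume the masked observation and the incremental reward, while a greedy controller could simply act on the largest remaining actionable region at each step; both are sketches of the two policy variants described above, not their exact implementations.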
Related papers
- Tracking and Understanding Object Transformations [43.15129025464927]
We introduce the task of Track Any State: tracking objects through transformations while detecting and describing state changes. We present TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states are evolving over time. TubeletGraph achieves deeper understanding of object transformations and promising capabilities in temporal grounding and semantic reasoning for complex object transformations.
arXiv Detail & Related papers (2025-11-06T18:59:30Z)
- SPOC: Spatially-Progressing Object State Change Segmentation in Video [52.65373395382122]
We introduce the spatially-progressing object state change segmentation task. The goal is to segment at the pixel level those regions of an object that are actionable and those that are transformed. We demonstrate useful implications for tracking activity progress to benefit robotic agents.
arXiv Detail & Related papers (2025-03-15T01:48:54Z)
- M$^3$-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation [51.82272563578793]
We introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and potential morphological and appearance changes. We present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M$^3$-VOS), to verify the ability of models to understand object phases.
arXiv Detail & Related papers (2024-12-18T12:50:11Z)
- A Dataset and Framework for Learning State-invariant Object Representations [0.6577148087211809]
We present a novel dataset, ObjectsWithStateChange, which captures state and pose variations in object images recorded from arbitrary viewpoints. An ablation on the role of curriculum learning indicates an improvement in object recognition accuracy of 7.9% and retrieval mAP of 9.2% over the state-of-the-art on our new dataset.
arXiv Detail & Related papers (2024-04-09T17:17:48Z)
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on the Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Learning Generalizable Manipulation Policies with Object-Centric 3D Representations [65.55352131167213]
GROOT is an imitation learning method for learning robust policies with object-centric and 3D priors.
It builds policies that generalize beyond their initial training conditions for vision-based manipulation.
GROOT generalizes well over background changes, camera viewpoint shifts, and the presence of new object instances.
arXiv Detail & Related papers (2023-10-22T18:51:45Z)
- Semantically Grounded Object Matching for Robust Robotic Scene Rearrangement [21.736603698556042]
We present a novel approach to object matching that uses a large pre-trained vision-language model to match objects in a cross-instance setting.
We demonstrate that this provides considerably improved matching performance in cross-instance settings (a minimal matching sketch follows this entry).
arXiv Detail & Related papers (2021-11-15T18:39:43Z)
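As a rough illustration of the cross-instance matching idea in the entry above, the sketch below embeds object crops with an off-the-shelf vision-language image encoder and pairs them via Hungarian assignment. The choice of CLIP, the specific checkpoint, and the assignment step are assumptions for illustration, not the paper's actual pipeline.

```python
import torch
from scipy.optimize import linear_sum_assignment
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch (assumptions, not the paper's pipeline): match object
# crops from a source scene to crops from a target scene by cosine similarity
# of vision-language image embeddings, then solve a one-to-one assignment.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(crops):
    """Return L2-normalized image embeddings for a list of PIL image crops."""
    inputs = processor(images=crops, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def match_objects(source_crops, target_crops):
    """Pair each source crop with a target crop, maximizing total similarity."""
    sim = embed(source_crops) @ embed(target_crops).T
    rows, cols = linear_sum_assignment(-sim.numpy())  # negate: assignment minimizes cost
    return [(int(r), int(c), float(sim[r, c])) for r, c in zip(rows, cols)]
```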