Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT
- URL: http://arxiv.org/abs/2508.08748v1
- Date: Tue, 12 Aug 2025 08:45:09 GMT
- Title: Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT
- Authors: Muhammad A. Muttaqien, Tomohiro Motoda, Ryo Hanai, Yukiyasu Domae
- Abstract summary: This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting. We employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments.
- Score: 3.281128493853064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robotic pick-and-place tasks in convenience stores pose challenges due to dense object arrangements, occlusions, and variations in object properties such as color, shape, size, and texture. These factors complicate trajectory planning and grasping. This paper introduces a perception-action pipeline leveraging annotation-guided visual prompting, where bounding box annotations identify both pickable objects and placement locations, providing structured spatial guidance. Instead of traditional step-by-step planning, we employ Action Chunking with Transformers (ACT) as an imitation learning algorithm, enabling the robotic arm to predict chunked action sequences from human demonstrations. This facilitates smooth, adaptive, and data-driven pick-and-place operations. We evaluate our system based on success rate and visual analysis of grasping behavior, demonstrating improved grasp accuracy and adaptability in retail environments.
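For intuition, below is a minimal, hypothetical Python sketch of the two ideas the abstract describes: bounding-box annotations used as a visual prompt, and an ACT-style policy that is queried for a chunk of actions rather than one step at a time. All class and parameter names (VisualPrompt, ChunkedPolicy, chunk_size) are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch (not the authors' code): an annotation-guided
# visual prompt plus an ACT-style policy stub that predicts a chunk of actions.
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class VisualPrompt:
    """Annotation-guided prompt: bounding boxes give structured spatial guidance."""
    pick_box: Tuple[int, int, int, int]   # (x, y, w, h) around the pickable object
    place_box: Tuple[int, int, int, int]  # (x, y, w, h) around the placement location

class ChunkedPolicy:
    """Stand-in for an ACT-style imitation policy: one query yields k actions."""
    def __init__(self, chunk_size: int = 10, action_dim: int = 7):
        self.chunk_size = chunk_size   # k actions predicted per forward pass
        self.action_dim = action_dim   # e.g. 6-DoF end-effector pose + gripper

    def predict_chunk(self, image: np.ndarray, prompt: VisualPrompt) -> np.ndarray:
        # A trained transformer would condition on the image and the annotation
        # boxes; this placeholder just returns a zero-filled chunk of actions.
        return np.zeros((self.chunk_size, self.action_dim))

# Usage: execute one predicted chunk, then re-query with a fresh observation.
policy = ChunkedPolicy()
frame = np.zeros((480, 640, 3), dtype=np.uint8)  # camera image placeholder
prompt = VisualPrompt(pick_box=(120, 200, 60, 40), place_box=(400, 220, 80, 50))
for action in policy.predict_chunk(frame, prompt):
    pass  # send each action to the robot controller
```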
Related papers
- Towards an Accurate and Effective Robot Vision (The Problem of Topological Localization for Mobile Robots) [0.43064121494080315]
This work addresses topological localization in an office environment using only images acquired with a perspective color camera mounted on a robot platform. We evaluate state-of-the-art visual descriptors, including Color Histograms, SIFT, ASIFT, RGB-SIFT, and Bag-of-Visual-Words approaches inspired by text retrieval.
arXiv Detail & Related papers (2025-09-05T09:14:59Z)
- What to Do Next? Memorizing skills from Egocentric Instructional Video [43.59787683244105]
We present a novel task, interactive action planning, and propose an approach that combines topological affordance memory with a transformer architecture. Our experimental results demonstrate that the proposed approach learns meaningful representations, yielding improved performance and robustness when action deviations occur.
arXiv Detail & Related papers (2025-07-01T22:53:41Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation [16.244250979166214]
Generalizable Planning-Guided Diffusion Policy Learning (GLIDE) is an approach that learns to solve contact-rich bimanual manipulation tasks. We propose a set of essential design options in feature extraction, task representation, action prediction, and data augmentation. Our approach can enable a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties.
arXiv Detail & Related papers (2024-12-03T18:51:39Z)
- Semantic-Geometric-Physical-Driven Robot Manipulation Skill Transfer via Skill Library and Tactile Representation [6.324290412766366]
We propose a knowledge graph-based skill library construction method to organize manipulation knowledge. We also propose a novel hierarchical skill transfer framework based on the skill library and tactile representation. Experiments demonstrate the skill transfer and adaptability capabilities of the proposed methods.
arXiv Detail & Related papers (2024-11-18T16:42:07Z)
- Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [57.942404069484134]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered. Previous research employed interactive perception for manipulating articulated objects, but such open-loop approaches often overlook the interaction dynamics. We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z)
- A Robotics-Inspired Scanpath Model Reveals the Importance of Uncertainty and Semantic Object Cues for Gaze Guidance in Dynamic Scenes [8.64158103104882]
We present a computational model that simulates object segmentation and gaze behavior in an interconnected manner. We show how our model's modular design allows for extensions, such as incorporating saccadic momentum or pre-saccadic attention.
arXiv Detail & Related papers (2024-08-02T15:20:34Z)
- Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information [68.10033984296247]
This paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy.
Our contributions involve using a data-driven approach with a simple architecture designed for real-time operation, a self-supervised data training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications.
arXiv Detail & Related papers (2024-07-22T12:32:09Z)
- Towards Explainable Motion Prediction using Heterogeneous Graph Representations [3.675875935838632]
Motion prediction systems aim to capture the future behavior of traffic scenarios enabling autonomous vehicles to perform safe and efficient planning.
GNN-based approaches have recently gained attention as they are well suited to naturally model these interactions.
In this work, we aim to improve the explainability of motion prediction systems by using different approaches.
arXiv Detail & Related papers (2022-12-07T17:43:42Z)
- Self-Supervision by Prediction for Object Discovery in Videos [62.87145010885044]
In this paper, we use the prediction task as self-supervision and build a novel object-centric model for image sequence representation.
Our framework can be trained without the help of any manual annotation or pretrained network.
Initial experiments confirm that the proposed pipeline is a promising step towards object-centric video prediction.
arXiv Detail & Related papers (2021-03-09T19:14:33Z)
- Latent Space Roadmap for Visual Action Planning of Deformable and Rigid Object Manipulation [74.88956115580388]
Planning is performed in a low-dimensional latent state space that embeds images.
Our framework consists of two main components: a Visual Foresight Module (VFM) that generates a visual plan as a sequence of images, and an Action Proposal Network (APN) that predicts the actions between them.
arXiv Detail & Related papers (2020-03-19T18:43:26Z)