Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation
- URL: http://arxiv.org/abs/2602.09940v1
- Date: Tue, 10 Feb 2026 16:25:39 GMT
- Title: Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation
- Authors: Archit Sharma, Dharmendra Sharma, John Rebeiro, Peeyush Thakur, Narendra Dhar, Laxmidhar Behera
- Abstract summary: We develop a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation.
- Score: 14.833622989644352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction-to-actions module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire pipeline runs on modest hardware with no cloud services. On our custom proprietary dataset, Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield an overall 90% success rate; sub-action inference completes in < 3.8 s, and end-to-end execution takes 30-60 s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
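The exact architectures of Instruct2Act and the RAN are not spelled out beyond the abstract, but the stage-one parser can be pictured as a recurrent encoder with attention that tags each token of the command with an atomic action. Below is a minimal PyTorch sketch under that assumption; the action vocabulary, layer sizes, and per-token classification head are all illustrative choices, not the authors' configuration.

```python
# Hypothetical stage-one parser: a BiLSTM encoder with multi-head
# self-attention and a per-token head over an assumed action vocabulary.
import torch
import torch.nn as nn

ACTIONS = ["reach", "grasp", "move", "place", "pour", "wipe", "give", "<pad>"]

class Instruct2ActParser(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, len(ACTIONS))

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (B, T, emb_dim)
        h, _ = self.bilstm(x)            # (B, T, 2 * hidden)
        ctx, _ = self.attn(h, h, h)      # self-attention over the command
        return self.head(ctx)            # (B, T, len(ACTIONS)) logits

# Usage: decode a predicted sub-action sequence for one fake command.
model = Instruct2ActParser(vocab_size=5000)
tokens = torch.randint(0, 5000, (1, 12))
pred = model(tokens).argmax(-1)
print([ACTIONS[i] for i in pred[0].tolist()])
```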
Related papers
- Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons [69.87766750714945]
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints.
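The abstract names two loss terms, which suggests a straightforward combined objective. The sketch below is one hedged reading of it: a mean-squared progress regression on expert frames plus a Bradley-Terry-style preference loss over trajectory pairs. The `reward_model` interface and the weighting term `alpha` are assumptions for illustration.

```python
# One hedged reading of a dual objective: frame-level progress regression
# plus a pairwise trajectory preference loss. `reward_model` maps a batch
# of frames to per-frame scalar rewards; its design is assumed.
import torch.nn.functional as F

def dual_objective(reward_model, expert_frames, progress_targets,
                   traj_better, traj_worse, alpha=1.0):
    # Progress loss: anchor reward magnitude on expert demonstrations.
    pred = reward_model(expert_frames).squeeze(-1)
    progress_loss = F.mse_loss(pred, progress_targets)

    # Preference loss (Bradley-Terry): the preferred trajectory should
    # accumulate higher total reward than the dispreferred one.
    r_better = reward_model(traj_better).sum()
    r_worse = reward_model(traj_worse).sum()
    pref_loss = -F.logsigmoid(r_better - r_worse)

    return progress_loss + alpha * pref_loss
```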
arXiv Detail & Related papers (2026-03-02T17:38:58Z) - InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy [138.89177083578213]
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning examples, and (ii) spatially guided action post-training. Results: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka.
arXiv Detail & Related papers (2025-10-15T17:30:05Z) - Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration [58.4036440289082]
Hand-object motion capture (MoCap) offers large-scale, contact-rich demonstrations and holds promise for scaling dexterous robotic manipulation. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale.
arXiv Detail & Related papers (2025-09-11T17:59:07Z) - Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based reinforcement learning algorithm that learns a critic network outputting Q-values over a sequence of actions. Experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks.
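A critic that scores whole action sequences can be realized by widening the Q-head so it emits one value per (timestep, action dimension, discrete bin). The sketch below shows that idea in PyTorch; it omits CQN-AS's coarse-to-fine discretization for brevity, and all shapes are assumed.

```python
# A critic emitting Q-values over an action sequence: one value per
# (timestep, action dimension, discrete bin). Shapes are assumptions.
import torch
import torch.nn as nn

class SequenceCritic(nn.Module):
    def __init__(self, obs_dim, seq_len=4, act_dims=7, bins=5, hidden=256):
        super().__init__()
        self.seq_len, self.act_dims, self.bins = seq_len, act_dims, bins
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, seq_len * act_dims * bins),
        )

    def forward(self, obs):
        q = self.net(obs)
        return q.view(-1, self.seq_len, self.act_dims, self.bins)

critic = SequenceCritic(obs_dim=64)
q = critic(torch.randn(1, 64))
greedy_sequence = q.argmax(-1)   # (1, 4, 7): a whole discrete action sequence
```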
arXiv Detail & Related papers (2024-11-19T01:23:52Z) - PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation [68.17081518640934]
We propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R).
PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module.
Our PIVOT-R outperforms state-of-the-art open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks.
arXiv Detail & Related papers (2024-10-14T11:30:18Z) - Autoregressive Action Sequence Learning for Robotic Manipulation [32.9580007141312]
Existing autoregressive architectures generate end-effector waypoints sequentially, like word tokens in language modeling. We extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step. We propose the Autoregressive Policy architecture, which solves manipulation tasks by generating hybrid action sequences.
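Predicting several tokens per step can be approximated by decoding a chunk of tokens from a single hidden state instead of one. The following sketch illustrates that mechanism; the chunk size, vocabulary, and dimensions are assumptions, not the Autoregressive Policy's actual design.

```python
# Decoding a fixed-size chunk of tokens from one hidden state, as a
# stand-in for multi-token prediction. All sizes are illustrative.
import torch
import torch.nn as nn

class ChunkHead(nn.Module):
    def __init__(self, d_model=256, vocab=512, chunk=4):
        super().__init__()
        self.chunk, self.vocab = chunk, vocab
        self.proj = nn.Linear(d_model, chunk * vocab)

    def forward(self, hidden):                  # hidden: (B, d_model)
        logits = self.proj(hidden)
        return logits.view(-1, self.chunk, self.vocab)

head = ChunkHead()
h = torch.randn(2, 256)                         # final decoder hidden states
tokens = head(h).argmax(-1)                     # (2, 4): four tokens per step
```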
arXiv Detail & Related papers (2024-10-04T04:07:15Z) - Affordance-based Robot Manipulation with Flow Matching [7.51335919610328]
We present a framework for assistive robot manipulation. We tackle two challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, and second, effectively learning robot action trajectories by grounding the visual affordance model. We learn robot action trajectories guided by affordances with a supervised flow matching method.
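Supervised flow matching regresses a velocity field that transports noise samples to expert actions along a chosen interpolation path. A minimal sketch with straight-line (rectified-flow-style) paths and an assumed affordance-conditioning vector:

```python
# Flow matching on action trajectories with straight-line interpolants.
# `VelocityNet` and the affordance conditioning vector are assumptions.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, act_dim, cond_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + 1 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, x, t, cond):
        return self.net(torch.cat([x, t, cond], dim=-1))

def flow_matching_loss(velocity_net, expert_actions, affordance):
    noise = torch.randn_like(expert_actions)
    t = torch.rand(expert_actions.shape[0], 1)   # interpolation times in [0, 1)
    x_t = (1 - t) * noise + t * expert_actions   # linear interpolant
    target_v = expert_actions - noise            # constant target velocity
    pred_v = velocity_net(x_t, t, affordance)
    return ((pred_v - target_v) ** 2).mean()

net = VelocityNet(act_dim=7, cond_dim=32)
loss = flow_matching_loss(net, torch.randn(16, 7), torch.randn(16, 32))
```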
arXiv Detail & Related papers (2024-09-02T09:11:28Z) - Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
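One way to combine track prediction with a residual policy is to derive a coarse action from the predicted point tracks and let a small network output a correction on top of it. The sketch below assumes that interface; the coarse-action computation from tracks is left abstract.

```python
# A residual policy over a coarse action derived from predicted tracks.
# The interface (obs vector, coarse action) is an assumption.
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs, coarse_action):
        # Learned correction added on top of the track-derived action.
        return coarse_action + self.net(torch.cat([obs, coarse_action], -1))

policy = ResidualPolicy(obs_dim=32, act_dim=7)
action = policy(torch.randn(1, 32), torch.zeros(1, 7))
```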
arXiv Detail & Related papers (2024-05-02T17:56:55Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - Verifiably Following Complex Robot Instructions with Foundation Models [16.564788361518197]
Language Instruction grounding for Motion Planning (LIMP) is an approach that enables robots to follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment.
arXiv Detail & Related papers (2024-02-18T08:05:54Z)