Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation
- URL: http://arxiv.org/abs/2602.09940v1
- Date: Tue, 10 Feb 2026 16:25:39 GMT
- Title: Instruct2Act: From Human Instruction to Actions Sequencing and Execution via Robot Action Network for Robotic Manipulation
- Authors: Archit Sharma, Dharmendra Sharma, John Rebeiro, Peeyush Thakur, Narendra Dhar, Laxmidhar Behera
- Abstract summary: We develop a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation.
- Score: 14.833622989644352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robots often struggle to follow free-form human instructions in real-world settings due to computational and sensing limitations. We address this gap with a lightweight, fully on-device pipeline that converts natural-language commands into reliable manipulation. Our approach has two stages: (i) the instruction-to-actions module (Instruct2Act), a compact BiLSTM with a multi-head-attention autoencoder that parses an instruction into an ordered sequence of atomic actions (e.g., reach, grasp, move, place); and (ii) the robot action network (RAN), which uses the dynamic adaptive trajectory radial network (DATRN) together with a vision-based environment analyzer (YOLOv8) to generate precise control trajectories for each sub-action. The entire pipeline runs on modest hardware with no cloud services. On our custom proprietary dataset, Instruct2Act attains 91.5% sub-action prediction accuracy while retaining a small footprint. Real-robot evaluations across four tasks (pick-place, pick-pour, wipe, and pick-give) yield an overall 90% success rate; sub-action inference completes in < 3.8 s, and end-to-end execution takes 30-60 s depending on task complexity. These results demonstrate that fine-grained instruction-to-action parsing, coupled with DATRN-based trajectory generation and vision-guided grounding, provides a practical path to deterministic, real-time manipulation in resource-constrained, single-camera settings.
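The exact architectures of Instruct2Act and the RAN are not spelled out beyond the abstract, but the stage-one parser can be pictured as a recurrent encoder with attention that tags each token of the command with an atomic action. Below is a minimal PyTorch sketch under that assumption; the action vocabulary, layer sizes, and per-token classification head are all illustrative choices, not the authors' configuration.

```python
# Hypothetical stage-one parser: a BiLSTM encoder with multi-head
# self-attention and a per-token head over an assumed action vocabulary.
import torch
import torch.nn as nn

ACTIONS = ["reach", "grasp", "move", "place", "pour", "wipe", "give", "<pad>"]

class Instruct2ActParser(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.head = nn.Linear(2 * hidden, len(ACTIONS))

    def forward(self, token_ids):
        x = self.embed(token_ids)        # (B, T, emb_dim)
        h, _ = self.bilstm(x)            # (B, T, 2 * hidden)
        ctx, _ = self.attn(h, h, h)      # self-attention over the command
        return self.head(ctx)            # (B, T, len(ACTIONS)) logits

# Usage: decode a predicted sub-action sequence for one fake command.
model = Instruct2ActParser(vocab_size=5000)
tokens = torch.randint(0, 5000, (1, 12))
pred = model(tokens).argmax(-1)
print([ACTIONS[i] for i in pred[0].tolist()])
```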
Related papers
- Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons [69.87766750714945]
General-purpose robot reward models are typically trained to predict absolute task progress from expert demonstrations. We introduce Robometer, a scalable reward modeling framework that combines intra-trajectory progress supervision with inter-trajectory preference supervision. Robometer is trained with a dual objective: a frame-level progress loss that anchors reward magnitude on expert data, and a trajectory-comparison preference loss that imposes global ordering constraints.
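The abstract names two loss terms, which suggests a straightforward combined objective. The sketch below is one hedged reading of it: a mean-squared progress regression on expert frames plus a Bradley-Terry-style preference loss over trajectory pairs. The `reward_model` interface and the weighting term `alpha` are assumptions for illustration.

```python
# One hedged reading of a dual objective: frame-level progress regression
# plus a pairwise trajectory preference loss. `reward_model` maps a batch
# of frames to per-frame scalar rewards; its design is assumed.
import torch.nn.functional as F

def dual_objective(reward_model, expert_frames, progress_targets,
                   traj_better, traj_worse, alpha=1.0):
    # Progress loss: anchor reward magnitude on expert demonstrations.
    pred = reward_model(expert_frames).squeeze(-1)
    progress_loss = F.mse_loss(pred, progress_targets)

    # Preference loss (Bradley-Terry): the preferred trajectory should
    # accumulate higher total reward than the dispreferred one.
    r_better = reward_model(traj_better).sum()
    r_worse = reward_model(traj_worse).sum()
    pref_loss = -F.logsigmoid(r_better - r_worse)

    return progress_loss + alpha * pref_loss
```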
arXiv Detail & Related papers (2026-03-02T17:38:58Z) - InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy [138.89177083578213]
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning examples, and (ii) spatially guided action post-training. Results: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka.
arXiv Detail & Related papers (2025-10-15T17:30:05Z) - Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration [58.4036440289082]
Hand-object motion capture (MoCap) offers large-scale, contact-rich demonstrations and holds promise for scaling dexterous robotic manipulation. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale.
arXiv Detail & Related papers (2025-09-11T17:59:07Z) - Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce Coarse-to-fine Q-Network with Action Sequence (CQN-AS), a novel value-based reinforcement learning algorithm that learns a critic network outputting Q-values over a sequence of actions. Experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks.
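A critic that scores whole action sequences can be realized by widening the Q-head so it emits one value per (timestep, action dimension, discrete bin). The sketch below shows that idea in PyTorch; it omits CQN-AS's coarse-to-fine discretization for brevity, and all shapes are assumed.

```python
# A critic emitting Q-values over an action sequence: one value per
# (timestep, action dimension, discrete bin). Shapes are assumptions.
import torch
import torch.nn as nn

class SequenceCritic(nn.Module):
    def __init__(self, obs_dim, seq_len=4, act_dims=7, bins=5, hidden=256):
        super().__init__()
        self.seq_len, self.act_dims, self.bins = seq_len, act_dims, bins
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, seq_len * act_dims * bins),
        )

    def forward(self, obs):
        q = self.net(obs)
        return q.view(-1, self.seq_len, self.act_dims, self.bins)

critic = SequenceCritic(obs_dim=64)
q = critic(torch.randn(1, 64))
greedy_sequence = q.argmax(-1)   # (1, 4, 7): a whole discrete action sequence
```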
arXiv Detail & Related papers (2024-11-19T01:23:52Z) - PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation [68.17081518640934]
We propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R).
PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module.
Our PIVOT-R outperforms state-of-the-art open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks.
arXiv Detail & Related papers (2024-10-14T11:30:18Z) - Autoregressive Action Sequence Learning for Robotic Manipulation [32.9580007141312]
Existing autoregressive architectures generate end-effector waypoints sequentially, like word tokens in language modeling. We extend causal transformers' single-token prediction to support predicting a variable number of tokens in a single step. We propose the Autoregressive Policy architecture, which solves manipulation tasks by generating hybrid action sequences.
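Predicting several tokens per step can be approximated by decoding a chunk of tokens from a single hidden state instead of one. The following sketch illustrates that mechanism; the chunk size, vocabulary, and dimensions are assumptions, not the Autoregressive Policy's actual design.

```python
# Decoding a fixed-size chunk of tokens from one hidden state, as a
# stand-in for multi-token prediction. All sizes are illustrative.
import torch
import torch.nn as nn

class ChunkHead(nn.Module):
    def __init__(self, d_model=256, vocab=512, chunk=4):
        super().__init__()
        self.chunk, self.vocab = chunk, vocab
        self.proj = nn.Linear(d_model, chunk * vocab)

    def forward(self, hidden):                  # hidden: (B, d_model)
        logits = self.proj(hidden)
        return logits.view(-1, self.chunk, self.vocab)

head = ChunkHead()
h = torch.randn(2, 256)                         # final decoder hidden states
tokens = head(h).argmax(-1)                     # (2, 4): four tokens per step
```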
arXiv Detail & Related papers (2024-10-04T04:07:15Z) - Affordance-based Robot Manipulation with Flow Matching [7.51335919610328]
We present a framework for assistive robot manipulation. We tackle two challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, and second, effectively learning robot action trajectories by grounding the visual affordance model. We learn robot action trajectories guided by affordances with a supervised flow matching method.
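Supervised flow matching regresses a velocity field that transports noise samples to expert actions along a chosen interpolation path. A minimal sketch with straight-line (rectified-flow-style) paths and an assumed affordance-conditioning vector:

```python
# Flow matching on action trajectories with straight-line interpolants.
# `VelocityNet` and the affordance conditioning vector are assumptions.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, act_dim, cond_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + 1 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, x, t, cond):
        return self.net(torch.cat([x, t, cond], dim=-1))

def flow_matching_loss(velocity_net, expert_actions, affordance):
    noise = torch.randn_like(expert_actions)
    t = torch.rand(expert_actions.shape[0], 1)   # interpolation times in [0, 1)
    x_t = (1 - t) * noise + t * expert_actions   # linear interpolant
    target_v = expert_actions - noise            # constant target velocity
    pred_v = velocity_net(x_t, t, affordance)
    return ((pred_v - target_v) ** 2).mean()

net = VelocityNet(act_dim=7, cond_dim=32)
loss = flow_matching_loss(net, torch.randn(16, 7), torch.randn(16, 32))
```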
arXiv Detail & Related papers (2024-09-02T09:11:28Z) - Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
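One way to combine track prediction with a residual policy is to derive a coarse action from the predicted point tracks and let a small network output a correction on top of it. The sketch below assumes that interface; the coarse-action computation from tracks is left abstract.

```python
# A residual policy over a coarse action derived from predicted tracks.
# The interface (obs vector, coarse action) is an assumption.
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim))

    def forward(self, obs, coarse_action):
        # Learned correction added on top of the track-derived action.
        return coarse_action + self.net(torch.cat([obs, coarse_action], -1))

policy = ResidualPolicy(obs_dim=32, act_dim=7)
action = policy(torch.randn(1, 32), torch.zeros(1, 7))
```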
arXiv Detail & Related papers (2024-05-02T17:56:55Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - Verifiably Following Complex Robot Instructions with Foundation Models [16.564788361518197]
Language Instruction grounding for Motion Planning (LIMP) is an approach that enables robots to follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment.
arXiv Detail & Related papers (2024-02-18T08:05:54Z)