Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows
- URL: http://arxiv.org/abs/2602.09580v1
- Date: Tue, 10 Feb 2026 09:28:20 GMT
- Title: Sample-Efficient Real-World Dexterous Policy Fine-Tuning via Action-Chunked Critics and Normalizing Flows
- Authors: Chenyu Yang, Denis Tarasov, Davide Liconti, Hehui Zheng, Robert K. Katzschmann
- Abstract summary: Real-world fine-tuning of dexterous manipulation policies is challenging due to limited real-world interaction budgets and highly multimodal action distributions. We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework based on normalizing flows (NFs). This is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware.
- Score: 11.159970460746164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world fine-tuning of dexterous manipulation policies remains challenging due to limited real-world interaction budgets and highly multimodal action distributions. Diffusion-based policies, while expressive, do not permit conservative likelihood-based updates during fine-tuning because action probabilities are intractable. In contrast, conventional Gaussian policies collapse under multimodality, particularly when actions are executed in chunks, and standard per-step critics fail to align with chunked execution, leading to poor credit assignment. We present SOFT-FLOW, a sample-efficient off-policy fine-tuning framework based on normalizing flows (NFs) that addresses these challenges. The normalizing flow policy yields exact likelihoods for multimodal action chunks, allowing conservative, stable policy updates through likelihood regularization and thereby improving sample efficiency. An action-chunked critic evaluates entire action sequences, aligning value estimation with the policy's temporal structure and improving long-horizon credit assignment. To our knowledge, this is the first demonstration of a likelihood-based, multimodal generative policy combined with chunk-level value learning on real robotic hardware. We evaluate SOFT-FLOW on two challenging dexterous manipulation tasks in the real world: cutting tape with scissors retrieved from a case, and in-hand cube rotation with a palm-down grasp -- both of which require precise, dexterous control over long horizons. On these tasks, SOFT-FLOW achieves stable, sample-efficient adaptation where standard methods struggle.
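The abstract names two concrete mechanisms: a normalizing-flow policy with exact log-likelihoods over whole action chunks, and a critic defined on whole chunks. Below is a minimal, hypothetical PyTorch sketch of both; the architecture, dimensions, and the exact form of the likelihood regularizer are our assumptions, not details from the paper.

```python
import math
import torch
import torch.nn as nn

STATE_DIM, ACT_DIM, CHUNK = 16, 4, 8      # assumed sizes
FLAT = ACT_DIM * CHUNK                     # the flow models flattened chunks
HALF = FLAT // 2

class Coupling(nn.Module):
    """Affine coupling layer conditioned on the state; exactly invertible."""
    def __init__(self, flip):
        super().__init__()
        self.flip = flip
        self.net = nn.Sequential(nn.Linear(HALF + STATE_DIM, 128), nn.Tanh(),
                                 nn.Linear(128, 2 * HALF))

    def _params(self, cond, s):
        scale, shift = self.net(torch.cat([cond, s], -1)).chunk(2, -1)
        return torch.tanh(scale), shift    # bounded scale for stability

    def forward(self, x, s):
        x1, x2 = x.chunk(2, -1)
        if self.flip:
            x1, x2 = x2, x1
        scale, shift = self._params(x1, s)
        y2 = x2 * scale.exp() + shift
        y1, y2 = (y2, x1) if self.flip else (x1, y2)
        return torch.cat([y1, y2], -1), scale.sum(-1)   # forward log|det J|

    def inverse(self, y, s):
        y1, y2 = y.chunk(2, -1)
        if self.flip:
            y1, y2 = y2, y1
        scale, shift = self._params(y1, s)
        x2 = (y2 - shift) * (-scale).exp()
        x1, x2 = (x2, y1) if self.flip else (y1, x2)
        return torch.cat([x1, x2], -1), -scale.sum(-1)

class FlowPolicy(nn.Module):
    """Flow policy over whole action chunks with exact log-probabilities."""
    def __init__(self, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(Coupling(i % 2 == 1) for i in range(n_layers))

    def _base_logp(self, z):
        return -0.5 * (z ** 2).sum(-1) - 0.5 * FLAT * math.log(2 * math.pi)

    def sample_with_logprob(self, s):
        z = torch.randn(s.shape[0], FLAT)
        logp = self._base_logp(z)
        for layer in self.layers:
            z, ld = layer(z, s)
            logp = logp - ld
        return z.view(-1, CHUNK, ACT_DIM), logp

    def log_prob(self, chunk, s):
        x, logdet = chunk.flatten(1), 0.0
        for layer in reversed(self.layers):
            x, ld = layer.inverse(x, s)
            logdet = logdet + ld
        return self._base_logp(x) + logdet

class ChunkCritic(nn.Module):
    """Q(s, a_{t:t+H}): values the entire chunk, matching chunked execution."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + FLAT, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, s, chunk):
        return self.net(torch.cat([s, chunk.flatten(1)], -1)).squeeze(-1)

policy, critic = FlowPolicy(), ChunkCritic()
ref = FlowPolicy()
ref.load_state_dict(policy.state_dict())   # stands in for the pretrained policy
for p in ref.parameters():
    p.requires_grad_(False)

s = torch.randn(32, STATE_DIM)
chunk, logp = policy.sample_with_logprob(s)
# Conservative update: maximize chunk value while a sample-based KL(pi||pi_ref)
# keeps the fine-tuned flow near the pretrained one.
kl = (logp - ref.log_prob(chunk, s)).mean()
actor_loss = -critic(s, chunk).mean() + 0.1 * kl
actor_loss.backward()
```

The sample-based KL term is only computable because the flow yields exact likelihoods; this is the kind of conservative, likelihood-based update the abstract contrasts with diffusion policies, whose action probabilities are intractable.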
Related papers
- Primary-Fine Decoupling for Action Generation in Robotic Imitation [91.2899765310853]
Multi-modal distributions in robotic manipulation action sequences pose critical challenges for imitation learning. We propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. PF-DAG outperforms state-of-the-art baselines across 56 tasks from the Adroit, DexArt, and MetaWorld benchmarks.
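A speculative reading of the two-stage decoupling, sketched in PyTorch; PF-DAG's actual architecture is not described in the summary, so every module here (a deterministic coarse head, a noise-conditioned fine head) is a stand-in.

```python
import torch
import torch.nn as nn

class PrimaryHead(nn.Module):      # stage 1: one consistent coarse action
    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim))

    def forward(self, obs):
        return self.net(obs)

class FineHead(nn.Module):         # stage 2: multimodal refinement around it
    def __init__(self, obs_dim=32, act_dim=7, noise_dim=4):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim + noise_dim, 64),
                                 nn.ReLU(), nn.Linear(64, act_dim))

    def forward(self, obs, coarse):
        z = torch.randn(obs.shape[0], self.noise_dim)   # carries multimodality
        return self.net(torch.cat([obs, coarse, z], -1))

obs = torch.randn(8, 32)
primary, fine = PrimaryHead(), FineHead()
coarse = primary(obs)                      # deterministic, mode-consistent
action = coarse + 0.1 * fine(obs, coarse)  # small stochastic refinement
```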
arXiv Detail & Related papers (2026-02-25T08:36:45Z)
- PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning [6.836651088754774]
PolicyFlow is a novel on-policy CNF-based reinforcement learning algorithm. It integrates expressive CNF policies with PPO-style objectives without requiring likelihood evaluation along the full flow path. PolicyFlow approximates importance ratios using velocity field variations along a simple path, reducing computational overhead without compromising training stability.
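For orientation, the sketch below shows the expensive full-path ratio estimate that PolicyFlow avoids: integrating the instantaneous change of variables with Hutchinson divergence probes along the sampled path, then feeding the ratio into a PPO-style clipped loss. The paper replaces the divergence integral with a cheaper velocity-field approximation; all names and sizes here are illustrative.

```python
import torch
import torch.nn as nn

class Velocity(nn.Module):
    """Velocity field v(x, s, t) of a continuous normalizing flow policy."""
    def __init__(self, obs_dim=16, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim + 1, 64),
                                 nn.Tanh(), nn.Linear(64, act_dim))

    def forward(self, x, obs, t):
        return self.net(torch.cat([x, obs, t.expand(x.shape[0], 1)], -1))

def hutchinson_div(f, x):
    """Unbiased single-probe estimate of div(f) at x."""
    eps = torch.randn_like(x)
    g = torch.autograd.grad((f * eps).sum(), x, create_graph=True)[0]
    return (g * eps).sum(-1)

def sample_and_logratio(v_new, v_old, obs, act_dim=4, steps=8):
    """Euler-integrate the new policy's flow; accumulate log(pi_new/pi_old)
    from the instantaneous change of variables d(log p)/dt = -div(v)."""
    x = torch.randn(obs.shape[0], act_dim)
    log_ratio = torch.zeros(obs.shape[0])
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt)
        x = x.detach().requires_grad_(True)    # detach between steps (sketch)
        f_new, f_old = v_new(x, obs, t), v_old(x, obs, t)
        log_ratio = log_ratio - dt * hutchinson_div(f_new - f_old, x)
        x = x + dt * f_new                     # Euler step of the new flow
    return x, log_ratio

v_new, v_old = Velocity(), Velocity()
obs, adv = torch.randn(16, 16), torch.randn(16)   # advantages from a critic
action, log_ratio = sample_and_logratio(v_new, v_old, obs)
ratio = log_ratio.exp()
ppo_loss = -torch.min(ratio * adv, ratio.clamp(0.8, 1.2) * adv).mean()
```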
arXiv Detail & Related papers (2026-02-01T11:08:09Z)
- Ratio-Variance Regularized Policy Optimization for Efficient LLM Fine-tuning [48.34492357368989]
We propose a primal-dual framework that supports stable on-policy learning and enables principled off-policy data reuse. R2VPO achieves superior performance with average relative gains of up to 17% over strong clipping-based baselines.
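A guess at the core regularizer: penalize the variance of importance ratios rather than hard-clipping them. The paper's actual method is a primal-dual scheme with a learned dual variable; the fixed-weight version below is only a sketch.

```python
import torch

def r2vpo_style_loss(logp_new, logp_old, advantages, lam=1.0):
    """Off-policy surrogate plus a ratio-variance penalty (fixed weight here;
    the paper tunes it via a dual variable)."""
    ratio = (logp_new - logp_old.detach()).exp()
    surrogate = -(ratio * advantages).mean()
    return surrogate + lam * ratio.var()      # soft alternative to hard clipping

logp_new = torch.randn(64, requires_grad=True)
loss = r2vpo_style_loss(logp_new, torch.randn(64), torch.randn(64))
loss.backward()
```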
arXiv Detail & Related papers (2026-01-06T14:01:42Z)
- Decoupled Q-Chunking [63.864222078287575]
We use chunked critics to estimate the value of short action sequences ("chunks") rather than individual actions, speeding up value backup. Our key insight is to decouple the chunk length of the critic from that of the policy, allowing the policy to operate over shorter action chunks. This design retains the benefits of multi-step value propagation while sidestepping both the open-loop sub-optimality and the difficulty of learning action chunking policies for long action chunks.
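The chunk-level backup itself is easy to state: a critic over H-step chunks turns TD learning into an H-step backup, which is the value-propagation speedup the summary describes. A minimal sketch follows; the paper's contribution, decoupling the critic's chunk length from the policy's, is not reproduced here.

```python
import torch

def chunked_td_target(rewards, gamma, q_next):
    """rewards: (B, H) per-step rewards collected while executing one chunk;
    q_next: (B,) chunk-critic value at the state reached after the chunk."""
    H = rewards.shape[1]
    discounts = gamma ** torch.arange(H, dtype=rewards.dtype)
    return (rewards * discounts).sum(-1) + gamma ** H * q_next

rewards = torch.rand(32, 8)                       # H = 8 steps per chunk
target = chunked_td_target(rewards, 0.99, torch.rand(32))
```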
arXiv Detail & Related papers (2025-12-11T18:52:51Z)
- Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning [10.037416068775853]
We introduce Guided Flow Policy (GFP), which couples a multi-step flow-matching policy with a distilled one-step actor. The actor directs the flow policy through weighted behavior cloning to focus on cloning high-value actions from the dataset. This mutual guidance enables GFP to achieve state-of-the-art performance across 144 state- and pixel-based tasks.
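A sketch of the weighted-behavior-cloning ingredient, assuming an advantage-exponential weight on a standard flow-matching loss; GFP's distilled one-step actor and the mutual-guidance loop are omitted, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

OBS, ACT = 16, 4
vel = nn.Sequential(nn.Linear(OBS + ACT + 1, 64), nn.ReLU(), nn.Linear(64, ACT))

def weighted_flow_matching_loss(obs, actions, advantages, beta=1.0):
    t = torch.rand(obs.shape[0], 1)
    noise = torch.randn_like(actions)
    x_t = (1 - t) * noise + t * actions            # linear probability path
    target_v = actions - noise                     # its constant velocity
    pred_v = vel(torch.cat([obs, x_t, t], -1))
    w = (beta * advantages).exp().clamp(max=20.0)  # emphasize high-value actions
    return (w * ((pred_v - target_v) ** 2).sum(-1)).mean()

loss = weighted_flow_matching_loss(torch.randn(32, OBS), torch.randn(32, ACT),
                                   torch.randn(32))
loss.backward()
```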
arXiv Detail & Related papers (2025-12-03T17:05:58Z)
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational benefits.
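A toy rendering of the selection loop: sample several candidate chunks from a frozen policy, score each with a pseudo-count, and execute the best-supported one. TACO's actual pseudo-count estimator is different; the kernel density below is a placeholder, and no gradients are ever taken.

```python
import torch

def pseudo_count(chunk, demo_chunks, bandwidth=0.5):
    """Toy kernel-density stand-in for the paper's pseudo-count estimator."""
    d2 = ((demo_chunks - chunk) ** 2).sum(-1)
    return torch.exp(-d2 / (2 * bandwidth ** 2)).sum()

def select_chunk(sample_fn, demo_chunks, k=16):
    candidates = [sample_fn() for _ in range(k)]
    scores = torch.stack([pseudo_count(c, demo_chunks) for c in candidates])
    return candidates[int(scores.argmax())]   # anti-exploration: stay on-support

demo_chunks = torch.randn(500, 32)             # flattened demo action chunks
best = select_chunk(lambda: torch.randn(32), demo_chunks)
```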
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
- Customize Multi-modal RAI Guardrails with Precedent-based predictions [55.63757336900865]
A multi-modal guardrail must effectively filter image content based on user-defined policies. Existing fine-tuning methods typically condition predictions on pre-defined policies. We propose to condition the model's judgment on "precedents", which are the reasoning processes of prior data points similar to the given input.
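A minimal sketch of precedent conditioning: embed the input, retrieve the most similar prior cases, and place their recorded reasoning in the model's context. The embedding model, precedent store, and prompt format are all placeholders.

```python
import numpy as np

def retrieve_precedents(query_emb, precedent_embs, precedents, k=3):
    sims = precedent_embs @ query_emb / (
        np.linalg.norm(precedent_embs, axis=1) * np.linalg.norm(query_emb))
    return [precedents[i] for i in np.argsort(-sims)[:k]]

precedents = [
    {"case": "product photo with logo", "reasoning": "no user policy violated",
     "label": "allow"},
    {"case": "graphic injury image", "reasoning": "violates user rule on gore",
     "label": "block"},
]
precedent_embs = np.random.randn(2, 128)       # stand-in embeddings
query_emb = np.random.randn(128)

context = "\n".join(f"Precedent: {p['case']} -> {p['reasoning']} ({p['label']})"
                    for p in retrieve_precedents(query_emb, precedent_embs, precedents))
prompt = context + "\nNew input: <image>\nJudgment:"
```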
arXiv Detail & Related papers (2025-07-28T03:45:34Z)
- Accuracy of Discretely Sampled Stochastic Policies in Continuous-time Reinforcement Learning [3.973277434105709]
We rigorously analyze a policy execution framework that samples actions from a policy at discrete time points and implements them as piecewise constant controls. We prove that as the sampling mesh size tends to zero, the controlled state process converges weakly to the dynamics with coefficients determined by the policy. Building on these results, we analyze the bias and variance of various policy gradient estimators based on discrete-time observations.
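The execution scheme under analysis is simple to simulate: sample an action only at mesh points and hold it constant while the state evolves on a much finer time grid. A toy NumPy rollout, with stand-in dynamics and policy:

```python
import numpy as np

def rollout(policy_sample, drift, x0, T=1.0, mesh=0.1, sim_dt=0.001):
    x, t, a, next_sample = np.asarray(x0, float), 0.0, None, 0.0
    while t < T:
        if t >= next_sample:                  # resample only on the coarse mesh
            a = policy_sample(x)
            next_sample += mesh
        x = x + sim_dt * drift(x, a)          # piecewise-constant control
        t += sim_dt
    return x

x_T = rollout(policy_sample=lambda x: np.random.choice([-1.0, 1.0]),
              drift=lambda x, a: -x + a, x0=[0.0])
```

Shrinking `mesh` toward `sim_dt` approximates the limit in which the paper proves weak convergence.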
arXiv Detail & Related papers (2025-03-13T02:35:23Z)
- How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation [17.638831964639834]
Behavior cloning policies are increasingly successful at solving complex tasks by learning from human demonstrations.
We present a framework that provides a tight lower bound on robot performance in an arbitrary environment.
In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware.
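The basic statistical object here is a high-confidence lower bound on a success probability estimated from a finite number of rollouts. The paper's framework is more general (it accounts for evaluation in new environments); the one-sided Clopper-Pearson bound below is the textbook special case, shown only for orientation.

```python
from scipy.stats import beta

def success_rate_lower_bound(successes, trials, confidence=0.95):
    """One-sided Clopper-Pearson lower bound on a binomial success rate."""
    if successes == 0:
        return 0.0
    return beta.ppf(1 - confidence, successes, trials - successes + 1)

# e.g. 43 successes in 50 hardware rollouts
print(success_rate_lower_bound(43, 50))   # ~0.76 at 95% confidence
```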
arXiv Detail & Related papers (2024-05-08T22:00:35Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
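A sketch of the convolution idea as the summary states it: smooth both policies over an action-embedding space before forming importance weights, trading a little bias for much lower variance. The kernel, embeddings, and data below are illustrative, not the paper's construction.

```python
import numpy as np

def convolve_policy(probs, action_embs, bandwidth=1.0):
    d2 = ((action_embs[:, None] - action_embs[None]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * bandwidth ** 2))
    K /= K.sum(axis=1, keepdims=True)          # row-stochastic smoothing kernel
    return probs @ K                            # mass spreads to similar actions

rng = np.random.default_rng(0)
n_actions = 100
embs = rng.normal(size=(n_actions, 8))         # latent action structure
pi_log = rng.dirichlet(np.ones(n_actions))     # logging policy
pi_tgt = rng.dirichlet(np.ones(n_actions))     # target policy
pi_log_c, pi_tgt_c = convolve_policy(pi_log, embs), convolve_policy(pi_tgt, embs)

actions = rng.choice(n_actions, size=1000, p=pi_log)   # logged data
rewards = rng.random(1000)
weights = pi_tgt_c[actions] / pi_log_c[actions]        # convolved propensities
print((weights * rewards).mean())                      # PC-style IPS estimate
```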
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.