On Geometric Structures for Policy Parameterization in Continuous Control
- URL: http://arxiv.org/abs/2511.08234v2
- Date: Sun, 16 Nov 2025 21:02:53 GMT
- Title: On Geometric Structures for Policy Parameterization in Continuous Control
- Authors: Zhihao Lin,
- Abstract summary: We propose a novel, computationally efficient action generation paradigm that preserves the structural benefits of operating on a unit manifold.<n>Our method decomposes the action into a deterministic directional vector and a learnable concentration, enabling efficient between the target direction and uniform noise.<n> Empirically, our method matches or exceeds state-of-the-art methods on standard continuous control benchmarks.
- Score: 7.056222499095849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Standard stochastic policies for continuous control often rely on ad-hoc boundary-enforcing transformations (e.g., tanh) which can distort the underlying optimization landscape and introduce gradient pathologies. While alternative parameterizations on the unit manifold (e.g., directional distributions) are theoretically appealing, their computational complexity (often requiring special functions or rejection sampling) has limited their practical use. We propose a novel, computationally efficient action generation paradigm that preserves the structural benefits of operating on a unit manifold. Our method decomposes the action into a deterministic directional vector and a learnable concentration scalar, enabling efficient interpolation between the target direction and uniform noise on the unit manifold. This design can reduce policy head parameters by nearly 50\% (from $2d$ to $d+1$) and maintains a simple $O(d)$ sampling complexity, avoiding costly sampling procedures. Empirically, our method matches or exceeds state-of-the-art methods on standard continuous control benchmarks, with significant improvements (e.g., +37.6\% and +112\%) on high-dimensional locomotion tasks. Ablation studies confirm that both the unit-norm normalization and the adaptive concentration mechanism are essential to the method's success. These findings suggest that robust, efficient control can be achieved by explicitly respecting the structure of bounded action spaces, rather than relying on complex, unbounded distributions. Code is available in supplementary materials.
Related papers
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimize the accuracy-efficiency trade-off via principled resource allocation.<n>We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z) - TABES: Trajectory-Aware Backward-on-Entropy Steering for Masked Diffusion Models [35.327100592206115]
Backward-on-Entropy (BoE) Steering is a gradient-guided inference framework that approximates infinite-horizon context via a single backward pass.<n>To ensure scalability, we introduce ttexttActiveQueryAttention, a sparse adjoint primitive that exploits the structure of the masking objective to reduce backward pass complexity.
arXiv Detail & Related papers (2026-01-30T19:10:32Z) - Distributionally Robust Control with End-to-End Statistically Guaranteed Metric Learning [5.309590159815129]
We propose a novel end-to-end finite-horizon Wasserstein DRC framework.<n>It integrates the learning of anisotropic Wasserstein metrics with downstream control tasks in a closed-loop manner.<n>We show that the proposed framework achieves superior closed-loop performance and robustness compared with state-of-the-art methods.
arXiv Detail & Related papers (2025-10-11T13:40:49Z) - Reasoning through Exploration: A Reinforcement Learning Framework for Robust Function Calling [35.97270347306353]
We propose textbfEGPO, a new RL framework built upon Group Relative Policy Optimization (GRPO)<n>The core of EGPO is an entropy-enhanced advantage function that integrates the entropy of the model's Chain-of-Thought (CoT) into the policy gradient.<n>On the challenging Berkeley Function Calling Leaderboard (BFCL), a 4B- parameter model trained with EGPO sets a new state-of-the-art among models of comparable size.
arXiv Detail & Related papers (2025-08-07T07:51:38Z) - GeoAda: Efficiently Finetune Geometric Diffusion Models with Equivariant Adapters [61.51810815162003]
We propose an SE(3)-equivariant adapter framework ( GeoAda) that enables flexible and parameter-efficient fine-tuning for controlled generative tasks.<n>GeoAda preserves the model's geometric consistency while mitigating overfitting and catastrophic forgetting.<n>We demonstrate the wide applicability of GeoAda across diverse geometric control types, including frame control, global control, subgraph control, and a broad range of application domains.
arXiv Detail & Related papers (2025-07-02T18:44:03Z) - Distribution Parameter Actor-Critic: Shifting the Agent-Environment Boundary for Diverse Action Spaces [22.711839917754375]
We introduce a novel reinforcement learning (RL) framework that treats distribution parameters as actions.<n>This reization makes the new action space continuous, regardless of the original action type.<n>We demonstrate competitive performance on the same environments with discretized action spaces.
arXiv Detail & Related papers (2025-06-19T21:19:19Z) - Flow Matching Ergodic Coverage [0.0]
Existing ergodic coverage methods are constrained by the limited set of ergodic metrics available for control synthesis.<n>We propose an alternative approach to ergodic coverage based on flow matching, a technique widely used in generative inference for efficient and scalable sampling.<n>Our formulation enables alternative ergodic metrics from generative inference that overcome the limitations of existing ones.
arXiv Detail & Related papers (2025-04-24T18:18:35Z) - Tuning-Free Structured Sparse PCA via Deep Unfolding Networks [5.931547772157972]
We propose a new type of sparse principal component analysis (PCA) for unsupervised feature selection (UFS)<n>We use an interpretable deep unfolding network that translates iterative optimization steps into trainable neural architectures.<n>This innovation enables automatic learning of the regularization parameters, effectively bypassing the empirical tuning requirements of conventional methods.
arXiv Detail & Related papers (2025-02-28T08:32:51Z) - Generalization Bounds of Surrogate Policies for Combinatorial Optimization Problems [53.03951222945921]
We analyze smoothed (perturbed) policies, adding controlled random perturbations to the direction used by the linear oracle.<n>Our main contribution is a generalization bound that decomposes the excess risk into perturbation bias, statistical estimation error, and optimization error.<n>We illustrate the scope of the results on applications such as vehicle scheduling, highlighting how smoothing enables both tractable training and controlled generalization.
arXiv Detail & Related papers (2024-07-24T12:00:30Z) - Disentangled Representation Learning with the Gromov-Monge Gap [65.73194652234848]
Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning.<n>We introduce a novel approach to disentangled representation learning based on quadratic optimal transport.<n>We demonstrate the effectiveness of our approach for quantifying disentanglement across four standard benchmarks.
arXiv Detail & Related papers (2024-07-10T16:51:32Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - FlowPG: Action-constrained Policy Gradient with Normalizing Flows [14.98383953401637]
Action-constrained reinforcement learning (ACRL) is a popular approach for solving safety-critical resource-alential related decision making problems.
A major challenge in ACRL is to ensure agent taking a valid action satisfying constraints in each step.
arXiv Detail & Related papers (2024-02-07T11:11:46Z) - Stochastic Optimal Control Matching [53.156277491861985]
Our work introduces Optimal Control Matching (SOCM), a novel Iterative Diffusion Optimization (IDO) technique for optimal control.
The control is learned via a least squares problem by trying to fit a matching vector field.
Experimentally, our algorithm achieves lower error than all the existing IDO techniques for optimal control.
arXiv Detail & Related papers (2023-12-04T16:49:43Z) - Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level
Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z) - Learning Sampling Distributions for Model Predictive Control [36.82905770866734]
Sampling-based approaches to Model Predictive Control (MPC) have become a cornerstone of contemporary approaches to MPC.
We propose to carry out all operations in the latent space, allowing us to take full advantage of the learned distribution.
Specifically, we frame the learning problem as bi-level optimization and show how to train the controller with backpropagation-through-time.
arXiv Detail & Related papers (2022-12-05T20:35:36Z) - Smoothing Policy Iteration for Zero-sum Markov Games [9.158672246275348]
We propose the smoothing policy robustness (SPI) algorithm to solve the zero-sum MGs approximately.
Specially, the adversarial policy is served as the weight function to enable an efficient sampling over action spaces.
We also propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with the function approximations.
arXiv Detail & Related papers (2022-12-03T14:39:06Z) - Deep Learning Approximation of Diffeomorphisms via Linear-Control
Systems [91.3755431537592]
We consider a control system of the form $dot x = sum_i=1lF_i(x)u_i$, with linear dependence in the controls.
We use the corresponding flow to approximate the action of a diffeomorphism on a compact ensemble of points.
arXiv Detail & Related papers (2021-10-24T08:57:46Z) - Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
arXiv Detail & Related papers (2021-02-16T14:53:55Z) - Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it on correlated actions, and combine these critic estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.