Related papers: Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies

Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies

URL: http://arxiv.org/abs/2508.05433v2
Date: Fri, 31 Oct 2025 09:00:22 GMT
Title: Multimodal LLM-assisted Evolutionary Search for Programmatic Control Policies
Authors: Qinglong Hu, Xialiang Tong, Mingxuan Yuan, Fei Liu, Zhichao Lu, Qingfu Zhang,
Abstract summary: This work introduces a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES)<n>MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation.<n> Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks.
Score: 36.44665658496622
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep reinforcement learning has achieved impressive success in control tasks. However, its policies, represented as opaque neural networks, are often difficult for humans to understand, verify, and debug, which undermines trust and hinders real-world deployment. This work addresses this challenge by introducing a novel approach for programmatic control policy discovery, called Multimodal Large Language Model-assisted Evolutionary Search (MLES). MLES utilizes multimodal large language models as programmatic policy generators, combining them with evolutionary search to automate policy generation. It integrates visual feedback-driven behavior analysis within the policy generation process to identify failure patterns and guide targeted improvements, thereby enhancing policy discovery efficiency and producing adaptable, human-aligned policies. Experimental results demonstrate that MLES achieves performance comparable to Proximal Policy Optimization (PPO) across two standard control tasks while providing transparent control logic and traceable design processes. This approach also overcomes the limitations of predefined domain-specific languages, facilitates knowledge transfer and reuse, and is scalable across various tasks, showing promise as a new paradigm for developing transparent and verifiable control policies.

Related papers

Code Evolution for Control: Synthesizing Policies via LLM-Driven Evolutionary Search [15.216159860533397]
We show that evolutionary search can effectively synthesize interpretable control policies in the form of executable code.<n>We implement our approach using EvoToolkit, a framework that seamlessly integrates evolutionary computation with customizable fitness evaluation.<n>This work highlights the potential of combining foundation models with evolutionary computation for trustworthy control policies in autonomous systems.
arXiv Detail & Related papers (2026-01-11T10:21:22Z)
Policy-Conditioned Policies for Multi-Agent Task Solving [53.67744322553693]
In this work, we propose a paradigm shift that bridges the gap by representing policies as human-interpretable source code.<n>We reformulate the learning problem by utilizing Large Language Models (LLMs) as approximate interpreters.<n>We formalize this process as textitProgrammatic Iterated Best Response (PIBR), an algorithm where the policy code is optimized by textual gradients.
arXiv Detail & Related papers (2025-12-24T07:42:10Z)
HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents [6.861838493263133]
HyCodePolicy is a language-based control framework for embodied agents.<n>It integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle.<n>Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies.
arXiv Detail & Related papers (2025-08-04T17:18:14Z)
Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems [25.882461853973897]
We propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), which guides policy updates by estimating relative reward advantages.<n>MHGPO eliminates the need for Critic networks, enhancing stability and reducing computational overhead.<n>We also introduce three group rollout sampling strategies that trade off between efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-03T10:17:19Z)
Synthesizing Interpretable Control Policies through Large Language Model Guided Search [7.706225175516503]
We represent control policies as programs in standard languages like Python.<n>We evaluate candidate controllers in simulation and evolve them using a pre-trained LLM.<n>We illustrate our method through its application to the synthesis of an interpretable control policy for the pendulum swing-up and the ball in cup tasks.
arXiv Detail & Related papers (2024-10-07T18:12:20Z)
Learning Multimodal Behaviors from Scratch with Diffusion Policy Gradient [26.675822002049372]
Deep Diffusion Policy Gradient (DDiffPG) is a novel actor-critic algorithm that learns from scratch multimodal policies. DDiffPG forms a multimodal training batch and utilizes mode-specific Q-learning to mitigate the inherent greediness of the RL objective. Our approach further allows the policy to be conditioned on mode-specific embeddings to explicitly control the learned modes.
arXiv Detail & Related papers (2024-06-02T09:32:28Z)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.<n>We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)<n>Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.<n>We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces. We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories. We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model the policies in structured action space as the energy-based models (EBM) A novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler. Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z)
Efficient Domain Coverage for Vehicles with Second-Order Dynamics via Multi-Agent Reinforcement Learning [9.939081691797858]
We present a reinforcement learning (RL) approach for the multi-agent efficient domain coverage problem involving agents with second-order dynamics. Our proposed network architecture includes the incorporation of LSTM and self-attention, which allows the trained policy to adapt to a variable number of agents.
arXiv Detail & Related papers (2022-11-11T01:59:12Z)
Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm. We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z)
Policy Search for Model Predictive Control with Application to Agile Drone Flight [56.24908013905407]
We propose a policy-search-for-model-predictive-control framework for MPC. Specifically, we formulate the MPC as a parameterized controller, where the hard-to-optimize decision variables are represented as high-level policies. Experiments show that our controller achieves robust and real-time control performance in both simulation and the real world.
arXiv Detail & Related papers (2021-12-07T17:39:24Z)
Generative Actor-Critic: An Off-policy Algorithm Using the Push-forward Model [24.030426634281643]
In continuous control tasks, widely used policies with Gaussian distributions results in ineffective exploration of environments. We propose a density-free off-policy algorithm, Generative Actor-Critic, using the push-forward model to increase the expressiveness of policies. We show that push-forward policies possess desirable features, such as multi-modality, which can improve the efficiency of exploration and performance of algorithms obviously.
arXiv Detail & Related papers (2021-05-08T16:29:20Z)
Imitation Learning from MPC for Quadrupedal Multi-Gait Control [63.617157490920505]
We present a learning algorithm for training a single policy that imitates multiple gaits of a walking robot. We use and extend MPC-Net, which is an Imitation Learning approach guided by Model Predictive Control. We validate our approach on hardware and show that a single learned policy can replace its teacher to control multiple gaits.
arXiv Detail & Related papers (2021-03-26T08:48:53Z)
Continuous Action Reinforcement Learning from a Mixture of Interpretable Experts [35.80418547105711]
We propose a policy scheme that retains a complex function approxor for its internal value predictions but constrains the policy to have a concise, hierarchical, and human-readable structure. The main technical contribution of the paper is to address the challenges introduced by this non-differentiable state selection procedure.
arXiv Detail & Related papers (2020-06-10T16:02:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.