A Policy-Improved Deep Deterministic Policy Gradient Framework for the Discount Order Acceptance Strategy of Ride-hailing Drivers
- URL: http://arxiv.org/abs/2507.11865v1
- Date: Wed, 16 Jul 2025 03:24:54 GMT
- Title: A Policy-Improved Deep Deterministic Policy Gradient Framework for the Discount Order Acceptance Strategy of Ride-hailing Drivers
- Authors: Hanwen Dai, Chang Gao, Fang He, Congyuan Ji, Yanni Yang
- Abstract summary: Third-party integrators provide Discount Express service delivered by express drivers at lower trip fares. This study aims to dynamically manage drivers' acceptance of Discount Express from the perspective of individual platforms. We propose a policy-improved deep deterministic policy gradient (pi-DDPG) framework.
- Score: 7.172675922077926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid expansion of platform integration has emerged as an effective solution to mitigate market fragmentation by consolidating multiple ride-hailing platforms into a single application. To address heterogeneous passenger preferences, third-party integrators provide Discount Express service delivered by express drivers at lower trip fares. For the individual platform, encouraging broader participation of drivers in Discount Express services has the potential to expand the accessible demand pool and improve matching efficiency, but often at the cost of reduced profit margins. This study aims to dynamically manage drivers' acceptance of Discount Express from the perspective of individual platforms. The lack of historical data under the new business model necessitates online learning. However, early-stage exploration through trial and error can be costly in practice, highlighting the need for reliable early-stage performance in real-world deployment. To address these challenges, this study formulates the decision regarding the proportion of drivers' acceptance behavior as a continuous control task. In response to the high stochasticity, the opaque matching mechanisms employed by the third-party integrator, and the limited availability of historical data, we propose a policy-improved deep deterministic policy gradient (pi-DDPG) framework. The proposed framework incorporates a refiner module to boost policy performance during the early training phase, leverages a convolutional long short-term memory network to effectively capture complex spatiotemporal patterns, and adopts a prioritized experience replay mechanism to enhance learning efficiency. A simulator based on a real-world dataset is developed to validate the effectiveness of the proposed pi-DDPG. Numerical experiments demonstrate that pi-DDPG achieves superior learning efficiency and significantly reduces early-stage training losses.
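For intuition, the sketch below shows the skeleton such a framework might take: a deterministic actor that outputs the acceptance proportion, a Q-critic, and a proportional prioritized replay buffer. It is a minimal illustration assuming standard DDPG components; the paper's refiner module, ConvLSTM feature extractor, and target-network soft updates are omitted, and all dimensions, names, and hyperparameters are placeholders rather than values from the paper.

```python
# Minimal DDPG + proportional prioritized replay sketch (illustrative only).
# Assumed: flat state features and a scalar action in [0, 1] for the
# acceptance proportion; the paper's refiner and ConvLSTM are omitted.
import numpy as np
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 16, 1  # placeholder dimensions

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, ACTION_DIM), nn.Sigmoid())
    def forward(self, s):          # deterministic policy: proportion in [0, 1]
        return self.net(s)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64),
                                 nn.ReLU(), nn.Linear(64, 1))
    def forward(self, s, a):       # Q(s, a)
        return self.net(torch.cat([s, a], dim=-1))

class PrioritizedReplay:
    """Proportional prioritization: P(i) ~ |TD error_i|^alpha."""
    def __init__(self, capacity=10_000, alpha=0.6):
        self.buf, self.prio, self.capacity, self.alpha = [], [], capacity, alpha
    def add(self, transition):
        self.prio.append(max(self.prio, default=1.0))  # new samples: max priority
        self.buf.append(transition)
        if len(self.buf) > self.capacity:
            self.buf.pop(0); self.prio.pop(0)
    def sample(self, batch_size):
        p = np.array(self.prio) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.buf), batch_size, p=p)
        return idx, [self.buf[i] for i in idx]
    def update(self, idx, td_errors):
        for i, e in zip(idx, td_errors.flatten().tolist()):
            self.prio[i] = abs(e) + 1e-6

def ddpg_update(actor, critic, actor_t, critic_t, replay,
                opt_a, opt_c, batch_size=64, gamma=0.99):
    # Usage: replay.add((s, a, r, s_next)) with tensors of shapes
    # (STATE_DIM,), (ACTION_DIM,), (1,), (STATE_DIM,); call once the
    # buffer holds at least batch_size transitions.
    idx, batch = replay.sample(batch_size)
    s, a, r, s2 = (torch.stack(x) for x in zip(*batch))
    with torch.no_grad():                      # bootstrapped target Q
        y = r + gamma * critic_t(s2, actor_t(s2))
    td = y - critic(s, a)
    opt_c.zero_grad(); td.pow(2).mean().backward(); opt_c.step()
    opt_a.zero_grad()
    (-critic(s, actor(s)).mean()).backward()   # deterministic policy gradient
    opt_a.step()
    replay.update(idx, td.detach())            # refresh sampling priorities
```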
Related papers
- On-Device Diffusion Transformer Policy for Efficient Robot Manipulation [26.559546714450324]
Diffusion Policies have significantly advanced robotic manipulation tasks via imitation learning. Their application on resource-constrained mobile platforms remains challenging due to computational inefficiency and extensive memory footprint. We propose LightDP, a novel framework specifically designed to accelerate Diffusion Policies for real-time deployment on mobile devices.
arXiv Detail & Related papers (2025-08-01T15:14:39Z) - One Step is Enough: Multi-Agent Reinforcement Learning based on One-Step Policy Optimization for Order Dispatch on Ride-Sharing Platforms [11.43941442981793]
MARL-based ride-sharing approaches heavily rely on the accurate estimation of Q-values or V-values. We propose two novel alternative methods that bypass value function estimation. First, we adapt GRPO to ride-sharing, replacing the PPO baseline with the group average reward to eliminate critic estimation errors. Second, inspired by GRPO's full utilization of group reward information, we customize the PPO framework for ride-sharing platforms and show that, under a homogeneous fleet, the optimal policy can be trained using only one-step rewards.
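To make the first alternative concrete, here is a minimal, hedged sketch of critic-free, group-relative advantage estimation: each sampled action in a group is compared against the group's average reward, so no Q- or V-function is learned. The normalization term and names are illustrative assumptions rather than the paper's exact formulation.

```python
# Group-relative advantages (GRPO-style): the group mean reward replaces
# a learned critic as the baseline. Illustrative sketch, not the paper's code.
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: one-step rewards for a group of actions sampled
    from the same state (shape: (group_size,))."""
    baseline = rewards.mean()           # group average reward as baseline
    scale = rewards.std() + 1e-8        # optional GRPO-style normalization
    return (rewards - baseline) / scale

# Example: four sampled dispatch decisions scored by one-step reward.
advantages = group_relative_advantages(np.array([1.0, 0.5, 2.0, 0.5]))
```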
arXiv Detail & Related papers (2025-07-21T08:04:31Z) - Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [62.579951798437115]
This work investigates iterative approximate evaluation for arbitrary prompts. It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework. MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
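One plausible instantiation of such a risk-predictive selector is sketched below: each prompt's success probability gets a Beta-Bernoulli posterior, and Thompson sampling picks prompts whose predicted pass rate is closest to a target difficulty. The model and interface are assumptions for illustration; MoPPS's actual surrogate may differ.

```python
# Hedged sketch of Bayesian prompt-difficulty prediction via a
# Beta-Bernoulli posterior per prompt, with Thompson sampling used to
# select prompts near a target pass rate. Illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

class PromptDifficultyPredictor:
    def __init__(self, n_prompts: int, target: float = 0.5):
        self.a = np.ones(n_prompts)   # Beta posterior: 1 + observed successes
        self.b = np.ones(n_prompts)   # Beta posterior: 1 + observed failures
        self.target = target          # desired pass rate of selected prompts

    def select(self, k: int) -> np.ndarray:
        sampled = rng.beta(self.a, self.b)               # Thompson sample
        return np.argsort(np.abs(sampled - self.target))[:k]

    def update(self, idx, successes, trials):
        self.a[idx] += successes                         # posterior update
        self.b[idx] += trials - successes                # from rollout results
```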
arXiv Detail & Related papers (2025-07-07T03:20:52Z) - Order Acquisition Under Competitive Pressure: A Rapidly Adaptive Reinforcement Learning Approach for Ride-Hailing Subsidy Strategies [0.5717569761927883]
To rapidly adapt to competitors' pricing adjustments, our approach integrates two key techniques: Fast Competition Adaptation (FCA), which enables swift responses to dynamic price changes, and Reinforced Lagrangian Adjustment (RLA), which ensures adherence to budget constraints. Experimental results demonstrate that the proposed method consistently outperforms baseline approaches across diverse market conditions.
arXiv Detail & Related papers (2025-07-03T02:38:42Z) - ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z) - SEVA: Leveraging Single-Step Ensemble of Vicinal Augmentations for Test-Time Adaptation [29.441669360316418]
Test-time adaptation (TTA) aims to enhance model robustness against distribution shifts through rapid model adaptation during inference. Augmentation strategies can effectively unleash the potential of reliable samples, but the rapidly growing computational cost impedes their real-time application. We propose a novel TTA approach named Single-step Ensemble of Vicinal Augmentations (SEVA), which takes advantage of data augmentations without increasing the computational burden.
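A hedged sketch of the single-step idea follows: several vicinal (here, Gaussian-perturbed) copies of a test batch are evaluated in one batched forward pass, their predictions are ensembled, and a single entropy-minimization step adapts the model. The augmentation family and loss are assumptions for illustration, not SEVA's exact recipe.

```python
# Single-step ensemble over vicinal augmentations (illustrative sketch):
# one batched forward pass over all augmented copies, one adaptation step.
import torch
import torch.nn.functional as F

def single_step_vicinal_update(model, x, optimizer, n_aug=4, sigma=0.05):
    # Vicinal copies: small Gaussian perturbations around each input (assumed).
    copies = x.unsqueeze(0) + sigma * torch.randn(n_aug, *x.shape)
    logits = model(copies.flatten(0, 1))            # one batched forward pass
    probs = F.softmax(logits, dim=-1).view(n_aug, x.shape[0], -1)
    ensemble = probs.mean(dim=0)                    # ensemble prediction
    # Minimize the entropy of the ensembled prediction in a single step.
    entropy = -(ensemble * ensemble.clamp_min(1e-8).log()).sum(-1).mean()
    optimizer.zero_grad(); entropy.backward(); optimizer.step()
    return ensemble.detach()
```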
arXiv Detail & Related papers (2025-05-07T02:58:37Z) - OffRIPP: Offline RL-based Informative Path Planning [12.705099730591671]
Informative path planning (IPP) is a crucial task in robotics, where agents must design paths to gather valuable information about a target environment.
We propose an offline RL-based IPP framework that optimizes information gain without requiring real-time interaction during training.
We validate the framework through extensive simulations and real-world experiments.
arXiv Detail & Related papers (2024-09-25T11:30:59Z) - PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods under both low and high compression rates.
arXiv Detail & Related papers (2024-03-14T09:06:49Z) - When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive settings such as clinical decision making and autonomous driving.
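For concreteness, a standard maximum-likelihood IRL objective consistent with this description (the paper's exact bilevel formulation may differ) fits a reward parameter and a learned dynamics model so that the induced policy explains the demonstrations:

```latex
\max_{\theta,\;\hat{P}} \;\; \sum_{(s,a)\in\mathcal{D}} \log \pi_{\theta,\hat{P}}(a \mid s),
\qquad \pi_{\theta,\hat{P}} \text{ the (soft-)optimal policy for reward } r_\theta
\text{ under learned dynamics } \hat{P}.
```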
arXiv Detail & Related papers (2023-02-15T04:14:20Z) - Fair and Efficient Distributed Edge Learning with Hybrid Multipath TCP [62.81300791178381]
The bottleneck of distributed edge learning (DEL) over wireless networks has shifted from computing to communication.
Existing TCP-based data networking schemes for DEL are application-agnostic and fail to deliver adjustments according to application-layer requirements.
We develop a hybrid multipath TCP (MP TCP) scheme for DEL that combines model-based and deep reinforcement learning (DRL) based MP TCP.
arXiv Detail & Related papers (2022-11-03T09:08:30Z) - DROP: Deep relocating option policy for optimal ride-hailing vehicle repositioning [36.31945021412277]
In a ride-hailing system, an optimal relocation of vacant vehicles can significantly reduce fleet idling time and balance the supply-demand distribution.
This study proposes the deep relocating option policy (DROP) that supervises vehicle agents to escape from oversupply areas.
We present a hierarchical learning framework that trains a high-level relocation policy and a set of low-level DROPs.
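The hierarchical control flow this implies can be sketched as follows; the interfaces and the epsilon-greedy high-level chooser are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a two-level relocation controller: a high-level
# policy picks one of several pre-trained low-level option policies
# (DROPs), which then drives the vehicle until the option terminates.
# All interfaces here are assumed for illustration.
import random

class HierarchicalRelocator:
    def __init__(self, high_level_q, options, epsilon=0.1):
        self.q = high_level_q        # assumed callable: (state, option_id) -> value
        self.options = options       # list of low-level DROP policies
        self.epsilon = epsilon

    def act(self, state):
        if random.random() < self.epsilon:            # exploration
            oid = random.randrange(len(self.options))
        else:                                         # greedy option choice
            oid = max(range(len(self.options)),
                      key=lambda i: self.q(state, i))
        option = self.options[oid]
        while not option.terminated(state):           # run option to termination
            state = option.step(state)                # assumed option interface
        return state, oid
```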
arXiv Detail & Related papers (2021-09-09T10:20:53Z) - Value Function is All You Need: A Unified Learning Framework for Ride Hailing Platforms [57.21078336887961]
Large ride-hailing platforms, such as DiDi, Uber and Lyft, connect tens of thousands of vehicles in a city to millions of ride demands throughout the day.
We propose a unified value-based dynamic learning framework (V1D3) for tackling both tasks.
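A minimal sketch of the shared-value-function idea, under assumed tabular states and a generic TD(0) update (not the paper's actual architecture): both dispatching and repositioning score candidates by immediate reward plus the discounted value of the resulting driver state.

```python
# Hedged sketch: one state-value table V serves both decisions.
# States, rewards, and the TD(0) update are illustrative assumptions.
from collections import defaultdict

V = defaultdict(float)  # driver spatiotemporal state -> estimated value

def td0_update(s, r, s_next, alpha=0.1, gamma=0.99):
    """Learn V from a completed trip or repositioning move."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])

def candidate_score(reward, next_state, gamma=0.99):
    """Shared scoring rule for dispatch (reward = fare) and
    repositioning (reward = 0): immediate reward + discounted value."""
    return reward + gamma * V[next_state]
```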
arXiv Detail & Related papers (2021-05-18T19:22:24Z)