Differentiable Weights-Varying Nonlinear MPC via Gradient-based Policy Learning: An Autonomous Vehicle Guidance Example

Abstract

Tuning Model Predictive Control (MPC) cost weights for multiple competing objectives is labor-intensive. Derivative-free automated methods, such as Bayesian Optimization, reduce manual effort but remain slow, while Differentiable MPC (Diff-MPC) exploits solver sensitivities for faster gradient-based tuning. However, existing Diff-MPC approaches learn a single global weight set, which may be suboptimal as operating conditions change. Conversely, black-box Reinforcement Learning Weights-Varying MPC (RL-WMPC) requires long training times and large amounts of data. In this work, we introduce gradient-based policy learning for Differentiable Weights-Varying MPC (Diff-WMPC). By backpropagating solver-in-the-loop sensitivities through a lightweight policy that maps look-ahead observations to MPC weights, our Diff-WMPC yields rapid, sample-efficient adaptation at runtime. Extensive simulations with a full-scale racecar model demonstrate that Diff-WMPC outperforms state-of-the-art static-weight baselines and is competitive with weights-varying algorithms, while reducing training time from over an hour to under two minutes relative to RL-WMPC. The learned policy transfers zero-shot to unseen conditions and, with quick online fine-tuning, reaches environment-specific performance.

Contributions

  1. Gradient-based Policy Learning for Weights-Varying MPC: We introduce Differentiable Weights-Varying MPC (Diff-WMPC), which leverages solver-in-the-loop sensitivities to train a lightweight policy that adapts MPC weights based on look-ahead observations.
  2. Sample-Efficient Training: Our approach achieves competitive performance with state-of-the-art methods while reducing training time from over an hour (Reinforcement Learning Weights-Varying NMPC) to under two minutes, demonstrating significant improvements in computational efficiency.
  3. Zero-Shot Transfer and Online Adaptability: The learned policy transfers zero-shot to unseen conditions and can be quickly fine-tuned online to reach environment-specific performance, enabling robust deployment across varying scenarios.

Approach

Differentiable Weights-Varying MPC Approach

Our approach uses differentiable optimization for policy learning to enable adaptive, weights-varying Model Predictive Control. A lightweight neural network policy maps look-ahead observations to the MPC cost weights q and r, which parameterize the trade-off between tracking accuracy and control effort in the optimization problem. The MPC solver then computes optimal controls u for the autonomous vehicle subject to nonlinear dynamics and constraints. By backpropagating gradients of a high-level task loss ℒ through the differentiable MPC solver, we efficiently train the policy to adapt the weights to upcoming trajectory conditions.
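To make the outer training loop concrete, the sketch below is a minimal illustration rather than the paper's implementation: it assumes a 1D double-integrator tracking problem and replaces the nonlinear MPC solver with a few unrolled gradient steps, so that PyTorch autograd can carry the task-loss gradient through the solver and into the weight policy. All names (WeightPolicy, solve_mpc, the observation layout) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of gradient-based policy learning for weights-varying MPC.
# Toy setting: 1D double integrator tracking a look-ahead reference. The inner
# "MPC" is a short unrolled gradient descent on the finite-horizon cost, so
# autograd can backpropagate the task loss through the solver into the policy.

HORIZON = 10

class WeightPolicy(nn.Module):
    """Maps a look-ahead observation to positive MPC cost weights (q, r)."""
    def __init__(self, obs_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 16), nn.Tanh(), nn.Linear(16, 2))

    def forward(self, obs):
        return F.softplus(self.net(obs)) + 1e-3  # keep weights strictly positive

def rollout_cost(u, x0, ref, q, r, dt=0.1):
    """Finite-horizon cost: q * tracking error + r * control effort."""
    pos, vel, cost = x0[0], x0[1], 0.0
    for k in range(u.shape[0]):
        vel = vel + dt * u[k]
        pos = pos + dt * vel
        cost = cost + q * (pos - ref[k]) ** 2 + r * u[k] ** 2
    return cost

def solve_mpc(x0, ref, q, r, iters=30, lr=0.2):
    """Unrolled inner solver: every iterate stays on the autograd graph."""
    u = torch.zeros(HORIZON, requires_grad=True)
    for _ in range(iters):
        cost = rollout_cost(u, x0, ref, q, r)
        (grad,) = torch.autograd.grad(cost, u, create_graph=True)
        u = u - lr * grad
    return u

policy = WeightPolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

for step in range(200):
    x0 = torch.tensor([0.0, 0.0])
    ref = torch.linspace(0.0, 1.0, HORIZON)   # upcoming reference (look-ahead)
    obs = torch.cat([x0, ref[:2]])            # toy look-ahead observation
    q, r = policy(obs)
    u = solve_mpc(x0, ref, q, r)
    # High-level task loss: pure tracking error of the resulting plan.
    task_loss = rollout_cost(u, x0, ref, q=torch.tensor(1.0), r=torch.tensor(0.0))
    optimizer.zero_grad()
    task_loss.backward()
    optimizer.step()
```

In the actual system, the unrolled toy solver is replaced by a nonlinear MPC whose sensitivities come from the solver itself, but the outer loop has the same structure: policy → weights → solve → task loss → backpropagation into the policy.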

Results

Learning Control Policies in under Two Minutes

Compared to the reinforcement learning baseline, Diff-WMPC achieves more than an order-of-magnitude improvement in both wall-clock training time and sample efficiency, while also outperforming static-weight methods, as summarized in the table below.

| Algorithm | Description | Training Time [s] | Samples |
|---|---|---|---|
| MOBO-MPC | Multi-Objective Bayesian Optimization MPC | 1,071 | 460,278 |
| Diff-MPC (Ours) | Differentiable MPC with Fixed Weights | 235 | 88,996 |
| RL-WMPC | Reinforcement Learning Weights-Varying MPC | 3,885 | 1,000,000 |
| Diff-WMPC (Ours) | Differentiable Weights-Varying MPC | 101 | 36,778 |

Competitive Performance with State-of-the-Art Algorithms

Despite the drastically reduced training time, Diff-WMPC achieves performance competitive with state-of-the-art weights-varying algorithms on the Monza racetrack. The results demonstrate that the weights-varying policy is learned effectively and offers clear advantages over static-weight approaches.

Performance comparison on the Monza racetrack: metrics for Diff-WMPC against baseline methods.

Adaptive Weight Variation Around the Track

The learned policy dynamically adjusts MPC weights based on upcoming track features, balancing tracking accuracy and control effort as conditions vary around the circuit.

Zero-Shot Transfer and Online Adaptation

To demonstrate robustness, we deploy the policy trained on Monza to an entirely unseen track (Laguna Seca) in zero-shot fashion, while additionally introducing vehicle-dynamics model mismatch. Quick online fine-tuning then enables the policy to rapidly adapt to environment-specific characteristics.
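As a rough illustration of what quick online fine-tuning amounts to, the snippet below reuses the toy definitions from the sketch in the Approach section and takes only a handful of additional gradient steps on look-ahead data from the new environment; the shifted reference is a hypothetical stand-in for Laguna Seca conditions, not real data.

```python
# Hypothetical online fine-tuning on an unseen track: keep the same
# solver-in-the-loop loss, but take only a small number of gradient steps on
# look-ahead references drawn from the new circuit (a toy stand-in here).
finetune_optimizer = torch.optim.Adam(policy.parameters(), lr=5e-3)

for step in range(20):                            # small adaptation budget
    x0 = torch.tensor([0.0, 0.0])
    ref_new = torch.linspace(0.0, 2.0, HORIZON)   # stand-in for the new track
    obs = torch.cat([x0, ref_new[:2]])
    q, r = policy(obs)
    u = solve_mpc(x0, ref_new, q, r)
    loss = rollout_cost(u, x0, ref_new, q=torch.tensor(1.0), r=torch.tensor(0.0))
    finetune_optimizer.zero_grad()
    loss.backward()
    finetune_optimizer.step()
```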

Appendix A: Implementation Details

Appendix B: Practical Stability and Feasibility

Appendix C: Robustness to Initialization