-----------------------------------------------------
Reinforcement Learning using Model Predictive Control
-----------------------------------------------------

If you have followed along, it is not difficult to see that the parametric MPC scheme
that was discussed above is more than suitable for being used as a policy provider in
the context of RL. This concept was first introduced and properly formulated in
:cite:`gros_datadriven_2020`.


MPC as function approximation
=============================

In fact, the MPC scheme naturally acts as a policy provider, so the definition of its
policy in the context of RL follows naturally as

.. math:: a_k = u_0^\star = \pi_\theta(s_k).

What's more, as shown in :cite:`gros_datadriven_2020`, the MPC controller can also be
employed to approximate the value functions as

.. math::
   \begin{aligned}
      V_\theta(s_k) = \min_{x_{0:N}, u_{0:N-1}} \quad &
         \lambda_\theta(x_0) + \sum_{i=0}^{N-1}{ \gamma^i \ell_\theta(x_i,u_i) }
         + \gamma^N T_\theta(x_N) & \\
      \textrm{s.t.} \quad & x_0 = s_k \\
                          & x_{i+1} = f_\theta(x_i, u_i) \quad & i=0,\dots,N-1 \\
                          & h_\theta(x_i,u_i) \leq 0 \quad & i=0,\dots,N-1 \\
                          & h_{f,\theta}(x_N) \leq 0,
   \end{aligned}

and

.. math::
   \begin{aligned}
      Q_\theta(s_k,a_k) = \min_{x_{0:N}, u_{0:N-1}} \quad &
         \lambda_\theta(x_0) + \sum_{i=0}^{N-1}{ \gamma^i \ell_\theta(x_i,u_i) }
         + \gamma^N T_\theta(x_N) & \\
      \textrm{s.t.} \quad & x_0 = s_k \\
                          & a_0 = a_k \\
                          & x_{i+1} = f_\theta(x_i, u_i) \quad & i=0,\dots,N-1 \\
                          & h_\theta(x_i,u_i) \leq 0 \quad & i=0,\dots,N-1 \\
                          & h_{f,\theta}(x_N) \leq 0.
   \end{aligned}

The Bellman relationships also hold, with

.. math::
   \pi_\theta(s) = \arg\min_{a \in \mathbb{A}} Q_\theta(s,a), \quad
   V_\theta(s) = \arg\min_{a \in \mathbb{A}} Q_\theta(s,a).

However, approximating the policy and the value functions with some function
approximation scheme is only half of the story. The other half is to understand how to
adjust the parameters :math:`\theta` of such paramatric approximation in order to
improve the RL performance and minimize the incurred costs. This is where famous
gradient-based RL algorithms come into play. Nevertheless, to apply these algorithms,
the gradient of the MPC quantities w.r.t. the parameters :math:`\theta` must be
evaluated. To do so, :cite:`gros_datadriven_2020` proposed to leverage nonlinear
sensitivity analysis techniques :cite:`buskens_sensitivity_2001` that exploit the KKT
conditions to compute such sensitivities.

Q-learning with MPC
===================

Unsurprisingly, the Q-learning algorithm can be employed to tune the parameters of the
MPC controller to improve its performance. The underlying idea of Q-learning is to
approximate as best as possible the unknown optimal Q-function :math:`Q^\star` by
minimizing the Bellman residual, i.e.,

.. math::
   \min_{\theta \in \Theta} \mathbb{E} \left[
      \left\lVert Q^\star(s,a) - Q_\theta(s,a) \right\rVert^2
   \right].

This can be (approximately) achieved with the famous update rule

.. math:: \theta \leftarrow \theta + \alpha \delta_k \nabla_\theta Q_\theta(s_k,a_k),

where :math:`\alpha` is the learning rate and :math:`\delta_k` is the Temporal
Difference (TD) error at time step :math:`k`. :cite:`esfahani_approximate_2021` improves
upon the update above and embeds it with second order information, i.e., it includes not
only the gradient of the approximation, but also an estimate of its hessian.

So far, these concepts are pretty standard to Q-learning. The real question is, how can
we compute :math:`\nabla_\theta Q_\theta(s_k,a_k)` when the action value function is
provided by an MPC optimization scheme? It turns out that the answer is not very
complex, and according to :cite:`buskens_sensitivity_2001` we have that

.. math:: \nabla_\theta Q_\theta(s_k,a_k) =
      \nabla_\theta \mathcal{L}(y^\star, \theta),

where :math:`\mathcal{L}_\theta` is the Lagrangian of the MPC optimization problem
evaluated at the optimal primal-dual solution :math:`y^\star` of the NLP problem.

Examples :ref:`examples_qlearning` and :ref:`examples_qlearning_offpolicy` show how
Q-learning can be used to tune the parameters of an MPC controller for a linear task
both in a on-policy and off-policy fashion.


Deterministic Policy Gradient with MPC
======================================

What if, instead of learning the optimal Q-function from data with the hope to
inderectly recover the optimal policy from it, we directly learn the policy that
minimizes the returns directly? This is the idea behind policy gradient methods, which
attempt to estimate :math:`\nabla_\theta J(\pi_\theta)` and use it to update the
parametrization. In other words, the update rule is

.. math:: \theta \leftarrow \theta - \alpha \nabla_\theta J(\pi_\theta).

In particular, :cite:`cai_mpcbased_2021` shows how to use the Deterministic Policy
Gradient (DPG) algorithm. Estimation of the performance gradient is not trivial, but can
be achieved as

.. math::
      \nabla_\theta J(\pi_\theta) = \mathbb{E} \left[
         \nabla_\theta \pi_\theta(s) \nabla_a Q_{\pi_\theta}(s,a) |_{a=\pi_\theta(s)}
      \right].

The gradient of the policy function can be computed as

.. math::
      \nabla_\theta \pi_\theta(s) = -\nabla_\theta K(y^\star,s,\theta)
         \nabla_y K(y^\star,s,\theta)^{-1} \frac{\partial y}{\partial u_0}

where :math:`y` are all the primal-dual variables, and :math:`K` is the KKT system of
optimal conditions associated with the MPC optimization problem. The action-value
function is instead approximated with the compatible form

.. math::
      Q_{\pi_\theta} \approx Q_\omega = \Psi(s,a)^\top \omega + V_\nu(s)

with :math:`\Psi(s,a) = \nabla_\theta \pi_\theta(s) (a - \pi_\theta(s))` and
:math:`V_{\pi_\theta} \approx V_\nu = \Phi(s)^\top \nu`. :math:`\Phi(s)` is a state
feature vector. Hence, we get that

.. math::
      \nabla_a Q_{\pi_\theta}(s,a) \approx \nabla_a Q_\omega(s,a)
      = \nabla_\theta \pi_\theta(s)^\top \omega.

The unknown parameters :math:`\omega` and :math:`\nu` can be computed in a batch way via
a least-squares regression problem.

As for Q-learning, example :ref:`examples_dpg` deploys DPG for a linear task in a
on-policy fashion.