Before we discuss the inner workings of the library, an introduction to the theory
behind the MPC-based RL framework is necessary. This section will provide a brief
overview of the main concepts and ideas that are at its core, and the mathmeticas behind
them.

----------------------
Reinforcement Learning
----------------------

Reinforcement Learning (RL) :cite:`sutton_reinforcement_2018` is a subfield of Machine
Learning that deals with the problem of learning how to make decisions in an environment
in order to maximize some rewards or, as in the context of control, to minimize some
costs. It is heavily related to optimal control theory and Dynamic Programming (DP), so
much so that sometimes it is difficult to distinguish between them. In fact, all of
these fields are concerned with *learning a policy* that will dictate the actions to
take in order to achieve some goal in the given system/environment.

Consider a Markov Decision Process (MDP) defined by state :math:`s`, action :math:`a`,
and a state transition :math:`s \xrightarrow{a} s_+` with the underlying conditional
probability density

.. math::
   \mathbb{P}\left[s_+ | s, a\right] :
   \mathbb{S} \times \mathbb{S} \times \mathbb{A} \rightarrow \left[0, 1\right]

where :math:`\mathbb{S}` and :math:`\mathbb{A}` are the state and action space,
respectively. Such MDP is very generic and can represent, e.g., the model of a
discrete-time system. The performance of a given deterministic policy
:math:`\pi_\theta : \mathbb{S} \rightarrow \mathbb{A}`, parametrized in
:math:`\theta \in \Theta`,
is defined as

.. math::
   J(\pi_\theta) := \mathbb{E} \left[
      \sum_{k=0}^{\infty}{\gamma^k L \bigl(s_k, \pi_\theta(s_k)\bigr)}
   \right]

where :math:`\gamma \in (0,1]` is the discount factor, and
:math:`L : \mathbb{S} \times \mathbb{A} \rightarrow \mathbb{R}` is the stage-cost
function. The goal of RL is then to find the optimal policy

.. math:: \pi_\theta^\star = \arg\min_{\theta \in \Theta} J(\pi_\theta)

by learning from the interaction with the environment. In other words, the algorithm is
only allowed to observed, for each state and action pair :math:`s_k,a_k`, the immediate
cost realization :math:`L(s_k,a_k)` and the next state :math:`s_{k+1}`. Aside the
policy, other important quantities to introduce (here defined in their parametric form,
but the general case holds) are the state value function :math:`V_\theta(s)` and the
state-action value function :math:`Q_\theta(s,a)`. The former is defined as

.. math::
   V_\theta(s) := \mathbb{E} \left[
      \sum_{k=0}^{\infty}{\gamma^k L \bigl(s_k, \pi_\theta(s_k)\bigr) \ | \ s_0 = s}
   \right]

and is used to evaluate the performance of the policy given the current state (which is
imposed as initial), while the latter is defined as

.. math::
   Q_\theta(s,a) := \mathbb{E} \left[
      \sum_{k=0}^{\infty}{\gamma^k L \bigl(s_k, a_k\bigr) \ | \ s_0 = s, a_0 = a}
   \right]

and evaluates the performance of the policy given the current state as well as the
first action. Since it is in general impossible to find and characterise the true
unknown optimal policy and value functions, function approximation techniques (such as
neural networks and, as in this library, MPC) have been employed as a powerful
alternative for tackling this problem.