------------------------
Model Predictive Control
------------------------

Model Predictive Control (MPC) :cite:`rawlings_model_2017` is a well-known
predictive methodology in the context of optimal control. Why is it predictive?
Because at its core lies a model of the system (i.e., a mathematical description
of the dynamics of the system the MPC controller is supposed to control) that is
used to predict how the system will evolve when different control actions are
applied to it. The MPC scheme then selects the optimal actions to apply to the
system based on a user-defined cost criterion and constraints.

Most commonly, MPC controllers are used in a closed-loop setup and in a receding
horizon fashion: at each time step, the controller receives a measurement of the
current state of the system, solves the aforementioned optimization problem to
find the sequence of optimal actions, applies only the first action of this
sequence to the system, and then waits for the next measurement to repeat the
process.

Mathematically speaking, consider a discrete-time system described at time step
:math:`k` by

.. math::

   s_{k+1} = f(s_k,a_k),

where :math:`f` represents the (possibly nonlinear and/or stochastic) dynamics of
the system/environment, :math:`s_k` is its state, and :math:`a_k` the action
applied to it. A very generic MPC controller looks like

.. math::

   \begin{aligned}
     \min_{x_{0:N}, u_{0:N-1}} \quad & \lambda(x_0) + \sum_{i=0}^{N-1}{ \gamma^i \ell(x_i,u_i) } + \gamma^N T(x_N) & \\
     \textrm{s.t.} \quad & x_0 = s_k \\
     & x_{i+1} = f(x_i, u_i) \quad & i=0,\dots,N-1 \\
     & h(x_i,u_i) \leq 0 \quad & i=0,\dots,N-1 \\
     & h_f(x_N) \leq 0,
   \end{aligned}

where :math:`N` is the so-called prediction horizon, :math:`x_i` and :math:`u_i`
are the predicted states and actions at time step :math:`i` over the horizon,
:math:`\gamma \in (0, 1]` is the discount factor, :math:`\lambda` is the initial
cost function, :math:`\ell` the stage cost, :math:`T` the terminal cost, and
:math:`h` and :math:`h_f` are inequality constraints. The first action of the
optimal solution to this problem, :math:`u_0^\star`, is the one that is then
applied to the system, i.e., :math:`a_k = u_0^\star`.

Parametric MPC
==============

More often than not, the dynamics of the system are not known exactly, but are
known to belong to (or at least, to be well approximated by) a parametric family
of models :math:`f_\theta`, where :math:`\theta` denotes the parameters. Likewise,
the cost terms and constraints are usually also functions of some parameters
that the designer can tune. In this case, the MPC problem, parametrized by
:math:`\theta`, is generically described by

.. math::

   \begin{aligned}
     \min_{x_{0:N}, u_{0:N-1}} \quad & \lambda_\theta(x_0) + \sum_{i=0}^{N-1}{ \gamma^i \ell_\theta(x_i,u_i) } + \gamma^N T_\theta(x_N) & \\
     \textrm{s.t.} \quad & x_0 = s_k \\
     & x_{i+1} = f_\theta(x_i, u_i) \quad & i=0,\dots,N-1 \\
     & h_\theta(x_i,u_i) \leq 0 \quad & i=0,\dots,N-1 \\
     & h_{f,\theta}(x_N) \leq 0.
   \end{aligned}

The challenge in designing a good MPC controller is then to find the optimal
parameters (where we still have to define what "optimal" means) that make the
controller perform well in practice when deployed on the system. This is usually
done by manual tuning; however, for highly nonlinear, stochastic, or
high-dimensional systems, this can be a daunting task.
This is where reinforcement learning will come in handy.
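
To make the receding-horizon loop and the role of :math:`\theta` more tangible,
the following is a minimal sketch rather than a reference implementation. It
assumes a scalar linear model :math:`f_\theta(x,u) = a x + b u`, a quadratic
stage cost :math:`x^2 + r u^2`, a terminal cost :math:`q_f x^2`, no initial cost,
and no constraints, so that the optimization problem reduces to a linear system.
The names ``solve_mpc``, ``closed_loop``, the keys of ``theta``, and all numerical
values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical setup (not taken from the text above): scalar linear model
# x_{i+1} = a*x_i + b*u_i, stage cost x^2 + r*u^2, terminal cost q_f*x^2,
# no initial cost (lambda = 0) and no constraints h, h_f.


def solve_mpc(s, theta, N=10, gamma=0.95):
    """Return u_0^* of the unconstrained parametric MPC problem for state s."""
    a, b, r, q_f = theta["a"], theta["b"], theta["r"], theta["q_f"]
    # Eliminate the states: x_i = a^i * s + sum_{j<i} a^(i-1-j) * b * u_j.
    free = np.array([a**i * s for i in range(N + 1)])
    G = np.zeros((N + 1, N))
    for i in range(1, N + 1):
        for j in range(i):
            G[i, j] = a ** (i - 1 - j) * b
    # Discounted weights on x_0..x_N (terminal weight q_f on x_N) and on u_0..u_{N-1}.
    w_x = gamma ** np.arange(N + 1)
    w_x[N] *= q_f
    w_u = r * gamma ** np.arange(N)
    # The cost is quadratic in u; setting its gradient to zero gives H u = -g.
    H = G.T @ (w_x[:, None] * G) + np.diag(w_u)
    g = G.T @ (w_x * free)
    u = np.linalg.solve(H, -g)
    return float(u[0])


def closed_loop(theta, a_true=1.2, b_true=1.0, s0=1.0, steps=30):
    """Run the receding-horizon loop on the 'true' system; return (cost, final state)."""
    s, cost = s0, 0.0
    for _ in range(steps):
        action = solve_mpc(s, theta)      # a_k = u_0^*
        cost += s**2 + 0.1 * action**2    # performance measured with fixed weights
        s = a_true * s + b_true * action  # s_{k+1} = f(s_k, a_k), assumed deterministic
    return cost, s


theta_exact = {"a": 1.2, "b": 1.0, "r": 0.1, "q_f": 1.0}  # model matches the system
theta_wrong = {"a": 0.8, "b": 1.0, "r": 0.1, "q_f": 1.0}  # model mismatch in a
for name, theta in [("exact model", theta_exact), ("mismatched model", theta_wrong)]:
    cost, s_final = closed_loop(theta)
    print(f"{name}: closed-loop cost = {cost:.4f}, final state = {s_final:.2e}")
```

Running the loop with different values of ``theta`` shows how the closed-loop
cost depends on the parametrization. Adjusting ``theta`` to improve that cost is
exactly the tuning problem described above.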