.. currentmodule:: mpcrl ------------------ Base agent classes ------------------ In this section we discuss the various base agent clases that are available in :mod:`mpcrl` in more details. :class:`Agent` -------------- As aforementioned, the basic agent is implemented in :class:`Agent`. This class takes in an MPC controller, but has no feature for learning from the environment. In fact, it can evaluate the controller's performance in the environment via the :meth:`Agent.evaluate`, but it does not accept a :class:`LearnableParametersDict` instance at instantiation, because it cannot learn. However, :meth:`Agent.__init__` does accept a ``fixed_parameters`` argument that the user can leverage to pass a dictionary of fixed, non-learned parameters of the MPC controller to the agent. That said, the :class:`Agent` class provides a method for evaluation of both the state-value function :math:`V_\theta(s)` and the action-value function :math:`Q_\theta(s,a)`, both of which are computed with the MPC controller as the underlying function approximation with parameters :math:`\theta`, though only the former is used in the evaluation of the agent's performance. See :meth:`Agent.state_value` and :meth:`Agent.action_value` respectively. Moreover, the agent accepts various other arguments at instantiation, allowing for further customization of its behaviour * ``exploration``: whether and how the MPC policy should be perturbed to induce exploration. * ``warmstart``: how the MPC optimization problem should be warmstarted, if at all. This is especially useful when the optimization is highly nonlinear. * ``use_last_action_on_fail`` specifies how to handle failures of the solver, and which action to pass to the environment in such cases. * ``remove_bounds_on_initial_action`` removes automatically the bounds on the initial action in the action-value function approximation. Overall, this class lays the foundation for the rest of the learning agents, but is not the focus of the library. Nevertheless, it comes in handy for benchmarking and testing the MPC controller in the environment prior to the application of any learning strategy, or to generate expert off-policy rollout sequences. :class:`LearningAgent` ---------------------- Still an abstract class, :class:`LearningAgent` inherits from :class:`Agent` and adds features enabling the agent to learn from interactions with the environment. However, being abstract, it cannot be instantiated directly, and it only offers the foundation for other concrete implementations. It is agnostic of the underlying learning method, so it is not meant to be restricted to RL algorithms, but any learning algorithms. Its constructor adds the following arguments on top of those from :class:`Agent`: - ``learnable_parameters``: first and foremost, it accepts at instantiation an instance of :class:`LearnableParametersDict`. This class is further discussed in :ref:`user_guide_learnable_parameters`, but it suffices to say that it must be used to indicate the subset of the MPC parametrization that should be learned via the learning algorithm. - ``update_strategy``: this argument defines the update strategy of the learning agent, i.e., when and with what frequencies should updates take place. This argument can be an :class:`int`, let us say ``n``, at which point the agent will update its parameters every ``n`` time steps (if time steps are the default update frequency for that class). Otherwise, for further customization, an instance of :class:`UpdateStrategy` should be provided, allowing to specify both the frequency and the hooking for the updates. For more informations, see :ref:`user_guide_updating`. - ``experience``: this argument represents the experience replay buffer, used to store interactions (but really, anything the user wishes to store) generated via interactions with the environment. Again, an :class:`int` can be passed, which will prompt the creation of a buffer with size ``n`` which, when sampled, will return all its ``n`` items. ``None`` can also be passed, in which case a unitary length buffer is created, i.e., the agent does not store any experience aside from the very last one. Alternatively, an instance of :class:`ExperienceReplay` can be passed to fine-tune the buffer size and sampling strategy. See :ref:`user_guide_experience` for more details. :class:`LearningAgent` provides two additional methods: - Of lesser notice, :meth:`LearningAgent.store_experience` can be used to store any experience item to the experience buffer, without requiring the user to do so manually. - More importantly, :meth:`LearningAgent.train` is the method that must be called to initiate training of the agent. It takes an environment as input and runs the training loop, which consists of calling the abstract method :meth:`LearningAgent.train_one_episode` for the specified number of episodes. :meth:`LearningAgent.train` accepts various other arguments, most notably, ``behaviour_policy``, which allows to provide a different policy for the agent to learn from. This is the case for off-policy RL algorithms such as Q-learning. This class also introduces the following abstract methods that must be implemented by subclasses: - :meth:`LearningAgent.train_one_episode`: this method is called by :meth:`LearningAgent.train` to train the agent for one episode. It must be implemented by subclasses, and it is where the actual learning takes place. The implementation obviously differs from algorithm to algorithm, but it must take care of calling the hooks provided by the mixin class :class:`core.callbacks.AgentCallbackMixin` (:meth:`core.callbacks.AgentCallbackMixin.on_mpc_failure`, :meth:`core.callbacks.AgentCallbackMixin.on_env_step`, :meth:`core.callbacks.AgentCallbackMixin.on_timestep_end`) so that the other components are triggered correctly. - :meth:`LearningAgent.update`: this method is called by the update strategy to update the agent's parameters. It must be implemented by subclasses, and it is where the actual learning takes place. It assumes that :meth:`LearningAgent.train_one_episode` is implemented correctly, triggering the callbacks as needed, and thus triggering also the updates. In case the update fails, it must return an error message to be raised as an exception or warning, and it must return ``None`` otherwise. :class:`RlLearningAgent` (gradient-based) ----------------------------------------- Building from :class:`LearningAgent`, this is the base abstract class for all gradient-based RL algorithms. Why gradient-based? Because at instantiation, it accepts an instance of :class:`optim.GradientBasedOptimizer`, which, as the name suggests, implements an update strategy that leverages (first and possibly second order) gradient information of the specific learning loss to guide the updates to the MPC parametrization. See :ref:`user_guide_optim` for more discussion on available optimizers. How these Jacobian and Hessian matrices are computed depends on the RL algorithm at hand, if possible at all. Lastly, the class constructor also accepts a ``discount_factor``. :class:`GlobOptLearningAgent` (gradient-free) --------------------------------------------- TODO