Base agent classes#

In this section we discuss the various base agent clases that are available in mpcrl in more details.

`Agent`#

As aforementioned, the basic agent is implemented in Agent. This class takes in an MPC controller, but has no feature for learning from the environment. In fact, it can evaluate the controller’s performance in the environment via the Agent.evaluate, but it does not accept a LearnableParametersDict instance at instantiation, because it cannot learn. However, Agent.__init__ does accept a fixed_parameters argument that the user can leverage to pass a dictionary of fixed, non-learned parameters of the MPC controller to the agent.

That said, the Agent class provides a method for evaluation of both the state-value function \(V_\theta(s)\) and the action-value function \(Q_\theta(s,a)\), both of which are computed with the MPC controller as the underlying function approximation with parameters \(\theta\), though only the former is used in the evaluation of the agent’s performance. See Agent.state_value and Agent.action_value respectively.

Moreover, the agent accepts various other arguments at instantiation, allowing for further customization of its behaviour

exploration: whether and how the MPC policy should be perturbed to induce exploration.
warmstart: how the MPC optimization problem should be warmstarted, if at all. This is especially useful when the optimization is highly nonlinear.
use_last_action_on_fail specifies how to handle failures of the solver, and which action to pass to the environment in such cases.
remove_bounds_on_initial_action removes automatically the bounds on the initial action in the action-value function approximation.

Overall, this class lays the foundation for the rest of the learning agents, but is not the focus of the library. Nevertheless, it comes in handy for benchmarking and testing the MPC controller in the environment prior to the application of any learning strategy, or to generate expert off-policy rollout sequences.

`LearningAgent`#

Still an abstract class, LearningAgent inherits from Agent and adds features enabling the agent to learn from interactions with the environment. However, being abstract, it cannot be instantiated directly, and it only offers the foundation for other concrete implementations. It is agnostic of the underlying learning method, so it is not meant to be restricted to RL algorithms, but any learning algorithms.

Its constructor adds the following arguments on top of those from Agent:

learnable_parameters: first and foremost, it accepts at instantiation an instance of LearnableParametersDict. This class is further discussed in Learnable parameters, but it suffices to say that it must be used to indicate the subset of the MPC parametrization that should be learned via the learning algorithm.
update_strategy: this argument defines the update strategy of the learning agent, i.e., when and with what frequencies should updates take place. This argument can be an int, let us say n, at which point the agent will update its parameters every n time steps (if time steps are the default update frequency for that class). Otherwise, for further customization, an instance of UpdateStrategy should be provided, allowing to specify both the frequency and the hooking for the updates. For more informations, see Updating.
experience: this argument represents the experience replay buffer, used to store interactions (but really, anything the user wishes to store) generated via interactions with the environment. Again, an int can be passed, which will prompt the creation of a buffer with size n which, when sampled, will return all its n items. None can also be passed, in which case a unitary length buffer is created, i.e., the agent does not store any experience aside from the very last one. Alternatively, an instance of ExperienceReplay can be passed to fine-tune the buffer size and sampling strategy. See Experience for more details.

LearningAgent provides two additional methods:

Of lesser notice, LearningAgent.store_experience can be used to store any experience item to the experience buffer, without requiring the user to do so manually.
More importantly, LearningAgent.train is the method that must be called to initiate training of the agent. It takes an environment as input and runs the training loop, which consists of calling the abstract method LearningAgent.train_one_episode for the specified number of episodes. LearningAgent.train accepts various other arguments, most notably, behaviour_policy, which allows to provide a different policy for the agent to learn from. This is the case for off-policy RL algorithms such as Q-learning.

This class also introduces the following abstract methods that must be implemented by subclasses:

LearningAgent.train_one_episode: this method is called by LearningAgent.train to train the agent for one episode. It must be implemented by subclasses, and it is where the actual learning takes place. The implementation obviously differs from algorithm to algorithm, but it must take care of calling the hooks provided by the mixin class core.callbacks.AgentCallbackMixin (core.callbacks.AgentCallbackMixin.on_mpc_failure, core.callbacks.AgentCallbackMixin.on_env_step, core.callbacks.AgentCallbackMixin.on_timestep_end) so that the other components are triggered correctly.
LearningAgent.update: this method is called by the update strategy to update the agent’s parameters. It must be implemented by subclasses, and it is where the actual learning takes place. It assumes that LearningAgent.train_one_episode is implemented correctly, triggering the callbacks as needed, and thus triggering also the updates. In case the update fails, it must return an error message to be raised as an exception or warning, and it must return None otherwise.

`RlLearningAgent` (gradient-based)#

Building from LearningAgent, this is the base abstract class for all gradient-based RL algorithms. Why gradient-based? Because at instantiation, it accepts an instance of optim.GradientBasedOptimizer, which, as the name suggests, implements an update strategy that leverages (first and possibly second order) gradient information of the specific learning loss to guide the updates to the MPC parametrization. See Optimizers for more discussion on available optimizers. How these Jacobian and Hessian matrices are computed depends on the RL algorithm at hand, if possible at all. Lastly, the class constructor also accepts a discount_factor.

`GlobOptLearningAgent` (gradient-free)#

TODO

Base agent classes#

Agent#

LearningAgent#

RlLearningAgent (gradient-based)#

GlobOptLearningAgent (gradient-free)#

This Page

`Agent`#

`LearningAgent`#

`RlLearningAgent` (gradient-based)#

`GlobOptLearningAgent` (gradient-free)#