mpcrl.GlobOptLearningAgent#

class mpcrl.GlobOptLearningAgent(optimizer, **kwargs)[source]#

Bases: LearningAgent[SymType, None], ABC

Class for learning agents that employ gradient-free Global Optimization strategies (e.g., Bayesian Optimization) to learn/improve the MPC policy.

Unlike RlLearningAgent, this class does not require a discount factor. Instead, it requires an instance of optim.GradientFreeOptimizer that adheres to the ask-tell interface, i.e., one that implements the optim.GradientFreeOptimizer.ask and optim.GradientFreeOptimizer.tell methods.
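As a rough illustration of the ask-tell pattern mentioned above, the following sketch uses a hypothetical random-search optimizer. The class RandomSearchOptimizer, its constructor arguments, the ask/tell signatures, and the objective function are all assumptions made for this example; they are not part of mpcrl, and the actual optim.GradientFreeOptimizer signatures may differ.

```python
import numpy as np


class RandomSearchOptimizer:
    """Hypothetical ask-tell optimizer (not part of mpcrl): uniform random search."""

    def __init__(self, lb, ub, seed=None):
        self.lb = np.asarray(lb, dtype=float)
        self.ub = np.asarray(ub, dtype=float)
        self.rng = np.random.default_rng(seed)
        self.best_x = None
        self.best_y = np.inf

    def ask(self):
        # Suggest a new set of parameters to try out.
        return self.rng.uniform(self.lb, self.ub)

    def tell(self, x, y):
        # Record the objective value observed for the suggested parameters.
        if y < self.best_y:
            self.best_x, self.best_y = np.array(x, copy=True), float(y)


def objective(theta):
    # Stand-in for the cost of one episode run with parameters theta.
    return float(np.sum((theta - 0.5) ** 2))


opt = RandomSearchOptimizer(lb=[-1.0, -1.0], ub=[1.0, 1.0], seed=0)
for episode in range(200):
    theta = opt.ask()        # ask: get candidate parameters
    cost = objective(theta)  # evaluate them (here, a toy function)
    opt.tell(theta, cost)    # tell: report the observed objective value
```

In this sketch, each loop iteration plays the role of one episode: the parameters are suggested before the episode and the observed objective is reported once it ends.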

Parameters:
optimizerGradientFreeOptimizer

An instance of optim.GradientFreeOptimizer that is asked for a suggested set of parameters to try out and is later told the value of the objective function for that set of parameters.

kwargs

Additional arguments to be passed to LearningAgent.

Note: the following kwargs are not yet supported
  • "experience": usually, GO strategies do not require experience replay

  • "update_strategy": updates are fixed at the end of each episode.

Methods

action_value(state, action[, vals0])

Computes the MPC-based action value function approximation \(Q_\theta(s,a)\).

evaluate(*args, **kwargs)

Evaluates the agent in a given environment.

is_wrapped(*_, **__)

Gets whether the agent instance is wrapped or not by the wrapper type.

on_env_step(env, episode, timestep)

Callback called after each call to gymnasium.Env.step.

on_episode_end(env, episode, rewards)

Callback called at the end of each episode in the training or evaluation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).

on_episode_start(env, episode, state)

Callback called at the beginning of each episode in the training or validation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).

on_mpc_failure(episode, timestep, status, raises)

Callback in case of failure of the MPC solver.

on_timestep_end(env, episode, timestep)

Callback called at the end of each time iteration.

on_training_end(env, returns)

Callback called at the end of the training process.

on_training_start(env)

Callback called at the beginning of the training process.

on_update()

Callback called after each mpcrl.LearningAgent.update.

on_update_failure(episode, timestep, ...)

Callback in case of update failure.

on_validation_end(env, returns)

Callback called at the end of the validation process (see mpcrl.Agent.evaluate).

on_validation_start(env)

Callback called at the beginning of the validation process (see mpcrl.Agent.evaluate).

reset([seed])

Resets the agent.

state_value(state[, deterministic, vals0, ...])

Computes the MPC-based state value function approximation \(V_\theta(s)\).

store_experience(item)

Stores the given item in the agent's experience for later usage in updating the parametrization.

train(*args, **kwargs)

On-policy training of the agent on an environment.

train_offpolicy(episode_rollouts[, seed, raises])

Off-policy training of the agent on an environment.

train_one_episode(env, episode, init_state)

On-policy training of the agent on an environment for one single episode.

train_one_rollout(rollout, episode[, raises])

Train the agent in an off-policy manner on the given rollout.

update()

Updates the learnable parameters (usually referred to as \(\theta\)) of the MPC according to the agent's learning algorithm.

Attributes

Q

Gets the MPC function approximation of the action value function \(Q_\theta(s,a)\).

V

Gets the MPC function approximation of the state value function \(V_\theta(s)\).

cost_perturbation_method

The name of the method from numpy.random.Generator to be used to generate perturbations of the cost function in the state value function \(V_\theta(s)\).

cost_perturbation_parameter

The name of the parameter to be added to the original mpc problem for perturbing the state value function \(V_\theta(s)\).

experience

Gets the experience replay memory of the agent.

exploration

Gets the exploration strategy used within this agent to perturb the policy provided by the MPC controller via \(V_\theta(s)\).

fixed_parameters

Gets the fixed parameters of the MPC controller, i.e., the non-learnable ones.

init_action_constraint

Name of the equality constraint to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).

init_action_parameter

Name of the parameter to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).

learnable_parameters

Gets the parameters of the MPC that can be learnt by the agent.

unwrapped

Gets the underlying wrapped instance of an agent.

update_strategy

Gets the update strategy of the agent.

warmstart

Gets the warmstart strategy used within this agent.

property Q: Mpc[SymType]#

Gets the MPC function approximation of the action value function \(Q_\theta(s,a)\).

property V: Mpc[SymType]#

Gets the MPC function approximation of the state value function \(V_\theta(s)\).

action_value(state, action, vals0=None, **kwargs)#

Computes the MPC-based action value function approximation \(Q_\theta(s,a)\).

Parameters:
statearray_like or dict of (str, array_like)

The initial state at which to evaluate the action value function, i.e., \(s\) in \(Q_\theta(s,a)\). It can be either a 1D array containing the values of all initial states of the MPC, concatenated, or a dict whose keys are the names of the states and whose values are their numerical initial values.

actionarray_like or dict of (str, array_like)

Same as for state, but for the action, i.e., the initial action at which to evaluate the MPC action value function (\(a\) in \(Q_\theta(s,a)\)).

vals0dict of (str, array_like) or iterable of, optional

A dict (or an iterable of dicts, in case a csnlp.multistart.MultistartNlp is used) whose keys are the names of the MPC variables and whose values are the numerical initial values of each variable. Use this argument to warmstart the MPC. If None, and a previous solution (possibly, successful) is available, the MPC solver is automatically warmstarted. If an iterable is passed instead, the warmstarting strategy is bypassed.

Returns:
Solution

The solution of the MPC approximation \(Q_\theta(s,a)\) at the given state and action pair.

Return type:

Solution[TypeVar(SymType, SX, MX)]
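The two accepted forms of state (and, likewise, of action) can be sketched as follows. The state names "x" and "v" and their values are hypothetical, and the concatenation order shown (the dict's insertion order) is an assumption of this example, not a documented guarantee.

```python
import numpy as np

# Dict form: one entry per named MPC state (names are hypothetical).
state_dict = {"x": np.array([1.0, 2.0]), "v": np.array([0.5])}

# Flat form: a single 1D array with all state values concatenated.
# Assumption: the order matches the order of the dict entries above.
state_flat = np.concatenate([np.ravel(v) for v in state_dict.values()])
```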

cost_perturbation_method = 'normal'#

The name of the method from numpy.random.Generator to be used to generate perturbations of the cost function in the state value function \(V_\theta(s)\).
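A minimal sketch of how a method name such as 'normal' can be resolved on a numpy.random.Generator and used to draw a perturbation. The seed and the perturbation size are arbitrary choices for this example; how the agent uses the sample internally is not shown here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Look up the sampling method by name, as the attribute value suggests.
method_name = "normal"
sampler = getattr(rng, method_name)

# Draw a perturbation vector of (arbitrary) size 3.
perturbation = sampler(size=3)
```

Any other name of a Generator sampling method (e.g., "uniform") can be resolved in the same way.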

cost_perturbation_parameter = 'cost_perturbation'#

The name of the parameter to be added to the original mpc problem for perturbing the state value function \(V_\theta(s)\).

evaluate(*args, **kwargs)#

Evaluates the agent in a given environment.

Parameters:
envEnv[ObsType, ActType]

The gym environment in which to evaluate the agent.

episodesint

Number of evaluation episodes.

deterministicbool, optional

Whether the agent should act deterministically, i.e., applying no exploration to the policy provided by the MPC. By default, True.

seedNone, int, array_like of ints, SeedSequence, BitGenerator, Generator

Seed for the agent’s and env’s random number generator. By default None.

raisesbool, optional

If True, when any of the MPC solver runs fail, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.

env_reset_optionsdict, optional

Additional information to specify how the environment is reset at each evaluation episode (optional, depending on the specific environment).

Returns:
array of doubles

The cumulative returns (one return per evaluation episode).

Raises:
MpcSolverError or MpcSolverWarning

Raises the error or the warning (depending on raises) if the MPC optimization solver fails.

Return type:

ndarray[tuple[Any, ...], dtype[floating]]

Notes

After solving \(V_\theta(s)\) for the current env’s state s, the action is passed to the environment as the concatenation of the first optimal action variables of the MPC (see csnlp.Mpc.actions).
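Since one cumulative return is produced per evaluation episode, the result can be summarized with standard NumPy reductions. In the sketch below, the values in returns are made-up placeholders standing in for the array returned by evaluate.

```python
import numpy as np

# Placeholder values, one cumulative return per evaluation episode.
returns = np.array([12.5, 10.0, 11.5, 14.0])

mean_return = returns.mean()      # average performance over the episodes
std_return = returns.std(ddof=1)  # sample standard deviation across episodes
```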

property experience: ExperienceReplay[ExpType]#

Gets the experience replay memory of the agent.

property exploration: ExplorationStrategy#

Gets the exploration strategy used within this agent to perturb the policy provided by the MPC controller via \(V_\theta(s)\).

property fixed_parameters: None | dict[str, ArrayLike] | Collection[dict[str, ArrayLike]]#

Gets the fixed parameters of the MPC controller, i.e., the non-learnable ones.

Returns:
None or dict of (str, array_like), or collection of
The returned object can be one of
  • None, if the MPC controller has no fixed parameters

  • a dict whose keys are the names of the MPC parameters and the values are their corresponding values, when the MPC controller wraps an instance of csnlp.Nlp, or it wraps an instance of csnlp.multistart.MultistartNlp but the same set of parameters is meant to be used for all scenarios

  • a collection of such dictionaries, when the MPC controller wraps an instance of csnlp.multistart.MultistartNlp and different parameters are meant to be used for each scenario.

init_action_constraint = 'a_init'#

Name of the equality constraint to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).

init_action_parameter = 'a_init'#

Name of the parameter to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).

is_wrapped(*_, **__)#

Gets whether the agent instance is wrapped or not by the wrapper type.

Returns:
bool

A flag indicating whether the agent is wrapped or not.

Return type:

bool

property learnable_parameters: LearnableParametersDict#

Gets the parameters of the MPC that can be learnt by the agent.

on_env_step(env, episode, timestep)#

Callback called after each call to gymnasium.Env.step.

Parameters:
envgym env

A gym environment on which the agent is being trained.

episodeint

Number of the training episode.

timestepint

Time instant of the current training episode.

Return type:

None

on_episode_end(env, episode, rewards)#

Callback called at the end of each episode in the training or evaluation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).

Parameters:
envgym env

A gym environment on which the agent is being trained.

episodeint

Number of the training episode.

rewardsfloat

Cumulative rewards for this episode.

Return type:

None

on_episode_start(env, episode, state)#

Callback called at the beginning of each episode in the training or validation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).

Parameters:
envgym env

A gym environment on which the agent is being trained.

episodeint

Number of the training episode.

stateObsType

Starting state for this episode.

Return type:

None

on_mpc_failure(episode, timestep, status, raises)#

Callback in case of failure of the MPC solver.

Parameters:
episodeint

Number of the episode when the failure happened.

timestepint or None

Timestep of the current episode when the failure happened. Can be None, in case the error occurs inter-episodically or no notion of time step is available.

statusstr

Status of the solver that failed.

raisesbool

Whether the failure should be raised as exception (True) or as a warning (False).

Return type:

None
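The effect of the raises flag can be sketched with a standalone function. The names report_failure and SolverFailure, as well as the message format, are hypothetical and only illustrate the error-versus-warning behaviour described above; mpcrl uses its own error and warning classes.

```python
import warnings


class SolverFailure(RuntimeError):
    """Hypothetical exception type used only in this sketch."""


def report_failure(episode, timestep, status, raises):
    # timestep may be None when no notion of time step is available.
    where = f"episode {episode}" + ("" if timestep is None else f", timestep {timestep}")
    msg = f"MPC solver failed at {where}: {status}"
    if raises:
        raise SolverFailure(msg)        # failure escalated to an exception
    warnings.warn(msg, RuntimeWarning)  # failure reported as a warning only


# With raises=False, execution continues and a warning is emitted instead.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_failure(episode=3, timestep=None, status="Infeasible", raises=False)
```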

on_timestep_end(env, episode, timestep)#

Callback called at the end of each time iteration. It is called with the same frequency as on_env_step, but with different timing.

Parameters:
envgym env

A gym environment on which the agent is being trained.

episodeint

Number of the training episode.

timestepint

Time instant of the current training episode.

Return type:

None

on_training_end(env, returns)#

Callback called at the end of the training process.

Parameters:
envgym env

A gym environment on which the agent has been trained.

returnsarray of double

Each episode’s cumulative rewards.

Return type:

None

on_training_start(env)#

Callback called at the beginning of the training process.

Parameters:
envgym env

A gym environment on which the agent is being trained.

Return type:

None

on_update()#

Callback called after each mpcrl.LearningAgent.update.

This callback is especially useful for, e.g., decaying exploration probabilities or learning rates.

Return type:

None
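A minimal sketch of the kind of decay such a callback could perform. The class DecayingEpsilon and its attributes are hypothetical and do not depend on mpcrl; only the idea of shrinking a quantity after every update is taken from the description above.

```python
class DecayingEpsilon:
    """Hypothetical holder of an exploration strength that decays per update."""

    def __init__(self, epsilon, decay_rate):
        self.epsilon = epsilon
        self.decay_rate = decay_rate

    def on_update(self):
        # Called after each update: shrink the exploration strength.
        self.epsilon *= self.decay_rate


explorer = DecayingEpsilon(epsilon=1.0, decay_rate=0.5)
for _ in range(3):  # three consecutive updates
    explorer.on_update()
```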

on_update_failure(episode, timestep, errormsg, raises)#

Callback in case of update failure.

Parameters:
episodeint

Number of the episode when the failure happened.

timestepint or None

Timestep of the current episode when the failure happened. Can be None in case the update occurs inter-episodically or no notion of time step is available.

errormsgstr

Error message of the update failure.

raisesbool

Whether the failure should be raised as exception (True) or as a warning (False).

Return type:

None

on_validation_end(env, returns)#

Callback called at the end of the validation process (see mpcrl.Agent.evaluate).

Parameters:
envgym env

A gym environment on which the agent has been validated.

returnsarray of double

Each episode’s cumulative rewards.

Return type:

None

on_validation_start(env)#

Callback called at the beginning of the validation process (see mpcrl.Agent.evaluate).

Parameters:
envgym env

A gym environment on which the agent is being validated.

Return type:

None

reset(seed=None)#

Resets the agent. This includes resetting the warmstart strategy, the exploration strategy, and some internal variables of the agent.

Parameters:
seedRngType, optional

The seed to reset the numpy.random.Generator instances. By default, None.

Return type:

None

state_value(state, deterministic=False, vals0=None, action_space=None, **kwargs)#

Computes the MPC-based state value function approximation \(V_\theta(s)\).

Parameters:
statearray_like or dict of (str, array_like)

The initial state at which to evaluate the MPC approximation of the state value function, i.e., \(s\) in \(V_\theta(s)\). It can be either a 1D array containing the values of all initial states of the MPC, concatenated, or a dict whose keys are the names of the states and whose values are their numerical initial values.

deterministicbool, optional

If False, the MPC controller is perturbed according to the exploration strategy to induce some exploratory behaviour. Otherwise, no perturbation is performed. By default, False.

vals0dict of (str, array_like) or iterable of, optional

A dict (or an iterable of dicts, in case a csnlp.multistart.MultistartNlp is used) whose keys are the names of the MPC variables and whose values are the numerical initial values of each variable. Use this argument to warmstart the MPC. If None, and a previous solution (possibly, successful) is available, the MPC solver is automatically warmstarted. If an iterable is passed instead, the warmstarting strategy is bypassed.

action_spacegymnasium.spaces.Box, optional

The action space of the environment the agent is being evaluated/trained on. If not None, it is used to clip the action back into the action space whenever an additive exploration perturbation is added to it.

Returns:
casadi.DM

The first optimal action according to the solution of the state value function, possibly perturbed by exploration noise, i.e.,

\[u_0^\star = \arg\min_{u} V_\theta(s)\]
Solution

The solution of the MPC approximation \(V_\theta(s)\) at the given state.

Return type:

tuple[DM, Solution[TypeVar(SymType, SX, MX)]]
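The clipping of an additively perturbed action back into a box-shaped action space can be sketched with plain NumPy. The bounds, the nominal action, and the noise scale below are made-up values; the bounds stand in for the low and high attributes of a gymnasium.spaces.Box, and the agent's internal implementation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical box bounds (stand-ins for Box.low and Box.high).
low = np.array([-1.0, -1.0])
high = np.array([1.0, 1.0])

# Nominal first optimal action from the MPC (placeholder values).
u_opt = np.array([0.95, -0.2])

# Additive exploration noise, then clipping back into the action space.
noise = rng.normal(scale=0.2, size=u_opt.shape)
u_applied = np.clip(u_opt + noise, low, high)
```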

store_experience(item)#

Stores the given item in the agent’s experience for later usage in updating the parametrization.

Parameters:
itemExpType

Item to be stored in memory.

Return type:

None

train(*args, **kwargs)[source]#

On-policy training of the agent on an environment.

Parameters:
envEnv[ObsType, ActType]

The gym environment on which to train the agent.

episodesint

Number of training episodes.

seedNone, int, array_like of ints, SeedSequence, BitGenerator, Generator

Seed for the agent’s and env’s random number generator. By default None.

raisesbool, optional

If True, when any of the MPC solver runs fail, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.

env_reset_optionsdict, optional

Additional information to specify how the environment is reset at each training episode (optional, depending on the specific environment).

Returns:
array of doubles

The cumulative returns for each training episode.

Raises:
MpcSolverError or MpcSolverWarning

Raises the error or the warning (depending on raises) if any of the MPC solvers fail.

UpdateError or UpdateWarning

Raises the error or the warning (depending on raises) if the update fails.

Return type:

Any

train_offpolicy(episode_rollouts, seed=None, raises=True)#

Off-policy training of the agent on an environment.

Parameters:
episode_rolloutsiterable of iterables of any

An iterable of episodic rollouts generated in an off-policy fashion. Each rollout is itself a sequence of transitions, e.g., SARSA tuples. In other words, episode_rollouts is a sequence of sequences of tuples. However, in general, the nature of the rollouts and of the tuples can differ widely from one learning algorithm to another.

seedNone, int, array_like of ints, SeedSequence, BitGenerator, Generator

Seed for the agent’s random number generator. By default None.

raisesbool, optional

If True, when any of the MPC solver runs fail, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.

Raises:
MpcSolverError or MpcSolverWarning

Raises the error or the warning (depending on raises) if any of the MPC solvers fail.

UpdateError or UpdateWarning

Raises the error or the warning (depending on raises) if the update fails.

Return type:

None
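The nested structure expected for episode_rollouts can be sketched as a list of lists of tuples. The transitions below are made-up scalar placeholders laid out as SARSA tuples (state, action, reward, next state, next action); as noted above, the actual tuple contents depend on the learning algorithm.

```python
# Each rollout is a sequence of transitions (placeholder SARSA tuples).
rollout_a = [
    (0.0, 1.0, -0.5, 0.1, 0.8),
    (0.1, 0.8, -0.4, 0.2, 0.6),
]
rollout_b = [
    (1.0, -1.0, -2.0, 0.7, -0.5),
]

# episode_rollouts is a sequence of such rollouts, one per episode.
episode_rollouts = [rollout_a, rollout_b]

n_transitions = 0
for episode, rollout in enumerate(episode_rollouts):
    for state, action, reward, next_state, next_action in rollout:
        n_transitions += 1  # a learning step would consume the transition here
```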

train_one_episode(env, episode, init_state, raises=True)[source]#

On-policy training of the agent on an environment for one single episode.

Parameters:
envEnv[ObsType, ActType]

The gym environment on which to train the agent.

episodeint

Number of the current training episode.

init_stateobservation type

Initial state/observation of the environment.

raisesbool, optional

If True, when any of the MPC solver runs fail, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.

Returns:
float

The cumulative rewards for this training episode.

Raises:
MpcSolverError or MpcSolverWarning

Raises the error or the warning (depending on raises) if any of the MPC solvers fail.

UpdateError or UpdateWarning

Raises the error or the warning (depending on raises) if the update fails.

Return type:

float

train_one_rollout(rollout, episode, raises=True)#

Train the agent in an off-policy manner on the given rollout.

Parameters:
rolloutiterable of any

Rollout, i.e., a sequence of transitions generated off-policy, e.g., SARSA tuples. However, in general, these tuples can be of a different nature, depending on the specific learning algorithm.

episodeint

Number of the training episode.

raisesbool, optional

If True, when any of the MPC solver runs fail, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.

Raises:
MpcSolverError or MpcSolverWarning

Raises the error or the warning (depending on raises) if any of the MPC solvers fail.

UpdateError or UpdateWarning

Raises the error or the warning (depending on raises) if the update fails.

Return type:

None

property unwrapped: Agent#

Gets the underlying wrapped instance of an agent. In this case, since the agent is not wrapped at all, returns itself.

update()[source]#

Updates the learnable parameters (usually referred to as \(\theta\)) of the MPC according to the agent’s learning algorithm.

Returns:
errormsgstr or None

In case the update fails, an error message is returned to be raised as error or warning; otherwise, None is returned.

Return type:

Optional[str]

property update_strategy: UpdateStrategy#

Gets the update strategy of the agent.

property warmstart: WarmStartStrategy#

Gets the warmstart strategy used within this agent. This strategy is used to generate the initial guess for the solver to optimize the MPC’s NLP.

Examples using mpcrl.GlobOptLearningAgent#

Bayesian Optimization for MPC Data-driven Tuning
