mpcrl.Agent#

class mpcrl.Agent(mpc, fixed_parameters=None, exploration=None, warmstart='last-successful', use_last_action_on_fail=False, remove_bounds_on_initial_action=False, name=None)[source]#

Bases: Named, AgentCallbackMixin, Generic[SymType]

Simple MPC-based agent with a fixed (i.e., non-learnable) MPC controller.

In this agent, the MPC controller parametrized in \(\theta\) is used as policy provider, as well as to provide the value function \(V_\theta(s)\) and quality function \(Q_\theta(s,a)\), where \(s\) and \(a\) are the state of the environment and a generic action, respectively. Since it only supports a fixed parametrization, this class does not use any RL or other learning method to improve its MPC policy.
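As a rough illustration of how the two approximations relate, one common way of writing them in the MPC-based RL literature is sketched below. The stage cost \(\ell_\theta\), terminal cost \(V^f_\theta\), dynamics \(f_\theta\), constraints \(h_\theta\) and horizon \(N\) are generic placeholders assumed here for illustration; they are not names defined by mpcrl, and any discounting is omitted.

\[
V_\theta(s) = \min_{x_{0:N},\, u_{0:N-1}} \; V^f_\theta(x_N) + \sum_{k=0}^{N-1} \ell_\theta(x_k, u_k)
\quad \text{s.t.} \quad x_0 = s, \;\; x_{k+1} = f_\theta(x_k, u_k), \;\; h_\theta(x_k, u_k) \le 0
\]

\[
Q_\theta(s, a) = \min_{x_{0:N},\, u_{0:N-1}} \; V^f_\theta(x_N) + \sum_{k=0}^{N-1} \ell_\theta(x_k, u_k)
\quad \text{s.t.} \quad x_0 = s, \;\; u_0 = a, \;\; x_{k+1} = f_\theta(x_k, u_k), \;\; h_\theta(x_k, u_k) \le 0
\]

Under this construction, the only difference between the two problems is the extra equality constraint \(u_0 = a\), so that \(V_\theta(s) = \min_a Q_\theta(s,a)\) and the policy is the first optimal action \(u_0^\star\) of the \(V_\theta\) problem.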

Parameters:

mpc (csnlp.wrappers.Mpc or tuple of csnlp.wrappers.Mpc)

The MPC controller used as policy provider by this agent. If a tuple, the first entry is used to create the approximation of the state value function \(V_\theta(s)\) and the second for that of the action value function \(Q_\theta(s,a)\). Otherwise, the instance is modified in place to create both approximations, so it is recommended not to modify it further after initialization of the agent. Moreover, some parameter and constraint names need to be created, so an error is raised if these names are already in use in the mpc.

fixed_parameters (dict of (str, array_like) or collection of, optional)

A dict (or collection of dict, in case of the mpc wrapping an underlying csnlp.multistart.MultistartNlp instance) whose keys are the names of the MPC parameters and the values are their corresponding values. Use this to specify fixed parameters, that is, non-learnable. If None, then no fixed parameter is assumed.

exploration (core.exploration.ExplorationStrategy, optional)

Exploration strategy for inducing exploration in the online MPC policy. By default None, in which case core.exploration.NoExploration is used.

warmstart ("last" or "last-successful" or WarmStartStrategy, optional)

The warmstart strategy for the MPC’s NLP. If "last-successful", the last successful solution is used to warmstart the solver for the next iteration. If "last", the last solution is used, regardless of success or failure. Furthermore, an instance of core.warmstart.WarmStartStrategy can be passed to specify a strategy for generating multiple warmstart points for the MPC’s NLP instance. This is useful to generate multiple initial conditions for highly non-convex, nonlinear problems. This feature can only be used with an MPC that has an underlying multistart NLP problem (see csnlp.multistart).
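The difference between the two string options can be sketched in a few lines. The class below is a hypothetical, standalone illustration of the bookkeeping described above (it is not mpcrl's implementation, and `WarmstartMemory` and its methods are made-up names): it only decides which past solution is kept as the next initial guess.

```python
class WarmstartMemory:
    """Hypothetical helper that stores the initial guess for the next solve."""

    def __init__(self, strategy="last-successful"):
        if strategy not in ("last", "last-successful"):
            raise ValueError(f"unknown warmstart strategy: {strategy!r}")
        self.strategy = strategy
        self.guess = None  # no previous solution available yet

    def update(self, vals, success):
        """Record the variable values of the most recent solver run."""
        # "last" always overwrites; "last-successful" only on success.
        if self.strategy == "last" or success:
            self.guess = vals


# Two solver runs: the first succeeds, the second fails.
keep_successful = WarmstartMemory("last-successful")
keep_last = WarmstartMemory("last")
for memory in (keep_successful, keep_last):
    memory.update({"u": 1.0}, success=True)
    memory.update({"u": 9.0}, success=False)
```

After the failed second run, `keep_successful.guess` still holds the values of the first run, whereas `keep_last.guess` holds those of the failed run.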

use_last_action_on_fail (bool, optional)

In case the MPC solver fails:

if False, the action from the solver’s last iteration is returned anyway (though suboptimal);
if True, the action from the last successful call to the MPC is returned instead (if the MPC has been solved successfully at least once).

By default, False.
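The fallback rule above can be summarised by the following minimal sketch. The function name and its arguments are hypothetical and chosen for illustration only; the actual logic inside mpcrl may differ in its details.

```python
def choose_action(solver_succeeded, current_action, last_successful_action,
                  use_last_action_on_fail=False):
    """Pick the action to apply after an MPC solve (illustrative only)."""
    if solver_succeeded:
        return current_action
    # Solver failed: fall back only if requested and a fallback exists.
    if use_last_action_on_fail and last_successful_action is not None:
        return last_successful_action
    # Otherwise return the (possibly suboptimal) action of the failed run.
    return current_action


# Failed solve with a previous successful action available.
with_fallback = choose_action(False, 0.7, 0.2, use_last_action_on_fail=True)
without_fallback = choose_action(False, 0.7, 0.2, use_last_action_on_fail=False)
# Failed solve before any successful one: nothing to fall back to.
no_history = choose_action(False, 0.7, None, use_last_action_on_fail=True)
```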

remove_bounds_on_initial_action (bool, optional)

When True, the upper and lower bounds on the initial action are removed in the action-value function approximator \(Q_\theta(s,a)\), since the first action is already constrained to be equal to the provided action \(a\). This is useful to avoid violations of the linear independence constraint qualification (LICQ) in the NLP. However, it can lead to numerical problems. By default, False.

name (str, optional)

Name of the agent. If None, one is automatically created from a counter of the class’s instances.

Raises:

ValueError

Raised if

the given mpc has no control action as optimization variable;
the reserved parameter and constraint names are already in use (see cost_perturbation_parameter, init_action_parameter and init_action_constraint);
a multistart mpc is given, but the warmstart strategy warmstart asks for an incompatible number of starting points to be generated;
a warmstart strategy warmstart or a collection of fixed parameters is given, but the mpc does not have an underlying multistart NLP problem, so it cannot handle multiple starting points or parameters (see csnlp.Nlp.is_multi and csnlp.multistart.MultistartNlp.is_multi).

Methods

`action_value`(state, action[, vals0])	Computes the MPC-based action value function approximation \(Q_\theta(s,a)\).
`evaluate`(env, episodes[, deterministic, ...])	Evaluates the agent in a given environment.
`is_wrapped`(_, *__)	Gets whether the agent instance is wrapped or not by the wrapper type.
`on_env_step`(env, episode, timestep)	Callback called after each call to `gymnasium.Env.step`.
`on_episode_end`(env, episode, rewards)	Callback called at the end of each episode in the training or evaluation process (see `mpcrl.Agent.evaluate`, `mpcrl.LearningAgent.train` and `mpcrl.LearningAgent.train_offpolicy`).
`on_episode_start`(env, episode, state)	Callback called at the beginning of each episode in the training or validation process (see `mpcrl.Agent.evaluate`, `mpcrl.LearningAgent.train` and `mpcrl.LearningAgent.train_offpolicy`).
`on_mpc_failure`(episode, timestep, status, raises)	Callback in case of failure of the MPC solver.
`on_timestep_end`(env, episode, timestep)	Callback called at the end of each time iteration.
`on_validation_end`(env, returns)	Callback called at the end of the validation process (see `mpcrl.Agent.evaluate`).
`on_validation_start`(env)	Callback called at the beginning of the validation process (see `mpcrl.Agent.evaluate`)
`reset`([seed])	Resets the agent.
`state_value`(state[, deterministic, vals0, ...])	Computes the MPC-based state value function approximation \(V_\theta(s)\).

Attributes

`Q`	Gets the MPC function approximation of the action value function \(Q_\theta(s,a)\).
`V`	Gets the MPC function approximation of the state value function \(V_\theta(s)\).
`cost_perturbation_method`	The name of the method from `numpy.random.Generator` to be used to generate perturbations of the cost function in the state value function \(V_\theta(s)\).
`cost_perturbation_parameter`	The name of the parameter to be added to the original `mpc` problem for perturbing the state value function \(V_\theta(s)\).
`exploration`	Gets the exploration strategy used within this agent to perturb the policy provided by the MPC controller via \(V_\theta(s)\).
`fixed_parameters`	Gets the fixed parameters of the MPC controller, i.e., the non-learnable ones.
`init_action_constraint`	Name of the equality constraint to be added to the original `mpc` problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).
`init_action_parameter`	Name of the parameter to be added to the original `mpc` problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).
`unwrapped`	Gets the underlying wrapped instance of an agent.
`warmstart`	Gets the warmstart strategy used within this agent.

property Q: Mpc[SymType]#: Gets the MPC function approximation of the action value function \(Q_\theta(s,a)\).

property V: Mpc[SymType]#: Gets the MPC function approximation of the state value function \(V_\theta(s)\).

action_value(state, action, vals0=None, **kwargs)[source]#

Computes the MPC-based action value function approximation \(Q_\theta(s,a)\).

Parameters:

state (array_like or dict of (str, array_like)): The initial state at which to evaluate the action value function, i.e., \(s\) in \(Q_\theta(s,a)\). It can be either a 1D array containing the values of all initial states of the MPC, concatenated, or a dict whose keys are the names of each state and whose values are the corresponding numerical initial state values.
action (array_like or dict of (str, array_like)): Same as for state, but for the action, i.e., the initial action at which to evaluate the MPC action value function, that is, \(a\) in \(Q_\theta(s,a)\).
vals0 (dict of (str, array_like) or iterable of, optional): A dict (or an iterable of dict, in case a csnlp.multistart.MultistartNlp is used) whose keys are the names of the MPC variables and whose values are the numerical initial values of each variable. Use this argument to warmstart the MPC. If None, and a previous solution (possibly, successful) is available, the MPC solver is automatically warmstarted. If an iterable is passed instead, the warmstarting strategy is bypassed.

Returns:

Solution: The solution of the MPC approximation \(Q_\theta(s,a)\) at the given state and action pair.

Return type:

Solution[TypeVar(SymType, SX, MX)]

cost_perturbation_method = 'normal'#: The name of the method from `numpy.random.Generator` to be used to generate perturbations of the cost function in the state value function \(V_\theta(s)\).

cost_perturbation_parameter = 'cost_perturbation'#: The name of the parameter to be added to the original mpc problem for perturbing the state value function \(V_\theta(s)\).

evaluate(env, episodes, deterministic=True, seed=None, raises=True, env_reset_options=None)[source]#

Evaluates the agent in a given environment.

Parameters:

env (Env[ObsType, ActType]): The gym environment in which to evaluate the agent.
episodes (int): Number of evaluation episodes.
deterministic (bool, optional): Whether the agent should act deterministically, i.e., applying no exploration to the policy provided by the MPC. By default, True.
seed (None, int, array_like of ints, SeedSequence, BitGenerator, Generator): Seed for the agent’s and env’s random number generators. By default, None.
raises (bool, optional): If True, when any of the MPC solver runs fail, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.
env_reset_options (dict, optional): Additional information to specify how the environment is reset at each evaluation episode (optional, depending on the specific environment).

Returns:

array of doubles: The cumulative returns (one return per evaluation episode).

Raises:

MpcSolverError or MpcSolverWarning: Raised if the MPC optimization solver fails: as an error when raises=True, and as a warning otherwise.

Return type:

ndarray[tuple[Any, ...], dtype[floating]]

Notes

After solving \(V_\theta(s)\) for the current env’s state s, the action is passed to the environment as the concatenation of the first optimal action variables of the MPC (see csnlp.Mpc.actions).
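The overall shape of such an evaluation loop can be sketched as follows. This is a simplified, standalone illustration rather than the library's code: `ToyEnv` is a made-up scalar environment that only mimics the `reset`/`step` signatures of gymnasium, and `policy` stands in for the action obtained from solving \(V_\theta(s)\).

```python
class ToyEnv:
    """Hypothetical scalar environment with a gymnasium-like interface."""

    def __init__(self, horizon=5):
        self.horizon = horizon

    def reset(self, seed=None, options=None):
        self.t = 0
        self.state = 1.0
        return self.state, {}

    def step(self, action):
        self.state = 0.5 * self.state + action
        self.t += 1
        reward = self.state**2
        truncated = self.t >= self.horizon  # fixed-length episodes
        return self.state, reward, False, truncated, {}


def evaluate(policy, env, episodes):
    """Run the policy for a number of episodes; return one return per episode."""
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()
        total, done = 0.0, False
        while not done:
            action = policy(state)  # stand-in for the first optimal MPC action
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return returns


# A policy that cancels the dynamics versus one that does nothing.
returns_cancel = evaluate(lambda s: -0.5 * s, ToyEnv(horizon=5), episodes=3)
returns_idle = evaluate(lambda s: 0.0, ToyEnv(horizon=5), episodes=1)
```

Each entry of the returned list is the sum of the rewards collected in one episode, which corresponds to the "one return per evaluation episode" described above.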

property exploration: ExplorationStrategy#: Gets the exploration strategy used within this agent to perturb the policy provided by the MPC controller via \(V_\theta(s)\).

property fixed_parameters#: Gets the fixed parameters of the MPC controller, i.e., the non-learnable ones.

Returns:

None or dict of (str, array_like), or collection of

The returned object can be either

None, if the MPC controller has no fixed parameters
a dict whose keys are the names of the MPC parameters and the values are their corresponding values, when the MPC controller wraps an instance of csnlp.Nlp, or it wraps an instance of csnlp.multistart.MultistartNlp but the same set of parameters is meant to be used for all scenarios
a collection of such dictionaries, when the MPC controller wraps an instance of csnlp.multistart.MultistartNlp and different parameters are meant to be used for each scenario.

init_action_constraint = 'a_init'#: Name of the equality constraint to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).

init_action_parameter = 'a_init'#: Name of the parameter to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).

is_wrapped(*_, **__)[source]#

Gets whether the agent instance is wrapped or not by the wrapper type.

Returns:

bool: A flag indicating whether the agent is wrapped or not.

Return type:

bool

on_env_step(env, episode, timestep)#

Callback called after each call to gymnasium.Env.step.

Parameters:

env (gym env): A gym environment on which the agent is being trained.
episode (int): Number of the training episode.
timestep (int): Time instant of the current training episode.

Return type:

None

on_episode_end(env, episode, rewards)#

Callback called at the end of each episode in the training or evaluation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).

Parameters:

env (gym env): A gym environment on which the agent is being trained.
episode (int): Number of the training episode.
rewards (float): Cumulative rewards for this episode.

Return type:

None

on_episode_start(env, episode, state)#

Callback called at the beginning of each episode in the training or validation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).

Parameters:

env (gym env): A gym environment on which the agent is being trained.
episode (int): Number of the training episode.
state (ObsType): Starting state for this episode.

Return type:

None

on_mpc_failure(episode, timestep, status, raises)#

Callback in case of failure of the MPC solver.

Parameters:

episode (int): Number of the episode when the failure happened.
timestep (int or None): Timestep of the current episode when the failure happened. Can be None, in case the error occurs inter-episodically or no notion of time step is available.
status (str): Status of the solver that failed.
raises (bool): Whether the failure should be raised as an exception (True) or as a warning (False).

Return type:

None
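The raise-or-warn behaviour controlled by `raises` can be sketched as below. `SolverFailure`, `SolverFailureWarning` and `report_failure` are hypothetical stand-ins defined here for illustration; they are not the classes or the callback body used by mpcrl.

```python
import warnings


class SolverFailure(RuntimeError):
    """Hypothetical exception type for a failed MPC solve."""


class SolverFailureWarning(RuntimeWarning):
    """Hypothetical warning type for a failed MPC solve."""


def report_failure(episode, timestep, status, raises):
    """Report a solver failure as an exception or as a warning."""
    where = f"episode {episode}"
    if timestep is not None:  # timestep may be unavailable
        where += f", timestep {timestep}"
    message = f"MPC solver failed at {where} with status: {status}"
    if raises:
        raise SolverFailure(message)
    warnings.warn(message, SolverFailureWarning)


# With raises=False the failure is only recorded as a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_failure(1, None, "Infeasible_Problem_Detected", raises=False)
```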

on_timestep_end(env, episode, timestep)#

Callback called at the end of each time iteration. It is called with the same frequency as on_env_step, but with different timing.

Parameters:

env (gym env): A gym environment on which the agent is being trained.
episode (int): Number of the training episode.
timestep (int): Time instant of the current training episode.

Return type:

None

on_validation_end(env, returns)#

Callback called at the end of the validation process (see mpcrl.Agent.evaluate).

Parameters:

env (gym env): A gym environment on which the agent has been validated.
returns (array of double): Each episode’s cumulative rewards.

Return type:

None

on_validation_start(env)#

Callback called at the beginning of the validation process (see mpcrl.Agent.evaluate).

Parameters:

env (gym env): A gym environment on which the agent is being validated.

Return type:

None

reset(seed=None)[source]#

Resets the agent. This includes resetting the warmstart strategy, the exploration strategy, and some internal variables of the agent.

Parameters:

seed (RngType, optional): The seed to reset the numpy.random.Generator instances. By default, None.

Return type:

None

state_value(state, deterministic=False, vals0=None, action_space=None, **kwargs)[source]#

Computes the MPC-based state value function approximation \(V_\theta(s)\).

Parameters:

state (array_like or dict of (str, array_like)): The initial state at which to evaluate the MPC approximation of the state value function, i.e., \(s\) in \(V_\theta(s)\). It can be either a 1D array containing the values of all initial states of the MPC, concatenated, or a dict whose keys are the names of each state and whose values are the corresponding numerical initial state values.
deterministic (bool, optional): If False, the MPC controller is perturbed according to the exploration strategy to induce some exploratory behaviour. Otherwise, no perturbation is performed. By default, False.
vals0 (dict of (str, array_like) or iterable of, optional): A dict (or an iterable of dict, in case a csnlp.multistart.MultistartNlp is used) whose keys are the names of the MPC variables and whose values are the numerical initial values of each variable. Use this argument to warmstart the MPC. If None, and a previous solution (possibly, successful) is available, the MPC solver is automatically warmstarted. If an iterable is passed instead, the warmstarting strategy is bypassed.
action_space (gymnasium.spaces.Box, optional): The action space of the environment the agent is being evaluated/trained on. If not None, it is used to clip the action back into the action space whenever an additive exploration perturbation is added to it.

Returns:

casadi.DM: The first optimal action according to the solution of the state value function, possibly perturbed by exploration noise, i.e.,

\[u_0^\star = \arg\min_{u} V_\theta(s)\]
Solution: The solution of the MPC approximation \(V_\theta(s)\) at the given state.

Return type:

tuple[DM, Solution[TypeVar(SymType, SX, MX)]]
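The role of `action_space` when an additive perturbation is applied can be illustrated with the sketch below. It is a standalone approximation of the idea, not mpcrl code: `perturb_and_clip` is a made-up helper, Gaussian noise is assumed for the perturbation, and the bounds `low` and `high` play the role of the limits of a box-shaped action space.

```python
import random


def perturb_and_clip(action, low, high, noise_std, rng):
    """Add Gaussian noise to each action entry, then clip to [low, high]."""
    perturbed = [a + rng.gauss(0.0, noise_std) for a in action]
    return [min(max(p, lo), hi) for p, lo, hi in zip(perturbed, low, high)]


rng = random.Random(0)  # seeded generator for reproducibility
# Large noise: the result is still guaranteed to lie within the bounds.
noisy = perturb_and_clip([0.9, -0.9], [-1.0, -1.0], [1.0, 1.0], 5.0, rng)
# Zero noise: the action is returned unchanged.
unchanged = perturb_and_clip([0.3], [-1.0], [1.0], 0.0, rng)
```

Without the clipping step, a perturbed action could fall outside the bounds accepted by the environment.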

property unwrapped: Agent#: Gets the underlying wrapped instance of an agent. In this case, since the agent is not wrapped at all, returns itself.

property warmstart: WarmStartStrategy#: Gets the warmstart strategy used within this agent. This strategy is used to generate the initial guess for the solver to optimize the MPC’s NLP.

Examples using `mpcrl.Agent`#

Off-policy Q-learning

On-policy Deterministic Policy Gradient

On-policy Q-learning
