mpcrl.LstdDpgAgent
- class mpcrl.LstdDpgAgent(mpc, update_strategy, discount_factor, optimizer, learnable_parameters, exploration, fixed_parameters=None, experience=None, warmstart='last-successful', rollout_length=-1, record_policy_performance=False, record_policy_gradient=False, state_features=None, linsolver='csparse', ridge_regression_regularization=1e-06, use_last_action_on_fail=False, name=None)
Bases: RlLearningAgent[SymType, tuple[ndarray[tuple[Any, …], dtype[floating]], ndarray[tuple[Any, …], dtype[floating]], ndarray[tuple[Any, …], dtype[floating]], ndarray[tuple[Any, …], dtype[floating]], ndarray[tuple[Any, …], dtype[floating]]], LrType], Generic[SymType, LrType]
Least-Squares Temporal Difference (LSTD) Deterministic Policy Gradient (DPG) agent, introduced in [6] in its stochastic variant and refined in [4]. An application can be found in [2].
The DPG agent uses an MPC controller as both policy provider and function approximator. It adjusts the MPC parametrization according to the temporal-difference error and improves the policy directly, by estimating the policy gradient and descending along it.
- Parameters:
- mpc : csnlp.wrappers.Mpc or tuple of csnlp.wrappers.Mpc
The MPC controller used as policy provider by this agent. If a tuple, the first entry is used to create the approximation of the state value function \(V_\theta(s)\) and the second for that of the action value function \(Q_\theta(s,a)\). Otherwise, the instance is modified in place to create both approximations, so it is recommended not to modify it further after the agent is initialized. Moreover, some parameters and constraints need to be added to the MPC, so an error is raised if their names are already in use.
- update_strategy : UpdateStrategy or int
The strategy that decides how frequently the MPC parameters are updated. If an int is passed, the default strategy that updates every n episodes is used (where n is the argument passed); otherwise, an instance of core.update.UpdateStrategy can be passed to specify the desired strategy in more detail.
- discount_factor : float
In RL, the factor that discounts future rewards in favor of immediate rewards. Usually denoted as \(\gamma\). It should satisfy \(\gamma \in (0, 1]\).
- optimizer : GradientBasedOptimizer
A gradient-based optimizer (e.g., optim.GradientDescent) to compute the updates of the learnable parameters, based on the current gradient-based RL algorithm.
- learnable_parameters : core.parameters.LearnableParametersDict
A special dict containing the learnable parameters of the MPC (usually referred to as \(\theta\)), together with their bounds and values. This dict is complementary to fixed_parameters, which contains the MPC parameters that are not learnt by the agent.
- exploration : core.exploration.ExplorationStrategy
Exploration strategy for inducing exploration in the online MPC policy. It is mandatory for DPG agents to have exploration.
- fixed_parameters : dict of (str, array_like) or collection of, optional
A dict (or a collection of dicts, in case the mpc wraps an underlying csnlp.multistart.MultistartNlp instance) whose keys are the names of the MPC parameters and whose values are their corresponding values. Use this to specify fixed, i.e., non-learnable, parameters. If None, no fixed parameters are assumed.
- experience : int or ExperienceReplay, optional
The container for the experience replay memory. If None is passed, a memory of unit length is created, i.e., it keeps only the latest transition. If an integer n is passed, a memory of length n and sample size n is created. Otherwise, pass an instance of core.experience.ExperienceReplay to specify the requirements in more detail.
- warmstart : “last” or “last-successful” or WarmStartStrategy, optional
The warmstart strategy for the MPC’s NLP. If "last-successful", the last successful solution is used to warmstart the solver for the next iteration. If "last", the last solution is used, regardless of success or failure. Furthermore, an instance of core.warmstart.WarmStartStrategy can be passed to specify a strategy for generating multiple warmstart points for the MPC’s NLP instance. This is useful to generate multiple initial conditions for highly non-convex, nonlinear problems. This feature can only be used with an MPC that has an underlying multistart NLP problem (see csnlp.multistart).
- rollout_length : int, optional
Number of steps of each closed-loop simulation, which defines a complete trajectory of the states (i.e., a rollout) and is saved in the experience as a single item (since LSTD DPG needs to draw samples of trajectories). If the env is episodic, it can be -1, in which case the rollout length coincides with the episode’s length. If the env is not episodic, i.e., it never terminates, a length > 0 must be given in order to know when to save the current trajectory as an atomic item in memory.
- record_policy_performance : bool, optional
If True, the performance of each rollout is stored in the field policy_performances, which otherwise is None. By default, it is not recorded.
- record_policy_gradient : bool, optional
If True, the (estimated) policy gradient of each update is stored in the field policy_gradients, which otherwise is None. By default, it is not recorded.
- state_features : casadi.Function, optional
The state feature vector to be used in the linear approximation of the value function, which takes the form
\[V_v(s) = \Phi(s)^\top v,\]
where \(s\) is the state, \(v\) are the weights, and \(\Phi(s)\) is the state feature vector. This function is assumed to have one input and one output. By default, if not provided, it is built from all monomials of the state with degrees <= 2 (see util.math.monomials_basis_function).
- linsolver : “csparse” or “mldivide”, optional
The type of linear solver used to solve the linear system derived from the KKT conditions, which is needed to estimate the gradient of the policy. By default, "csparse" is chosen, as the KKT matrix is most often sparse.
- ridge_regression_regularization : float, optional
Ridge regression regularization used when computing the LSTD weights via least-squares. By default, 1e-6.
- use_last_action_on_fail : bool, optional
In case the MPC solver fails:
- if False, the action from the last solver’s iteration is returned anyway (though suboptimal);
- if True, the action from the last successful call to the MPC is returned instead (if the MPC has been solved at least once successfully).
By default, False.
- name : str, optional
Name of the agent. If None, one is automatically created from a counter of the class’s instances.
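The state_features and ridge_regression_regularization parameters above can be illustrated with a small NumPy sketch. It builds monomial features up to degree 2 and fits the weights \(v\) of \(V_v(s) = \Phi(s)^\top v\) with a textbook ridge-regularized LSTD solve. The function names, the inclusion of the constant (degree 0) monomial, and the exact form of the linear system are assumptions made for illustration; the library's own computation may differ.

```python
import itertools

import numpy as np


def monomial_features(s, max_degree=2):
    """All monomials of the entries of `s` with total degree 0..max_degree.

    Hypothetical helper; including the constant term is an assumption.
    """
    s = np.asarray(s, dtype=float).ravel()
    feats = []
    for degree in range(max_degree + 1):
        for idx in itertools.combinations_with_replacement(range(s.size), degree):
            feats.append(np.prod(s[list(idx)]))  # empty product equals 1
    return np.array(feats)


def lstd_weights(states, costs, next_states, gamma, ridge=1e-6):
    """Textbook LSTD: solve (Phi^T (Phi - gamma Phi') + ridge I) v = Phi^T c."""
    phi = np.array([monomial_features(s) for s in states])
    phi_next = np.array([monomial_features(s) for s in next_states])
    a_mat = phi.T @ (phi - gamma * phi_next)
    b_vec = phi.T @ np.asarray(costs, dtype=float)
    return np.linalg.solve(a_mat + ridge * np.eye(a_mat.shape[0]), b_vec)


# Toy closed loop s' = 0.5 s with stage cost s^2 and gamma = 0.9, whose
# true value function is s^2 / (1 - 0.9 * 0.25) = s^2 / 0.775.
states = np.linspace(-2.0, 2.0, 21).reshape(-1, 1)
next_states = 0.5 * states
costs = (states**2).ravel()
v = lstd_weights(states, costs, next_states, gamma=0.9)
```

Because the true value function of this toy system is quadratic, it lies in the span of the features, so the fitted weights recover it up to the small bias introduced by the ridge term.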
- Raises:
- ValueError
If the exploration strategy is None or an instance of NoExploration, as DPG requires exploration.
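As a rough illustration of the rollout_length parameter described above, the following sketch groups a flat sequence of transitions into rollouts that would each be stored as one item. The helper name is hypothetical, and dropping a trailing incomplete chunk is an assumption of this sketch rather than documented behaviour.

```python
def split_into_rollouts(transitions, rollout_length):
    """Group transitions into rollouts (hypothetical helper).

    rollout_length == -1: the whole finite episode is a single rollout.
    rollout_length > 0: one rollout every `rollout_length` transitions;
    a trailing incomplete chunk is dropped (assumption of this sketch).
    """
    transitions = list(transitions)
    if rollout_length == -1:
        return [transitions]
    if rollout_length <= 0:
        raise ValueError("rollout_length must be -1 or a positive integer")
    n_full = len(transitions) // rollout_length
    return [
        transitions[i * rollout_length : (i + 1) * rollout_length]
        for i in range(n_full)
    ]


episode = list(range(7))  # stand-ins for 7 transitions
fixed_length = split_into_rollouts(episode, 3)
whole_episode = split_into_rollouts(episode, -1)
```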
Notes
If a second-order gradient-based optimizer is provided, then the Fisher information matrix is used to perform a second-order natural policy gradient update. Otherwise, a first-order update is performed.
Methods
- action_value(state, action[, vals0]): Computes the MPC-based action value function approximation \(Q_\theta(s,a)\).
- evaluate(*args, **kwargs): Evaluates the agent in a given environment.
- is_wrapped(*_, **__): Gets whether the agent instance is wrapped or not by the wrapper type.
- on_env_step(env, episode, timestep): Callback called after each call to gymnasium.Env.step.
- on_episode_end(env, episode, rewards): Callback called at the end of each episode in the training or evaluation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).
- on_episode_start(env, episode, state): Callback called at the beginning of each episode in the training or validation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).
- on_mpc_failure(episode, timestep, status, raises): Callback in case of failure of the MPC solver.
- on_timestep_end(env, episode, timestep): Callback called at the end of each time iteration.
- on_training_end(env, returns): Callback called at the end of the training process.
- on_training_start(env): Callback called at the beginning of the training process.
- on_update(): Callback called after each mpcrl.LearningAgent.update.
- on_update_failure(episode, timestep, ...): Callback in case of update failure.
- on_validation_end(env, returns): Callback called at the end of the validation process (see mpcrl.Agent.evaluate).
- on_validation_start(env): Callback called at the beginning of the validation process (see mpcrl.Agent.evaluate).
- reset([seed]): Resets the agent.
- state_value(state[, deterministic, vals0, ...]): Computes the MPC-based state value function approximation \(V_\theta(s)\).
- store_experience(item): Stores the given item in the agent's experience for later usage in updating the parametrization.
- train(env, episodes[, seed, raises, ...]): On-policy training of the agent on an environment.
- train_offpolicy(episode_rollouts[, seed, raises]): Off-policy training of the agent on an environment.
- train_one_episode(env, episode, init_state): On-policy training of the agent on an environment for one single episode.
- train_one_rollout(rollout, episode[, raises]): Train the agent in an off-policy manner on the given rollout.
- update(): Updates the learnable parameters (usually referred to as \(\theta\)) of the MPC according to the agent's learning algorithm.
Attributes
- Q: Gets the MPC function approximation of the action value function \(Q_\theta(s,a)\).
- V: Gets the MPC function approximation of the state value function \(V_\theta(s)\).
- cost_perturbation_method: The name of the method from numpy.random.Generator to be used to generate perturbations of the cost function in the state value function \(V_\theta(s)\).
- cost_perturbation_parameter: The name of the parameter to be added to the original mpc problem for perturbing the state value function \(V_\theta(s)\).
- experience: Gets the experience replay memory of the agent.
- exploration: Gets the exploration strategy used within this agent to perturb the policy provided by the MPC controller via \(V_\theta(s)\).
- fixed_parameters: Gets the fixed parameters of the MPC controller, i.e., the non-learnable ones.
- init_action_constraint: Name of the equality constraint to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).
- init_action_parameter: Name of the parameter to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).
- learnable_parameters: Gets the parameters of the MPC that can be learnt by the agent.
- unwrapped: Gets the underlying wrapped instance of an agent.
- update_strategy: Gets the update strategy of the agent.
- warmstart: Gets the warmstart strategy used within this agent.
- property Q: Mpc[SymType]
Gets the MPC function approximation of the action value function \(Q_\theta(s,a)\).
- property V: Mpc[SymType]
Gets the MPC function approximation of the state value function \(V_\theta(s)\).
- action_value(state, action, vals0=None, **kwargs)
Computes the MPC-based action value function approximation \(Q_\theta(s,a)\).
- Parameters:
- state : array_like or dict of (str, array_like)
The initial state at which to evaluate the action value function, i.e., \(s\) in \(Q_\theta(s,a)\). It can be either a 1D array with the values of all initial states of the MPC, concatenated, or a dict whose keys are the names of each state and whose values are their numerical initial values.
- action : array_like or dict of (str, array_like)
Same as for state, but for the action, i.e., the initial action at which to evaluate the MPC action value function, \(a\) in \(Q_\theta(s,a)\).
- vals0 : dict of (str, array_like) or iterable of, optional
A dict (or an iterable of dicts, in case csnlp.multistart.MultistartNlp is used) whose keys are the names of the MPC variables and whose values are the numerical initial values of each variable. Use this argument to warmstart the MPC. If None, and a previous solution (possibly, successful) is available, the MPC solver is automatically warmstarted. If an iterable is passed instead, the warmstarting strategy is bypassed.
- Returns:
- Solution
The solution of the MPC approximation \(Q_\theta(s,a)\) at the given state and action pair.
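Since \(Q_\theta(s,a)\) is the same MPC problem as \(V_\theta(s)\) with the first action fixed to \(a\), the two are related by \(V_\theta(s) = \min_a Q_\theta(s,a)\). The sketch below checks this relation on a toy scalar cost by grid search. The cost function and helper names are made up for illustration and involve no MPC solver.

```python
def q_value(s, a):
    """Toy action value: action penalty plus squared next state s + a."""
    return 0.1 * a**2 + (s + a) ** 2


def v_value(s, candidate_actions):
    """State value as the minimum of Q over candidate first actions."""
    best_action = min(candidate_actions, key=lambda a: q_value(s, a))
    return q_value(s, best_action), best_action


# Grid of candidate actions in [-3, 3] with step 1e-3.
grid = [i / 1000.0 for i in range(-3000, 3001)]
value, action = v_value(1.1, grid)
# Analytically, the minimizer is a = -s / 1.1 and the minimum is s^2 / 11.
```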
- cost_perturbation_method = 'normal'
The name of the method from numpy.random.Generator to be used to generate perturbations of the cost function in the state value function \(V_\theta(s)\).
- cost_perturbation_parameter = 'cost_perturbation'
The name of the parameter to be added to the original mpc problem for perturbing the state value function \(V_\theta(s)\).
- evaluate(*args, **kwargs)
Evaluates the agent in a given environment.
- Parameters:
- env : Env[ObsType, ActType]
The gym environment in which to evaluate the agent.
- episodes : int
Number of evaluation episodes.
- deterministic : bool, optional
Whether the agent should act deterministically, i.e., applying no exploration to the policy provided by the MPC. By default, True.
- seed : None, int, array_like of ints, SeedSequence, BitGenerator, Generator
Seed for the agent’s and env’s random number generators. By default, None.
- raises : bool, optional
If True, when any of the MPC solver runs fails, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.
- env_reset_options : dict, optional
Additional information to specify how the environment is reset at each evaluation episode (optional, depending on the specific environment).
- Returns:
- array of doubles
The cumulative returns (one return per evaluation episode).
- Raises:
- MpcSolverError or MpcSolverWarning
Raises the error or the warning (depending on raises) if the MPC optimization solver fails.
Notes
After solving \(V_\theta(s)\) for the current env’s state \(s\), the action is passed to the environment as the concatenation of the first optimal action variables of the MPC (see csnlp.Mpc.actions).
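The evaluation procedure described above can be pictured as a plain closed-loop simulation that accumulates one return per episode. The sketch below is only an illustration of that loop: ToyEnv is a hypothetical environment that mimics the Gymnasium reset/step call shapes, and the policy callable stands in for solving \(V_\theta(s)\) and taking the first optimal action.

```python
class ToyEnv:
    """Hypothetical scalar environment with Gymnasium-like reset/step."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0
        self.state = 1.0

    def reset(self, seed=None, options=None):
        self.t = 0
        self.state = 1.0
        return self.state, {}

    def step(self, action):
        reward = -(self.state**2 + action**2)
        self.state = 0.5 * self.state + action
        self.t += 1
        truncated = self.t >= self.horizon  # time limit, no terminal state
        return self.state, reward, False, truncated, {}


def evaluate_policy(policy, env, episodes):
    """Run `episodes` closed-loop simulations; return one sum of rewards each."""
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()
        done, total = False, 0.0
        while not done:
            action = policy(state)  # stand-in for the MPC's first action
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return returns


returns = evaluate_policy(lambda s: 0.0, ToyEnv(horizon=5), episodes=2)
```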
- property experience: ExperienceReplay[ExpType]#
Gets the experience replay memory of the agent.
- property exploration: ExplorationStrategy#
Gets the exploration strategy used within this agent to perturb the policy provided by the MPC controller via \(V_\theta(s)\).
- property fixed_parameters: None | dict[str, Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]] | Collection[dict[str, Buffer | _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | complex | bytes | str | _NestedSequence[complex | bytes | str]]]#
Gets the fixed parameters of the MPC controller, i.e., the non-learnable ones.
- Returns:
- None or dict of (str, array_like), or collection of
The returned object can be either
- None, if the MPC controller has no fixed parameters;
- a dict whose keys are the names of the MPC parameters and whose values are their corresponding values, when the MPC controller wraps an instance of csnlp.Nlp, or it wraps an instance of csnlp.multistart.MultistartNlp but the same set of parameters is meant to be used for all scenarios;
- a collection of such dictionaries, when the MPC controller wraps an instance of csnlp.multistart.MultistartNlp and different parameters are meant to be used for each scenario.
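Code that consumes this property has to handle all three shapes. One way to do so, sketched here with a hypothetical helper that is not part of the library, is to normalize the value into one dict per scenario.

```python
def per_scenario_parameters(fixed_parameters, n_scenarios):
    """Normalize None / dict / collection of dicts into a list of dicts.

    Hypothetical helper for illustration only.
    """
    if fixed_parameters is None:
        return [{} for _ in range(n_scenarios)]
    if isinstance(fixed_parameters, dict):
        # The same parameters are shared by every scenario.
        return [fixed_parameters for _ in range(n_scenarios)]
    params = list(fixed_parameters)
    if len(params) != n_scenarios:
        raise ValueError(
            f"expected {n_scenarios} parameter dicts, got {len(params)}"
        )
    return params


none_case = per_scenario_parameters(None, 2)
shared_case = per_scenario_parameters({"A": 1.0}, 2)
distinct_case = per_scenario_parameters([{"A": 1.0}, {"A": 2.0}], 2)
```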
- init_action_constraint = 'a_init'
Name of the equality constraint to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).
- init_action_parameter = 'a_init'
Name of the parameter to be added to the original mpc problem for constraining the first action to be equal to \(a\) in the action value function \(Q_\theta(s,a)\).
- is_wrapped(*_, **__)
Gets whether the agent instance is wrapped or not by the wrapper type.
- Returns:
- bool
A flag indicating whether the agent is wrapped or not.
- property learnable_parameters: LearnableParametersDict
Gets the parameters of the MPC that can be learnt by the agent.
- on_env_step(env, episode, timestep)
Callback called after each call to gymnasium.Env.step.
- Parameters:
- env : gym env
The gym environment on which the agent is being trained.
- episode : int
Number of the training episode.
- timestep : int
Time instant of the current training episode.
- on_episode_end(env, episode, rewards)
Callback called at the end of each episode in the training or evaluation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).
- Parameters:
- env : gym env
The gym environment on which the agent is being trained.
- episode : int
Number of the training episode.
- rewards : float
Cumulative rewards for this episode.
- on_episode_start(env, episode, state)
Callback called at the beginning of each episode in the training or validation process (see mpcrl.Agent.evaluate, mpcrl.LearningAgent.train and mpcrl.LearningAgent.train_offpolicy).
- Parameters:
- env : gym env
The gym environment on which the agent is being trained.
- episode : int
Number of the training episode.
- state : ObsType
Starting state for this episode.
- on_mpc_failure(episode, timestep, status, raises)
Callback in case of failure of the MPC solver.
- Parameters:
- episode : int
Number of the episode when the failure happened.
- timestep : int or None
Timestep of the current episode when the failure happened. Can be None, in case the error occurs inter-episodically or no notion of time step is available.
- status : str
Status of the solver that failed.
- raises : bool
Whether the failure should be raised as an exception (True) or as a warning (False).
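The raise-or-warn behaviour controlled by raises can be sketched with Python's standard warnings module. The exception and warning classes below are local stand-ins with hypothetical names, not the library's own classes, and the message format is likewise made up.

```python
import warnings


class SolverError(RuntimeError):
    """Stand-in for an MPC solver error (hypothetical name)."""


class SolverWarning(RuntimeWarning):
    """Stand-in for an MPC solver warning (hypothetical name)."""


def report_mpc_failure(episode, timestep, status, raises):
    """Raise an exception if `raises` is True, else emit a warning."""
    where = f"episode {episode}"
    if timestep is not None:  # None when no notion of time step applies
        where += f", timestep {timestep}"
    message = f"MPC failure at {where}: {status}"
    if raises:
        raise SolverError(message)
    warnings.warn(message, SolverWarning, stacklevel=2)


# raises=False: the failure is reported as a warning and execution continues.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_mpc_failure(3, None, "Infeasible_Problem_Detected", raises=False)
warned = [str(w.message) for w in caught if issubclass(w.category, SolverWarning)]

# raises=True: the failure interrupts execution with an exception.
try:
    report_mpc_failure(3, 7, "Infeasible_Problem_Detected", raises=True)
    raised = None
except SolverError as err:
    raised = str(err)
```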
- on_timestep_end(env, episode, timestep)
Callback called at the end of each time iteration. It is called with the same frequency as on_env_step, but with different timing.
- Parameters:
- env : gym env
The gym environment on which the agent is being trained.
- episode : int
Number of the training episode.
- timestep : int
Time instant of the current training episode.
- on_training_end(env, returns)
Callback called at the end of the training process.
- Parameters:
- env : gym env
The gym environment on which the agent has been trained.
- returns : array of double
Each episode’s cumulative rewards.
- on_training_start(env)
Callback called at the beginning of the training process.
- Parameters:
- env : gym env
The gym environment on which the agent is being trained.
- on_update()
Callback called after each mpcrl.LearningAgent.update. This callback is especially useful for, e.g., decaying exploration probabilities or learning rates.
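The decaying use case mentioned above can be pictured with a toy agent whose on_update hook shrinks an exploration strength after every update. All class and attribute names in this sketch are hypothetical and only mirror the callback pattern; they are not the library's classes.

```python
class DecayingValue:
    """A scalar that is multiplied by `decay_rate` on every step."""

    def __init__(self, value, decay_rate):
        self.value = value
        self.decay_rate = decay_rate

    def step(self):
        self.value *= self.decay_rate


class ToyAgent:
    """Hypothetical agent that triggers `on_update` after each update."""

    def __init__(self):
        self.exploration_strength = DecayingValue(0.5, decay_rate=0.5)
        self.n_updates = 0

    def update(self):
        self.n_updates += 1  # the parameter update itself would happen here
        self.on_update()

    def on_update(self):
        # Hook: decay exploration once per completed update.
        self.exploration_strength.step()


agent = ToyAgent()
for _ in range(3):
    agent.update()
```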
- on_update_failure(episode, timestep, errormsg, raises)
Callback in case of update failure.
- Parameters:
- episode : int
Number of the episode when the failure happened.
- timestep : int or None
Timestep of the current episode when the failure happened. Can be None, in case the update occurs inter-episodically or no notion of time step is available.
- errormsg : str
Error message of the update failure.
- raises : bool
Whether the failure should be raised as an exception (True) or as a warning (False).
- on_validation_end(env, returns)
Callback called at the end of the validation process (see mpcrl.Agent.evaluate).
- Parameters:
- env : gym env
The gym environment on which the agent has been validated.
- returns : array of double
Each episode’s cumulative rewards.
- on_validation_start(env)
Callback called at the beginning of the validation process (see mpcrl.Agent.evaluate).
- Parameters:
- env : gym env
The gym environment on which the agent is being validated.
- reset(seed=None)
Resets the agent. This includes resetting the warmstart strategy, the exploration strategy, and some internal variables of the agent.
- Parameters:
- seed : RngType, optional
The seed to reset the numpy.random.Generator instances. By default, None.
- state_value(state, deterministic=False, vals0=None, action_space=None, **kwargs)
Computes the MPC-based state value function approximation \(V_\theta(s)\).
- Parameters:
- state : array_like or dict of (str, array_like)
The initial state at which to evaluate the MPC approximation of the state value function, i.e., \(s\) in \(V_\theta(s)\). It can be either a 1D array with the values of all initial states of the MPC, concatenated, or a dict whose keys are the names of each state and whose values are their numerical initial values.
- deterministic : bool, optional
If False, the MPC controller is perturbed according to the exploration strategy to induce some exploratory behaviour. Otherwise, no perturbation is performed. By default, False.
- vals0 : dict of (str, array_like) or iterable of, optional
A dict (or an iterable of dicts, in case csnlp.multistart.MultistartNlp is used) whose keys are the names of the MPC variables and whose values are the numerical initial values of each variable. Use this argument to warmstart the MPC. If None, and a previous solution (possibly, successful) is available, the MPC solver is automatically warmstarted. If an iterable is passed instead, the warmstarting strategy is bypassed.
- action_space : gymnasium.spaces.Box, optional
The action space of the environment the agent is being evaluated/trained on. If not None, it is used to clip the action back into the action space whenever an additive exploration perturbation is summed to it.
- Returns:
- casadi.DM
The first optimal action according to the solution of the state value function, possibly perturbed by exploration noise, i.e.,
\[u_0^\star = \arg\min_{u} V_\theta(s)\]
- Solution
The solution of the MPC approximation \(V_\theta(s)\) at the given state.
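The role of action_space when exploration is additive can be sketched in NumPy: noise is summed to the greedy action, and the result is clipped back into the box bounds. The function name, the Gaussian noise model and the strength argument are assumptions for illustration, not the library's exploration strategies.

```python
import numpy as np


def perturb_and_clip(action, rng, strength, low, high, deterministic=False):
    """Add Gaussian noise to `action`, then clip into [low, high].

    Hypothetical helper; with deterministic=True the action is returned as is.
    """
    action = np.asarray(action, dtype=float)
    if deterministic:
        return action
    noisy = action + strength * rng.normal(size=action.shape)
    return np.clip(noisy, low, high)


rng = np.random.default_rng(0)
low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
greedy = perturb_and_clip([0.9, -0.9], rng, 5.0, low, high, deterministic=True)
explored = perturb_and_clip([0.9, -0.9], rng, 5.0, low, high)
```

Without the clipping step, a large perturbation could push the applied action outside the bounds that the environment accepts.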
- store_experience(item)
Stores the given item in the agent’s experience for later usage in updating the parametrization.
- Parameters:
- item : ExpType
Item to be stored in memory.
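A bounded replay memory of the kind described for the experience argument (unit length by default, or length n with sample size n) can be imitated with a standard-library deque. This is a simplified stand-in for illustration, not the ExperienceReplay class itself, and the helper names are hypothetical.

```python
import random
from collections import deque


def make_memory(experience=None):
    """None -> keep only the latest item; int n -> keep the latest n items."""
    maxlen = 1 if experience is None else int(experience)
    return deque(maxlen=maxlen)


def sample_all(memory, rng):
    """Draw every stored item in random order (sample size = memory length)."""
    return rng.sample(list(memory), k=len(memory))


latest_only = make_memory()
bounded = make_memory(3)
for rollout_id in range(5):  # integers stand in for stored rollouts
    latest_only.append(rollout_id)
    bounded.append(rollout_id)
batch = sample_all(bounded, random.Random(0))
```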
- train(env, episodes, seed=None, raises=True, env_reset_options=None)
On-policy training of the agent on an environment.
- Parameters:
- env : Env[ObsType, ActType]
The gym environment on which to train the agent.
- episodes : int
Number of training episodes.
- seed : None, int, array_like of ints, SeedSequence, BitGenerator, Generator
Seed for the agent’s and env’s random number generators. By default, None.
- raises : bool, optional
If True, when any of the MPC solver runs fails, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.
- env_reset_options : dict, optional
Additional information to specify how the environment is reset at each training episode (optional, depending on the specific environment).
- Returns:
- array of doubles
The cumulative returns for each training episode.
- Raises:
- MpcSolverError or MpcSolverWarning
Raises the error or the warning (depending on raises) if any of the MPC solvers fail.
- UpdateError or UpdateWarning
Raises the error or the warning (depending on raises) if the update fails.
- train_offpolicy(episode_rollouts, seed=None, raises=True)
Off-policy training of the agent on an environment.
- Parameters:
- episode_rollouts : iterable of iterables of any
An iterable of episodic rollouts generated in an off-policy fashion. Each rollout is itself a sequence of transitions, e.g., SARSA tuples. In other words, episode_rollouts is a sequence of sequences of tuples. In general, however, its nature and that of the tuples can differ widely from one learning algorithm to another.
- seed : None, int, array_like of ints, SeedSequence, BitGenerator, Generator
Seed for the agent’s random number generator. By default, None.
- raises : bool, optional
If True, when any of the MPC solver runs fails, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.
- Raises:
- MpcSolverError or MpcSolverWarning
Raises the error or the warning (depending on raises) if any of the MPC solvers fail.
- UpdateError or UpdateWarning
Raises the error or the warning (depending on raises) if the update fails.
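The nested "sequence of sequences of tuples" shape of episode_rollouts can be made concrete with a toy generator. The (state, action, cost, next_state) tuple layout, the random behaviour policy and the scalar dynamics are all assumptions for illustration; the tuple contents actually expected depend on the learning algorithm.

```python
import random


def behaviour_rollouts(n_episodes, n_steps, seed=0):
    """Generate rollouts from a random behaviour policy on toy dynamics."""
    rng = random.Random(seed)
    rollouts = []
    for _ in range(n_episodes):
        state, rollout = 1.0, []
        for _ in range(n_steps):
            action = rng.uniform(-1.0, 1.0)  # off-policy: random actions
            next_state = 0.5 * state + action
            cost = state**2 + action**2
            rollout.append((state, action, cost, next_state))
            state = next_state
        rollouts.append(rollout)
    return rollouts


# Outer sequence: episodes. Inner sequence: transitions of one episode.
episode_rollouts = behaviour_rollouts(n_episodes=3, n_steps=4)
```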
- train_one_episode(env, episode, init_state, raises=True)
On-policy training of the agent on an environment for one single episode.
- Parameters:
- env : Env[ObsType, ActType]
The gym environment on which to train the agent.
- episode : int
Number of the current training episode.
- init_state : observation type
Initial state/observation of the environment.
- raises : bool, optional
If True, when any of the MPC solver runs fails, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.
- Returns:
- float
The cumulative rewards for this training episode.
- Raises:
- MpcSolverError or MpcSolverWarning
Raises the error or the warning (depending on raises) if any of the MPC solvers fail.
- UpdateError or UpdateWarning
Raises the error or the warning (depending on raises) if the update fails.
- train_one_rollout(rollout, episode, raises=True)
Train the agent in an off-policy manner on the given rollout.
- Parameters:
- rollout : iterable of any
Rollout, i.e., a sequence of transitions generated off-policy, e.g., SARSA tuples. In general, however, these tuples can be of a different nature, depending on the specific learning algorithm.
- raises : bool, optional
If True, when any of the MPC solver runs fails, or when an update fails, the corresponding error is raised; otherwise, only a warning is raised.
- Raises:
- MpcSolverError or MpcSolverWarning
Raises the error or the warning (depending on raises) if any of the MPC solvers fail.
- UpdateError or UpdateWarning
Raises the error or the warning (depending on raises) if the update fails.
- property unwrapped: Agent
Gets the underlying wrapped instance of an agent. In this case, since the agent is not wrapped at all, returns itself.
- update()
Updates the learnable parameters (usually referred to as \(\theta\)) of the MPC according to the agent’s learning algorithm.
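The class Notes distinguish a first-order update from a second-order natural policy gradient update based on the Fisher information matrix. The generic NumPy sketch below contrasts the two: the natural step preconditions the gradient with the inverse Fisher matrix. Function names, the learning-rate handling and the small regularization term are assumptions, and the library's optimizers may additionally enforce parameter bounds.

```python
import numpy as np


def first_order_step(theta, grad, lr):
    """Plain gradient descent on the parameters."""
    return theta - lr * grad


def natural_gradient_step(theta, grad, fisher, lr, reg=1e-8):
    """Descend along F^{-1} grad, with a small diagonal regularization."""
    direction = np.linalg.solve(fisher + reg * np.eye(theta.size), grad)
    return theta - lr * direction


theta = np.array([1.0, 1.0])
grad = np.array([2.0, 4.0])
fisher = np.diag([2.0, 4.0])  # toy Fisher information matrix

plain = first_order_step(theta, grad, lr=0.1)
natural = natural_gradient_step(theta, grad, fisher, lr=1.0, reg=0.0)
```

In this toy case the Fisher matrix rescales each gradient component, so the natural step moves both parameters by the same amount even though their raw gradients differ.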
- property update_strategy: UpdateStrategy
Gets the update strategy of the agent.
- property warmstart: WarmStartStrategy
Gets the warmstart strategy used within this agent. This strategy is used to generate the initial guess for the solver to optimize the MPC’s NLP.