mpcrl.optim.NewtonMethod

class mpcrl.optim.NewtonMethod(learning_rate, weight_decay=0.0, cho_before_update=False, cho_maxiter=1000, cho_solve_kwargs=None, hook='on_update', max_percentage_update=inf, bound_consistency=False)

Bases: GradientBasedOptimizer[LrType]

Second-order gradient-based Newton’s method.

In contrast to first-order methods, Newton's method also uses the Hessian of the loss function to compute the update. The unconstrained update is given by

\[\theta \gets \theta - \alpha H^{-1} g.\]

However, the provided Hessian is not used directly. Instead, its Cholesky decomposition is used, after the Hessian has been made positive semi-definite via cholesky_added_multiple_identity. As usual, weight decay can be added, but for the sake of simplicity it is not included in the formula above. If there are constraints on the learnable parameters, the update is solved as a Quadratic Programming (QP) problem, which is slower than the unconstrained counterpart. This QP takes the form

\[\begin{split}\begin{aligned} \min_{\Delta\theta} & \quad \frac{1}{2} \Delta\theta^\top H \Delta\theta + \alpha g^\top \Delta\theta \\ \text{s.t.} & \quad \theta_{\text{lower}} \leq \theta + \Delta\theta \leq \theta_{\text{upper}} \end{aligned}\end{split}\]

if cho_before_update=False; otherwise, the objective is

\[\frac{1}{2} \lVert \Delta\theta \rVert_2^2 + \alpha (H^{-1} g)^\top \Delta\theta\]
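The relation between the unconstrained update and the two QP objectives can be checked numerically. The following is a minimal sketch, not the library's implementation: it assumes a small positive definite Hessian, uses `scipy.optimize.minimize` as a stand-in for a QP solver, and all variable names are illustrative.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

alpha = 0.5  # learning rate
H = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric positive definite Hessian
g = np.array([1.0, -2.0])  # gradient

# Newton direction H^{-1} g, computed from a Cholesky factorization of H.
newton_dir = cho_solve(cho_factor(H), g)
step_unconstrained = -alpha * newton_dir


def qp_hessian(d):
    # Objective used when cho_before_update=False.
    return 0.5 * d @ H @ d + alpha * g @ d


def qp_hessian_grad(d):
    return H @ d + alpha * g


def qp_identity(d):
    # Objective used when cho_before_update=True.
    return 0.5 * d @ d + alpha * newton_dir @ d


def qp_identity_grad(d):
    return d + alpha * newton_dir


d0 = np.zeros(2)

# Without bounds, both objectives are minimized by the unconstrained step.
d_hess = minimize(qp_hessian, d0, jac=qp_hessian_grad).x
d_iden = minimize(qp_identity, d0, jac=qp_identity_grad).x

# With box bounds on the step (theta_lower - theta <= d <= theta_upper - theta),
# a bounded solver is needed; L-BFGS-B stands in for the QP solver here.
bounds = [(-0.1, 0.1), (-0.1, 0.1)]
d_hess_box = minimize(
    qp_hessian, d0, jac=qp_hessian_grad, method="L-BFGS-B", bounds=bounds
).x
d_iden_box = minimize(
    qp_identity, d0, jac=qp_identity_grad, method="L-BFGS-B", bounds=bounds
).x
```

Both objectives share the same unconstrained minimizer, \(-\alpha H^{-1} g\). Once bounds become active, the two constrained solutions need not coincide in general, because the objectives weight the step directions differently.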
Parameters:
learning_rate : float or array or mpcrl.core.schedulers.Scheduler

The learning rate of the optimizer. It can be:

  • a float, in case the learning rate must stay constant and is the same for all learnable parameters

  • an array, in case the learning rate must stay constant but is different for each parameter (should have the same size as the number of learnable parameters)

  • a mpcrl.core.schedulers.Scheduler, in case the learning rate can vary during the learning process (usually, it is set to decay). See the hook argument for more details on when this scheduler is stepped.
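The difference between a scalar and a per-parameter learning rate can be illustrated as follows. This sketch assumes the array is applied element-wise to the Newton direction; the numbers and variable names are illustrative only.

```python
import numpy as np

# Stands in for the Newton direction H^{-1} g of three learnable parameters.
newton_dir = np.array([0.4, -0.8, 0.1])

lr_scalar = 0.5  # same rate for every parameter
lr_array = np.array([0.5, 0.1, 1.0])  # one rate per parameter

step_scalar = -lr_scalar * newton_dir
step_array = -lr_array * newton_dir  # element-wise scaling (assumed)
```

The array form only works when its size matches the number of learnable parameters; otherwise NumPy broadcasting either fails or silently produces a step of the wrong shape.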

weight_decay : float, optional

A non-negative float that specifies the decay of the learnable parameters in the form of an L2 regularization term. By default, it is set to 0.0, so no decay/regularization takes place.
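As an illustration of what an L2 term does to a second-order update, the sketch below assumes the regularizer \(\tfrac{\lambda}{2} \lVert \theta \rVert_2^2\) is added to the loss, which contributes \(\lambda \theta\) to the gradient and \(\lambda I\) to the Hessian. Whether the library applies weight decay in exactly this way is an assumption; the variable names are illustrative.

```python
import numpy as np

weight_decay = 0.1  # lambda in the L2 term (lambda / 2) * ||theta||^2
theta = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])  # gradient of the unregularized loss
H = np.array([[2.0, 0.0], [0.0, 4.0]])  # Hessian of the unregularized loss

# Derivatives of the regularized loss (assumed form of the decay).
g_reg = g + weight_decay * theta
H_reg = H + weight_decay * np.eye(theta.size)

# Newton step on the regularized loss, with unit learning rate.
step = -np.linalg.solve(H_reg, g_reg)
```

With `weight_decay = 0.0` the regularized quantities reduce to `g` and `H`, so the plain Newton step is recovered.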

cho_before_update : bool, optional

Whether to perform a Cholesky factorization of the Hessian in preparation for each update. If False, the quadratic form in the QP objective uses the Hessian matrix; if True, the Newton direction \(H^{-1} g\) is first computed via Cholesky factorization, and the Hessian of the QP is replaced by the identity matrix. Only relevant if the update is constrained. By default, False.

cho_maxiter : int, optional

Maximum number of iterations of the Cholesky factorization with added multiples of the identity, which is used to ensure positive definiteness of the Hessian. By default, 1000.
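The idea behind a Cholesky factorization with added multiples of the identity can be sketched as below. This is a textbook-style illustration, not the code of cholesky_added_multiple_identity: the helper name, the `beta` parameter, and the doubling rule are assumptions made for this example.

```python
import numpy as np


def cholesky_with_added_identity(A, beta=1e-3, maxiter=1000):
    """Hypothetical helper: factorize A + tau * I, growing tau until it succeeds."""
    diag_min = np.diag(A).min()
    tau = 0.0 if diag_min > 0 else beta - diag_min
    identity = np.eye(A.shape[0])
    for _ in range(maxiter):
        try:
            # Succeeds only if A + tau * I is positive definite.
            return np.linalg.cholesky(A + tau * identity), tau
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, beta)
    raise ValueError("Maximum number of iterations reached.")


# An indefinite matrix with eigenvalues 3 and -1.
H_indef = np.array([[1.0, 2.0], [2.0, 1.0]])
L, tau = cholesky_with_added_identity(H_indef)
H_pd = L @ L.T  # equals H_indef + tau * I
```

The iteration cap plays the role that cho_maxiter is described as playing: it bounds how many shifted factorizations are attempted before giving up.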

cho_solve_kwargs : kwargs for scipy.linalg.cho_solve, optional

The optional kwargs to be passed to scipy.linalg.cho_solve to solve linear systems with the Cholesky decomposition of the Hessian. If None, it defaults to cho_solve_kwargs = {"check_finite": False}. Only relevant if no weight decay is given.
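For reference, this is how such keyword arguments reach scipy.linalg.cho_solve in a standalone setting. The matrices are illustrative, and the snippet does not reproduce the optimizer's internals.

```python
import numpy as np
from scipy.linalg import cho_solve

H = np.array([[4.0, 1.0], [1.0, 3.0]])  # positive definite Hessian
g = np.array([1.0, -2.0])  # gradient

# Lower-triangular Cholesky factor, H = L @ L.T.
L = np.linalg.cholesky(H)

# Skipping the finiteness check saves time but assumes clean inputs.
cho_solve_kwargs = {"check_finite": False}
newton_dir = cho_solve((L, True), g, **cho_solve_kwargs)
```

With `check_finite=False`, NaN or infinite entries in the inputs are not caught by SciPy, so the caller is responsible for passing finite data.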

hook : {“on_update”, “on_episode_end”, “on_timestep_end”}, optional

Specifies when to step the scheduler of the optimizer’s learning rate to decay its value. This allows the rate to vary over the learning iterations. The options are:

  • "on_update" steps the learning rate after each agent’s update

  • "on_episode_end" steps the learning rate after each episode’s end

  • "on_timestep_end" steps the learning rate after each env’s timestep.

By default, "on_update" is selected.
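The effect of the hook choice on the decay speed can be seen in a toy training loop. The scheduler class and the loop below are hypothetical stand-ins written for this example; they do not use the library's scheduler or agent classes.

```python
class ToyExponentialScheduler:
    """Hypothetical scheduler: multiplies its value by `factor` at each step."""

    def __init__(self, init_value, factor):
        self.value = init_value
        self.factor = factor

    def step(self):
        self.value *= self.factor


def simulate(hook, episodes=2, timesteps=4, update_every=2):
    """Runs a toy loop and returns the final learning rate for the given hook."""
    lr = ToyExponentialScheduler(1.0, 0.5)
    for _ in range(episodes):
        for t in range(1, timesteps + 1):
            if hook == "on_timestep_end":
                lr.step()
            # Pretend the agent updates every `update_every` timesteps.
            if t % update_every == 0 and hook == "on_update":
                lr.step()
        if hook == "on_episode_end":
            lr.step()
    return lr.value


final_rates = {
    hook: simulate(hook)
    for hook in ("on_update", "on_episode_end", "on_timestep_end")
}
```

In this toy setup the scheduler is stepped 4 times with "on_update", 2 times with "on_episode_end", and 8 times with "on_timestep_end", so the same decay factor yields very different final rates.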

max_percentage_update : float, optional

A positive float that specifies the maximum percentage change the learnable parameters can experience in each update. For example, max_percentage_update=0.5 means that the parameters can be updated by up to 50% of their current value. By default, it is set to +inf. If specified, the update becomes constrained and has to be solved as a QP, which is inevitably slower than its unconstrained counterpart (a linear system).
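One way to picture the resulting constraint is as a box on the step \(\Delta\theta\) that combines the parameter bounds with the percentage limit. The construction below is an assumption made for illustration, not the library's code, and the variable names are hypothetical.

```python
import numpy as np

theta = np.array([2.0, -4.0, 10.0])
theta_lower = np.array([0.0, -5.0, 0.0])
theta_upper = np.array([2.5, 5.0, 20.0])
max_percentage_update = 0.5

# Largest allowed absolute change per parameter: 50% of its current magnitude.
max_delta = max_percentage_update * np.abs(theta)

# Bounds on the step: the tighter of the parameter bounds and the percentage limit.
lb = np.maximum(theta_lower - theta, -max_delta)
ub = np.minimum(theta_upper - theta, max_delta)
```

In this sketch a parameter that is currently exactly zero would get a zero-width interval, so a practical implementation would need some safeguard for that case.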

bound_consistency : bool, optional

A boolean that, if True, forces the learnable parameters to lie in their bounds when updated. This is done via numpy.clip. Only beneficial if numerical issues arise during updates, e.g., due to the QP solver not being able to guarantee bounds.
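The clipping step amounts to the following, shown here on illustrative values where a solver overshoots an upper bound by a tiny amount.

```python
import numpy as np

theta_lower = np.array([0.0, -1.0])
theta_upper = np.array([1.0, 1.0])

# Updated parameters; the first one violates its upper bound by 1e-9.
theta_new = np.array([1.0 + 1e-9, -0.3])

# Project back onto the box [theta_lower, theta_upper].
theta_consistent = np.clip(theta_new, theta_lower, theta_upper)
```

Values already inside the bounds are left unchanged, so the clipping only affects parameters that ended up slightly outside.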

Methods

set_learnable_parameters(pars)

Makes the optimization class aware of the dictionary of the learnable parameters whose values are to be updated.

step(*_, **__)

Steps/decays the learning rate according to its scheduler.

update(gradient[, hessian])

Computes the gradient-based update of the learnable parameters dictated by the current RL algorithm.

Attributes

hook

Gets the hook to which the scheduler is attached, i.e., when to step the learning rate’s scheduler to decay its value.

order

Gets the order of the optimizer: 1 for first-order, 2 for second-order.

property hook: str | None

Gets the hook to which the scheduler is attached, i.e., when to step the learning rate’s scheduler to decay its value.

Returns:
optional str

The hook to which the scheduler is attached. Can be None if no hook is needed (e.g., no scheduler was passed as learning_rate).

property order: Literal[1, 2]

Gets the order of the optimizer: 1 for first-order, 2 for second-order.

Returns:
1 or 2

The order of the optimizer.

set_learnable_parameters(pars)

Makes the optimization class aware of the dictionary of the learnable parameters whose values are to be updated.

Parameters:
pars : mpcrl.LearnableParametersDict

The dictionary of the learnable parameters.

Return type:

None

step(*_, **__)

Steps/decays the learning rate according to its scheduler.

Return type:

None

update(gradient, hessian=None)

Computes the gradient-based update of the learnable parameters dictated by the current RL algorithm.

Parameters:
gradient : 1D array

The gradient of the learnable parameters.

hessian : 2D array, optional

The Hessian of the learnable parameters. When the optimizer is first-order, it is expected to be None since it is unused. When the optimizer is second-order, it is expected to be a 2D array.

Returns:
status : str, optional

An optional string containing the status of the update, e.g., the status of the QP solver, if used.

Return type:

Optional[str]

Examples using mpcrl.optim.NewtonMethod

Off-policy Q-learning

On-policy Q-learning