Weighted Sum of Errors
Usage Guide
To use the weighted sum of errors, import the class, initialize an object, and pass it to the environment.
import gym_electric_motor as gem
from gym_electric_motor.reward_functions import WeightedSumOfErrors
# initialize the reward function
wse = WeightedSumOfErrors(
    reward_weights=dict(i_a=1, i_e=2),  # Current control problem. Tracking of i_e is rewarded more strongly.
    reward_power=2,  # Squared error
    # Alternative: reward_power=dict(i_a=1, i_e=0.5) for absolute error on i_a and root error on i_e
    bias='positive',  # Shift the reward range from negative to positive values
    violation_reward=-250,  # Self-defined violation reward
    gamma=0.9,  # Ignored if a violation_reward is defined
    normed_reward_weights=False  # Otherwise, the weights are normalized automatically to sum up to 1
)
# pass it to the environment
env = gem.make('my-env-id-v0', reward_function=wse)
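Once the environment is created, the configured reward is returned at every simulation step. The following minimal interaction sketch reuses the placeholder environment id from above and assumes the classic gym-style step signature; the exact return values can differ between gym/gymnasium and gym-electric-motor versions.
# Hedged sketch: placeholder environment id, gym-style API assumed.
state, reference = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # replace with a controller or trained agent
    (state, reference), reward, done, _ = env.step(action)
    # 'reward' is the weighted sum of errors configured in 'wse' above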
API Documentation
- class gym_electric_motor.reward_functions.weighted_sum_of_errors.WeightedSumOfErrors(reward_weights=None, normed_reward_weights=False, violation_reward=None, gamma=0.9, reward_power=1, bias=0.0)[source]
A reward function that calculates the reward as the weighted sum of errors with a certain power.
\[r_\mathrm{wse} = - \sum_i w_i \left(\frac{|s_i - s^*_i|}{l_i}\right)^{n_i} + b\]
- Notation:
\(r_\mathrm{wse}\): Weighted sum of error reward
\(w_{i}\): Reward weight of state \(i\)
\(s_{i}\): State value of state \(i\)
\(s^*_{i}\): Reference value of state \(i\)
\(l_{i}\): State length of state \(i\)
\(n_{i}\): Reward power of state \(i\)
\(b\): Bias
\(l_i = 1\) for states with only positive values. \(l_i = 2\) for states with positive and negative values.
If the environment's constraints are violated to a certain degree, a special violation reward is returned as follows:
\[r_{total} = (1.0 - d_{violation}) r_{wse} + d_{violation} r_{violation}\]
- Notation:
\(r_{total}\): Total reward
\(r_{wse}\): Weighted sum of error reward
\(r_{violation}\): Constraint violation reward
\(d_{violation}\): Limit violation degree \(d_{violation} \in [0,1]\)
The violation reward can be chosen freely by the user and is meant to punish the agent for violating the constraints. By default, the violation reward is selected so that it is always the worst expected reward the agent could get.
\[r_{violation} = \frac{r_{wse,min}}{1 - \gamma}\]
Here, \(r_{wse,min}\) is the minimal \(r_{wse}\) (= reward_range[0]) and \(\gamma\) is the agent's discount factor.
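To make the formulas above concrete, the following standalone NumPy sketch evaluates both equations for two example states. All numeric values are chosen purely for illustration and are not taken from a GEM simulation.
import numpy as np

w = np.array([1 / 3, 2 / 3])    # reward weights w_i (normalized, sum to 1)
n = np.array([2.0, 2.0])        # reward powers n_i
l = np.array([2.0, 2.0])        # state lengths l_i (states with positive and negative values)
s = np.array([0.30, -0.10])     # state values s_i
s_ref = np.array([0.50, 0.00])  # reference values s*_i
b = 0.0                         # bias

r_wse = -np.sum(w * (np.abs(s - s_ref) / l) ** n) + b

# Default violation reward: r_wse_min / (1 - gamma).
# With normalized weights, zero bias and |s_i - s*_i| <= l_i, the minimal r_wse is -1.
gamma = 0.9
r_wse_min = -1.0
r_violation = r_wse_min / (1 - gamma)

d_violation = 0.2  # example violation degree in [0, 1]
r_total = (1.0 - d_violation) * r_wse + d_violation * r_violation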
- Parameters:
reward_weights (dict/list/ndarray(float)) – Dict mapping state names to reward weights; states that are not listed receive weight 0. Alternatively, an array with the reward weights at the positions of the corresponding state names.
normed_reward_weights (bool) – If True, the reward weights will be normalized to sum up to 1.
violation_reward (None/float) –
The punishment reward if constraints have been violated.
None (default): The violation reward is calculated as described above.
float: This value is taken as the limit violation reward.
gamma (float in [0.0, 1.0]) – Discount factor for the reward punishment. Should equal the agent's discount factor gamma. Used only if violation_reward=None.
reward_power (dict/list(float)/float) – Reward power for each of the system's states (see the configuration sketch after this parameter list).
bias (float/'positive') –
Additional bias that is added to the reward.
float: The value that is added to the reward.
'positive': The bias is selected so that the minimal reward is zero and all other rewards are positive.
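As a further illustration of the parameter formats, the sketch below combines a per-state reward power (dict form) with a constant float bias. The state names i_a and i_e are the ones used in the usage guide above and must exist in the chosen environment's state for the weights to take effect.
from gym_electric_motor.reward_functions import WeightedSumOfErrors

# Hedged sketch: the state names must match the states of the chosen environment.
wse = WeightedSumOfErrors(
    reward_weights=dict(i_a=1, i_e=1),
    normed_reward_weights=True,       # weights are normalized to sum up to 1
    reward_power=dict(i_a=2, i_e=1),  # squared error on i_a, absolute error on i_e
    bias=0.1,                         # constant value added to every reward
    violation_reward=None,            # default: r_wse_min / (1 - gamma)
    gamma=0.99,                       # used only because violation_reward is None
)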
- close()
Called when the environment is closed to store logs, close files, etc.
- reset(initial_state=None, initial_reference=None)
This function is called by the environment on reset.
Inner states of the reward function can be reset here, if necessary.
- Parameters:
initial_state (ndarray(float)) – Initial state array of the environment.
initial_reference (ndarray(float)) – Initial reference array of the environment.
- reward(state, reference, k=None, action=None, violation_degree=0.0)[source]
Reward calculation. If limits have been violated the reward is calculated with a separate function.
- Parameters:
state (ndarray(float)) – The environment's state array.
reference (ndarray(float)) – The environment's reference array.
k (int) – The system's current time step.
action (element of action space) – The previously taken action.
violation_degree (float in [0.0, 1.0]) – Degree of violation of the constraints. 0.0 indicates that all constraints are satisfied. 1.0 indicates that the constraints have been violated so severely that a reset is necessary.
- Returns:
Reward for this state, reference, action tuple.
- Return type:
float
- reward_range = (-inf, inf)
Lower and upper possible reward
- Type:
Tuple(int,int)
- set_modules(physical_system, reference_generator, constraint_monitor)[source]
Sets the physical system, reference generator, and constraint monitor so that internal state arrays fit the environment's states.
- Parameters:
physical_system (PhysicalSystem) – The physical system of the environment
reference_generator (ReferenceGenerator) – The reference generator of the environment.
constraint_monitor (ConstraintMonitor) – The constraint monitor of the environment.