Value functions
A policy defines the rule for choosing actions when interacting with the environment, and the corresponding state-value function is defined as the expected return (the sum of discounted future rewards) we can collect after reaching a state:

$$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$
Similarly, the action-value function can be defined as

$$q_\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$
A recursive definition of the state-value function can be derived as follows:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$$

This recursive definition is called the Bellman equation for $v_\pi$.
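For reference, here is a short sketch of where this recursion comes from, using the usual decomposition of the return $G_t = R_{t+1} + \gamma G_{t+1}$ (this intermediate step is not spelled out in the original text):

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi\left[G_t \mid S_t = s\right]
          = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
            \left[r + \gamma\, \mathbb{E}_\pi\left[G_{t+1} \mid S_{t+1} = s'\right]\right] \\
         &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)
            \left[r + \gamma\, v_\pi(s')\right]
\end{aligned}
$$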
Optimal Policies and Optimal Value Functions
A policy $\pi$ is said to be better than or equal to a policy $\pi'$ if $v_\pi(s) \ge v_{\pi'}(s)$ for every state $s$. A policy that is better than or equal to all other policies is an optimal policy, denoted $\pi_*$, and its state-value function is the optimal state-value function

$$v_*(s) = \max_\pi v_\pi(s)$$
Similarly, the optimal action-value function can be defined as

$$q_*(s, a) = \max_\pi q_\pi(s, a)$$
The relationship between $v_*$ and $q_*$ can be written as

$$q_*(s, a) = \mathbb{E}\left[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s, A_t = a\right] = \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right], \qquad v_*(s) = \max_a q_*(s, a)$$
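Combining the two relations above gives the Bellman optimality equation for $v_*$, included here as a short worked step (it follows directly from the definitions; the original text does not state it explicitly):

$$v_*(s) = \max_a q_*(s, a) = \max_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_*(s')\right]$$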
Dynamic Programming
Policy Evaluation
Policy evaluation is the process of computing the state-value function $v_\pi$ for a given policy $\pi$. Iterative policy evaluation does this by repeatedly sweeping over all states and applying the Bellman equation as an update rule, until the value estimates change by less than a small threshold $\theta$.
Policy Evaluation in Python:
# V here should be an object representing the value function to be estimated,
# with __getitem__ and __setitem__ implemented; V[terminal] should always
# return 0.
# S should be the collection of all possible states that can appear in the
# environment.
# PI should be a callable that, given a state, returns the actions available in
# that state and their respective probabilities of being chosen under the
# current policy.
# T should be a callable representing the transition function of the environment:
# it takes a state and the chosen action, and returns the possible subsequent
# states, the possible rewards of the action taken, and the probability of each
# state-reward pair.
def policy_eval(V, S, PI, T, theta, gamma):
    while True:
        delta = 0
        for s in S:
            v = V[s]
            temp = 0
            for a, pa in PI(s):
                for (s_p, r), psr in T(s, a):
                    temp += pa * psr * (r + gamma * V[s_p])
            V[s] = temp
            delta = max(delta, abs(v - temp))
        if delta < theta:
            break
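A minimal sketch of how this function might be called. The 1-D grid world, the uniform random policy, and the dict-backed value function below are illustrative assumptions, not part of the original text.

# A tiny 1-D grid world with states 0..4, where 0 and 4 are terminal.
STATES = [1, 2, 3]            # non-terminal states swept by policy_eval

def pi(s):
    # Uniform random policy: step left or right with equal probability.
    return [(-1, 0.5), (+1, 0.5)]

def transition(s, a):
    # Deterministic transitions; reward -1 per step until a terminal state.
    return [((s + a, -1), 1.0)]

# A plain dict satisfies the __getitem__/__setitem__ requirement; terminal
# states are pre-filled with 0 and never updated because they are not in STATES.
V = {s: 0.0 for s in range(5)}

policy_eval(V, STATES, pi, transition, theta=1e-6, gamma=1.0)
print(V)   # expected roughly {0: 0.0, 1: -3.0, 2: -4.0, 3: -3.0, 4: 0.0}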
Policy Improvement
Given the original policy $\pi$ and its state-value function $v_\pi$, we can define a new greedy policy $\pi'$ that, for a state $s$, picks the action that looks best one step ahead according to $v_\pi$:

$$\pi'(s) = \operatorname*{arg\,max}_a q_\pi(s, a) = \operatorname*{arg\,max}_a \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma\, v_\pi(s')\right]$$

By applying this argument at every state, we can show that the policy entirely defined by the greedy formula shown above is at least as good as the original policy under the current value function; if it is no better, then $v_{\pi'} = v_\pi$ and both policies are already optimal.
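A sketch of this greedy improvement step, written against the same V and T interfaces used by policy_eval above. The function name policy_improve, the callable A returning the available actions, and the dict-based deterministic policy it returns are assumptions made for illustration, not part of the original text.

# A should be a callable that returns the actions available in a given state.
def policy_improve(V, S, A, T, gamma):
    new_policy = {}
    for s in S:
        best_a, best_q = None, float("-inf")
        for a in A(s):
            # One-step lookahead: estimate q_pi(s, a) from the current V.
            q = 0
            for (s_p, r), psr in T(s, a):
                q += psr * (r + gamma * V[s_p])
            if q > best_q:
                best_a, best_q = a, q
        new_policy[s] = best_a
    return new_policy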