
Value Functions

A policy $\pi$ defines the rule for choosing actions when interacting with the environment, and the corresponding state-value function $v_\pi(s)$ is defined as the expected return we can obtain in the future after reaching state $s$:

$$v_\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Similarly, the action-value function $q_\pi(s, a)$ can be defined as

$$q_\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right]$$

A recursive definition of the state-value function can be derived as follows:

$$v_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$

This recursive definition is called the Bellman equation for $v_\pi$.
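
As a quick numerical illustration (the two-state chain, rewards and transition probabilities below are made up for this example), repeatedly applying the right-hand side of the Bellman equation to an arbitrary initial guess converges to $v_\pi$:

gamma = 0.9

# p[s] lists (next_state, reward, probability) triples under the fixed policy
# for this made-up two-state example.
p = {
    0: [(0, 1.0, 0.5), (1, 0.0, 0.5)],
    1: [(0, 2.0, 1.0)],
}

v = {0: 0.0, 1: 0.0}
for _ in range(200):
    # One synchronous Bellman backup for every state.
    v = {s: sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[s]) for s in p}

print(v)  # each v[s] now satisfies v[s] = sum over (s', r) of prob * (r + gamma * v[s'])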

Optimal Policies and Optimal Value Functions

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ iff $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$. An optimal policy $\pi_*$ can be defined as a policy that is not worse than any other policy. The state-value function corresponding to an optimal policy, the optimal state-value function $v_*$, can be defined as

$$v_*(s) = \max_{\pi} v_\pi(s)$$

The optimal action-value function can be defined as

$$q_*(s, a) = \max_{\pi} q_\pi(s, a)$$

The relationship between $v_*$ and $q_*$ can be written as

$$v_*(s) = \max_{a} q_*(s, a), \qquad q_*(s, a) = \mathbb{E}\!\left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\right]$$
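
Substituting the second expression into the first removes the explicit maximization over policies and gives the Bellman optimality equation for $v_*$:

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$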

Dynamic Programming

Policy Evaluation

Policy evaluation is the process of computing the state-value function $v_\pi$ for an arbitrary policy $\pi$. $v_\pi$ satisfies the Bellman equation for policy $\pi$, so if the dynamics of the environment are completely known, $v_\pi$ can be computed by solving the large linear system that the Bellman equation defines, one equation per state. In practice, an iterative approach is usually preferred: $v_\pi$ is the fixed point of the Bellman update, so repeatedly sweeping over the states and applying the update converges to it. The initial approximation can be chosen arbitrarily (except that terminal states, if any, must have value 0).
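
For a small environment the direct solve is straightforward. Here is a minimal sketch, assuming the policy-induced transition matrix and expected rewards have already been collected into numpy arrays (the names P_pi and r_pi, and the 3-state numbers, are made up for this illustration):

import numpy as np

# Made-up 3-state example: P_pi[s, s'] is the probability of moving from s to s'
# under the policy pi, and r_pi[s] is the expected immediate reward from state s.
# State 2 is a terminal (absorbing) state with zero reward.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# In matrix form the Bellman equation reads v = r_pi + gamma * P_pi @ v,
# i.e. (I - gamma * P_pi) v = r_pi, a plain linear system.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v)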

Policy Evaluation in Python:

# V here should be an object representing the value function being evaluated, with
# __getitem__ and __setitem__ implemented. V[terminal] should always return 0.

# S should be the collection of all possible states that can appear in the
# environment

# PI should be a callable that returns the actions available in a state and
# their respective probabilities of being chosen under the current policy

# T should be a callable representing the transition function of the environment;
# it should take in a state and the chosen action and return all the possible
# subsequent states, the possible rewards of taking that action, and the
# probability of each state-reward pair.

# theta is the convergence threshold for a sweep, and gamma is the discount factor.
def policy_eval(V, S, PI, T, theta, gamma):
    while True:
        delta = 0
        for s in S:
            v = V[s]
            temp = 0
            for a, pa in PI(s):
                for (s_p, r), psr in T(s, a):
                    temp += pa * psr * (r + gamma * V[s_p])
            V[s] = temp
            delta = max(delta, abs(v - temp))
        if delta < theta:
            break
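
As a usage sketch, the following wires policy_eval to a made-up two-state problem with a single dummy action; the dict-based V and the PI and T callables below are illustrative stand-ins for whatever the actual environment provides:

# Hypothetical toy problem: states 0 and 1, one action "a". From state 0 the
# agent moves to state 1 with reward 1; state 1 is terminal (self-loop, reward 0).
S = [0, 1]
V = {0: 0.0, 1: 0.0}            # a plain dict already supports __getitem__/__setitem__

def PI(s):                      # a single action taken with probability 1
    return [("a", 1.0)]

def T(s, a):                    # list of ((next_state, reward), probability)
    if s == 0:
        return [((1, 1.0), 1.0)]
    return [((1, 0.0), 1.0)]

policy_eval(V, S, PI, T, theta=1e-8, gamma=0.9)
print(V)  # V[0] converges to 1.0, V[1] stays 0.0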

Policy Improvement

Given the original policy $\pi$ and its evaluated value function $v_\pi$, it is easy to prove that we can get a better or equally good policy $\pi'$ if we have

$$\pi'(s) = \operatorname*{arg\,max}_{a} q_\pi(s, a) = \operatorname*{arg\,max}_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$

for a state $s$, and $\pi'(s') = \pi(s')$ for all other states $s'$.

By induction, we can show that the policy defined entirely by the greedy formula shown above, applied at every state, is at least as good as $\pi$; it is the best policy we can obtain under the current value function $v_\pi$.
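
A minimal sketch of this greedy improvement step, reusing the V and T conventions from policy_eval above; the function name policy_improve and the callable A (which returns the actions available in a state, something PI alone does not expose) are introduced just for this illustration:

def policy_improve(V, S, A, T, gamma):
    # For every state, pick the action that maximizes the one-step lookahead
    # value under the current value function V.
    new_policy = {}
    for s in S:
        best_a, best_q = None, float("-inf")
        for a in A(s):
            q = sum(psr * (r + gamma * V[s_p]) for (s_p, r), psr in T(s, a))
            if q > best_q:
                best_a, best_q = a, q
        new_policy[s] = best_a
    return new_policy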
