
Value Functions

A policy $\pi$ defines the rule for choosing actions when interacting with the environment, and the corresponding state-value function $v_\pi(s)$ is defined as the expected return we can obtain in the future after reaching state $s$:

$$v_\pi(s) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

Similarly, the action-value function $q_\pi(s, a)$ can be defined as

$$q_\pi(s, a) = \mathbb{E}_\pi\!\left[G_t \mid S_t = s, A_t = a\right]$$

A recursive definition of the state-value function can be derived as follows:

$$v_\pi(s) = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$

This recursive definition is called the Bellman equation for $v_\pi$.
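
As a quick numerical illustration (the two-state chain, rewards and transition probabilities below are made up for this example), repeatedly applying the right-hand side of the Bellman equation to an arbitrary initial guess converges to $v_\pi$:

gamma = 0.9

# p[s] lists (next_state, reward, probability) triples under the fixed policy
# for this made-up two-state example.
p = {
    0: [(0, 1.0, 0.5), (1, 0.0, 0.5)],
    1: [(0, 2.0, 1.0)],
}

v = {0: 0.0, 1: 0.0}
for _ in range(200):
    # One synchronous Bellman backup for every state.
    v = {s: sum(prob * (r + gamma * v[s2]) for s2, r, prob in p[s]) for s in p}

print(v)  # each v[s] now satisfies v[s] = sum over (s', r) of prob * (r + gamma * v[s'])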

Optimal Policies and Optimal Value Functions

A policy $\pi$ is defined to be better than or equal to a policy $\pi'$ if its expected return is greater than or equal to that of $\pi'$ for all states. In other words, $\pi \geq \pi'$ iff $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$. An optimal policy $\pi_*$ can be defined as a policy that is not worse than any other policy. The state-value function corresponding to an optimal policy, the optimal state-value function $v_*$, can be defined as

$$v_*(s) = \max_{\pi} v_\pi(s)$$

The optimal action-value function can be defined as

$$q_*(s, a) = \max_{\pi} q_\pi(s, a)$$

The relationship between $v_*$ and $q_*$ can be written as

$$v_*(s) = \max_{a} q_*(s, a), \qquad q_*(s, a) = \mathbb{E}\!\left[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a\right]$$
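
Substituting the second expression into the first removes the explicit maximization over policies and gives the Bellman optimality equation for $v_*$:

$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_*(s')\right]$$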

Dynamic Programming

Policy Evaluation

Policy evaluation is the process of computing the state-value function $v_\pi$ for an arbitrary policy $\pi$. $v_\pi$ satisfies the Bellman equation for policy $\pi$, so if the dynamics of the environment are completely known, $v_\pi$ can be computed by solving the large linear system that the Bellman equation defines, one equation per state. In practice, an iterative approach is usually preferred: $v_\pi$ is the fixed point of the Bellman update, so repeatedly sweeping over the states and applying the update converges to it. The initial approximation can be chosen arbitrarily (except that terminal states, if any, must have value 0).
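
For a small environment the direct solve is straightforward. Here is a minimal sketch, assuming the policy-induced transition matrix and expected rewards have already been collected into numpy arrays (the names P_pi and r_pi, and the 3-state numbers, are made up for this illustration):

import numpy as np

# Made-up 3-state example: P_pi[s, s'] is the probability of moving from s to s'
# under the policy pi, and r_pi[s] is the expected immediate reward from state s.
# State 2 is a terminal (absorbing) state with zero reward.
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0]])
r_pi = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# In matrix form the Bellman equation reads v = r_pi + gamma * P_pi @ v,
# i.e. (I - gamma * P_pi) v = r_pi, a plain linear system.
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v)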

Policy Evaluation in Python:

# V here should be an object representing the value function being evaluated, with
# __getitem__ and __setitem__ implemented. V[terminal] should always return 0.

# S should be the collection of all possible states that can appear in the
# environment

# PI should be a callable that returns the actions available in a state and
# their respective probabilities of being chosen under the current policy

# T should be a callable representing the transition function of the environment;
# it should take in a state and the chosen action and return all the possible
# subsequent states, the possible rewards of taking that action, and the
# probability of each state-reward pair.

# theta is the convergence threshold for a sweep, and gamma is the discount factor.
def policy_eval(V, S, PI, T, theta, gamma):
    while True:
        delta = 0
        for s in S:
            v = V[s]
            temp = 0
            for a, pa in PI(s):
                for (s_p, r), psr in T(s, a):
                    temp += pa * psr * (r + gamma * V[s_p])
            V[s] = temp
            delta = max(delta, abs(v - temp))
        if delta < theta:
            break
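
As a usage sketch, the following wires policy_eval to a made-up two-state problem with a single dummy action; the dict-based V and the PI and T callables below are illustrative stand-ins for whatever the actual environment provides:

# Hypothetical toy problem: states 0 and 1, one action "a". From state 0 the
# agent moves to state 1 with reward 1; state 1 is terminal (self-loop, reward 0).
S = [0, 1]
V = {0: 0.0, 1: 0.0}            # a plain dict already supports __getitem__/__setitem__

def PI(s):                      # a single action taken with probability 1
    return [("a", 1.0)]

def T(s, a):                    # list of ((next_state, reward), probability)
    if s == 0:
        return [((1, 1.0), 1.0)]
    return [((1, 0.0), 1.0)]

policy_eval(V, S, PI, T, theta=1e-8, gamma=0.9)
print(V)  # V[0] converges to 1.0, V[1] stays 0.0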

Policy Improvement

Given the original policy $\pi$ and its evaluated value function $v_\pi$, it is easy to prove that we can get a better or equally good policy $\pi'$ if we have

$$\pi'(s) = \operatorname*{arg\,max}_{a} q_\pi(s, a) = \operatorname*{arg\,max}_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]$$

for a state $s$, and $\pi'(s') = \pi(s')$ for all other states $s'$.

By induction, we can show that the policy defined entirely by the greedy formula shown above, applied at every state, is at least as good as $\pi$; it is the best policy we can obtain under the current value function $v_\pi$.
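
A minimal sketch of this greedy improvement step, reusing the V and T conventions from policy_eval above; the function name policy_improve and the callable A (which returns the actions available in a state, something PI alone does not expose) are introduced just for this illustration:

def policy_improve(V, S, A, T, gamma):
    # For every state, pick the action that maximizes the one-step lookahead
    # value under the current value function V.
    new_policy = {}
    for s in S:
        best_a, best_q = None, float("-inf")
        for a in A(s):
            q = sum(psr * (r + gamma * V[s_p]) for (s_p, r), psr in T(s, a))
            if q > best_q:
                best_a, best_q = a, q
        new_policy[s] = best_a
    return new_policy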
