Markov decision process

Example: Grid World

Example environment that will be used frequently: 80% of the time the agent takes the desired action, 10% of the time it goes left of the desired action, and 10% of the time it goes right. The agent receives a small “living” reward for each step, and a big reward (positive or negative) at the end of the game.
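A minimal sketch of the noisy action model described above (the 80/10/10 split is from the notes; the direction encoding and function names are assumptions for illustration):

```python
import random

# Grid directions ordered clockwise: N, E, S, W (encoding assumed for illustration).
DIRECTIONS = ["N", "E", "S", "W"]

def noisy_action(desired: str) -> str:
    """With probability 0.8 take the desired action; with probability 0.1 each,
    slip to the direction 90 degrees left or right of the desired one."""
    i = DIRECTIONS.index(desired)
    left = DIRECTIONS[(i - 1) % 4]
    right = DIRECTIONS[(i + 1) % 4]
    return random.choices([desired, left, right], weights=[0.8, 0.1, 0.1])[0]
```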

Markov Decision Processes

A Markov decision process (MDP) is a 4-tuple $(\mathcal{S}, \mathcal{A}, P_a, R_a)$ where

  • $\mathcal{S}$: finite set of possible states, known as the state space

  • $\mathcal{A}$: finite set of possible actions, known as the action space; if actions are conditional on state, the notation $\mathcal{A}_s$ indicates the set of actions available from state $s$

  • $P_a$: probabilistic state transition function $P_a: \mathcal{S} \times \mathcal{S} \mapsto [0,1]$, giving the probability that taking action $a$ in state $s$ at time $t$ will result in state $s^{\prime}$ at time $t+1$:

    $P_a(s,s^{\prime}) = P(s_{t+1}=s^{\prime} \mid s_t=s, a_t=a)$

    Note that $P_a$ is a family of functions indexed by the actions $a \in \mathcal{A}$ one might want to reason about; in its general form $P_a$ is defined without a specific $a$, and it only becomes a concrete function once a value for $a$ is fixed. Similarly, plotting $P_a$ on its own is not meaningful (as discussed previously when interpreting conditional probability densities); you would need to combine it with a prior over actions in order to recover the joint density. Another common formulation of the transition function is $T(s,a,s^\prime) = P(s^\prime \mid s,a)$, which encodes the action $a$ explicitly in the function arguments.

  • $R_a$: immediate reward received for transitioning from state $s$ to $s^{\prime}$ by action $a$; commonly defined as a function $R_a: \mathcal{S} \times \mathcal{S} \mapsto \mathbb{R}$

  • A start state $s_0 \in \mathcal{S}$ and possibly a terminal state (a minimal sketch of a concrete MDP follows the list)
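As a concrete illustration, here is one way the tuple above might be represented in code. This is a tiny two-state toy MDP; all names and numbers are illustrative assumptions, not from the notes. The later sketches in these notes assume this same dictionary-based representation.

```python
# A tiny two-state MDP represented as plain Python data structures.
states = ["s0", "s1"]           # state space S
actions = ["stay", "move"]      # action space A

# Transition function T(s, a, s') = P(s' | s, a), stored as nested dicts.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# Reward function R(s, a, s'): a big reward for one transition,
# a small "living" reward everywhere else (values assumed).
R = {("s0", "move", "s1"): 1.0}

def reward(s, a, s_next):
    return R.get((s, a, s_next), -0.05)

start_state = "s0"
```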

Markov Property

MDPs have the Markov property, which says that the current state tells you everything that can be known about the future from the past. That is, given the current state, the future and the past are independent. We can typically make this hold by pushing enough information into the state so that it encodes all relevant information from the past.

Policies

An agent implements a decision-making process according to a policy function $\pi$, which maps (action $a$, state $s$) pairs to the probability that the agent takes action $a$ when in state $s$. That is, $\pi: \mathcal{A} \times \mathcal{S} \mapsto [0,1]$, where

$\pi(a,s) = P(a_t = a \mid s_t = s)$

This function completely describes the agent’s beliefs about the actions it should take in any state. Note that $p(a \mid s_t=s)$ is a distribution over the action space $\mathcal{A}$; when actually making a decision in state $s$, the agent samples an action from this distribution. This formulation is therefore extremely general, allowing both stochastic policies and deterministic ones (by allocating all probability mass to one action). A small sampling sketch follows.
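A minimal sketch of sampling from a stochastic policy stored per state (the dictionary representation and state/action names are assumptions for illustration):

```python
import random

# A stochastic policy pi(a, s): for each state, a distribution over actions.
policy = {
    "s0": {"stay": 0.1, "move": 0.9},
    "s1": {"stay": 1.0},              # deterministic: all mass on one action
}

def act(state: str) -> str:
    """Sample an action from pi(. | s)."""
    dist = policy[state]
    acts, probs = zip(*dist.items())
    return random.choices(acts, weights=probs)[0]
```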

Quantities

  • Policies can also be seen as a map from states to actions
  • We want to maximize the sum of rewards while prioritizing rewards available now over those available later (there is greater uncertainty about the future)
  • One way to encode this is to define the utility as the sum of future rewards, discounted by a rate $\gamma$ for each step into the future (written out below the list)
  • Values are the expected future utility from a state (a max node)
  • Q-values are the expected future utility from a Q-state (a chance node)
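Written out, the discounted utility of a reward sequence $r_0, r_1, r_2, \dots$ is (a standard formulation stated here for concreteness; the notation is an assumption, not from the original notes):

$U = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^{\infty} \gamma^t r_t, \quad 0 \le \gamma \le 1$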

Optimal Quantities

  • The value of a state is $V^*(s) =$ expected utility starting from state $s$ and acting optimally from that point forward
  • The value of a Q-state is $Q^*(s,a) =$ expected utility starting out taking action $a$ from state $s$ and acting optimally from that point forward (like the regular value, but it factors in an action $a$)
  • The optimal policy $\pi^*(s) =$ the optimal action from state $s$

Value of states

Recursive definition of value, giving the expected utility under optimal action. This is what expectimax computed, except that it wasted a lot of computation by not reusing previously computed values. Note also that so long as $\gamma < 1$, deep parts of the tree will eventually not matter, giving a “natural” depth limit:

$V^*(s) = \max_a Q^*(s,a)$

$Q^*(s,a) = \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V^*(s^\prime)\big]$

$V^*(s) = \max_a \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V^*(s^\prime)\big]$

Time Limited Values

Define $V_k(s)$ to be the optimal value of state $s$ if the game ends in $k$ additional time steps. This is equivalent to a depth-$k$ expectimax computation.

Bellman Equations

The definition of the “optimal utility” under the expectimax recurrence provides a one-step lookahead relationship between optimal utility values. These relationships are described by the Bellman equations, written out above. The simple one-step lookahead is all that’s needed to formulate a recursive computation.

Value Iteration

Value iteration is a fixed point solution method that just computes the optimal values directly according to the Bellman equations.

  • Start with $V_0(s) = 0$: with no time steps left, the expected reward sum is zero
  • Then do one ply of expectimax from each state, given the vector of $V_k(s)$ values: $V_{k+1}(s) \leftarrow \max_a \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V_k(s^\prime)\big]$
  • This is repeated until convergence; each iteration has complexity $O(S^2 A)$
  • These values will converge to the unique optimal values as we increase $k$; real rewards at the bottom of the tree are far out and discounted increasingly heavily as $k$ gets larger
  • This method is quite intuitive: it defines a base case at $V_0$ and works up from there, plugging the values into the equation for each additional step (a sketch follows the list)
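A minimal sketch of value iteration under the dictionary-based MDP representation sketched earlier (the function signature, fixed iteration count, and variable names are assumptions for illustration; in practice you would iterate until the values stop changing):

```python
def value_iteration(states, actions, T, reward, gamma=0.9, iterations=100):
    """Compute V*(s) by repeatedly applying the Bellman update."""
    V = {s: 0.0 for s in states}  # V_0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in states:
            # One ply of expectimax: max over actions of the expected discounted value.
            V_new[s] = max(
                sum(p * (reward(s, a, s2) + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions
            )
        V = V_new
    return V
```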

Policy Methods

  • Fixed policies: a function $\pi(s)$ mapping from states to actions. Under a fixed policy you just select the action returned by the policy, which simplifies the game tree: there is only one branch from each max node, but when the environment is stochastic that action can still take you to a number of different states.
    • The utility of a state $s$ under a fixed policy $\pi$ is $V^\pi(s) = \sum_{s^\prime} T(s,\pi(s),s^\prime)\big[R(s,\pi(s),s^\prime) + \gamma V^\pi(s^\prime)\big]$. Note that there is no longer a max over the actions; the fixed policy gives that action.

Policy Evaluation

  • We can calculate the values for a fixed policy $\pi$ by turning the recursion into updates, just like value iteration: start with $V_0^\pi(s) = 0$, then plug in successive $V_k^\pi$’s and increase $k$.
  • We can also just solve the linear system of equations directly now that the maxes are gone (a sketch follows the list).
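A sketch of the direct linear-solve approach: with no max, the fixed-policy values satisfy $V^\pi = R^\pi + \gamma T^\pi V^\pi$, which is linear in $V^\pi$. The numpy usage and names below are illustrative assumptions, matching the earlier dictionary-based representation and a deterministic policy mapping each state to one action.

```python
import numpy as np

def evaluate_policy(states, policy, T, reward, gamma=0.9):
    """Solve (I - gamma * T_pi) V = R_pi for the fixed-policy values V^pi."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    T_pi = np.zeros((n, n))   # transition matrix under the fixed policy
    R_pi = np.zeros(n)        # expected immediate reward under the fixed policy
    for s in states:
        a = policy[s]         # deterministic fixed policy: state -> action
        for s2, p in T[(s, a)].items():
            T_pi[idx[s], idx[s2]] = p
            R_pi[idx[s]] += p * reward(s, a, s2)
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: V[idx[s]] for s in states}
```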

Policy Extraction

  • If we have the optimal values $V^*(s)$, we need to perform a one-step expectimax in order to extract the policy implied by those values: $\pi^*(s) = \arg\max_a \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V^*(s^\prime)\big]$
    • That is, to find the best action to take in the current state, we compute the above sum to get the expected sum of discounted rewards for each action, and then choose the action that yields the maximum.
  • If we have the optimal Q-values $Q^*(s,a)$, it’s very easy to recover the optimal action: $\pi^*(s) = \arg\max_a Q^*(s,a)$. The Q-values already hold the expected utility of each action separately (no maximization over actions has been taken yet), so all that remains is selecting the action with the maximum Q-value (see the sketch after this list).
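A sketch of policy extraction from converged values $V^*(s)$ via a one-step lookahead (names match the earlier sketches and are assumptions for illustration):

```python
def extract_policy(states, actions, T, reward, V, gamma=0.9):
    """pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]."""
    policy = {}
    for s in states:
        # Depth-one expectimax: score each action by its expected discounted value.
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (reward(s, a, s2) + gamma * V[s2])
                              for s2, p in T[(s, a)].items()),
        )
    return policy
```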

Policy Iteration

Policy iteration uses what we saw for fixed policies in policy evaluation, and uses those estimated values to update the policy itself. This takes place according to a two-step process: we first assume a fixed policy and evaluate it, and then use those estimates to reassign optimal actions in the policy.

  • Step 1: Policy evaluation - calculate utilities for some fixed policy until convergence (note these aren’t optimal utilities): $V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s^\prime} T(s,\pi_i(s),s^\prime)\big[R(s,\pi_i(s),s^\prime) + \gamma V_k^{\pi_i}(s^\prime)\big]$
  • Step 2: Policy improvement - update the policy using a one-step lookahead with the converged utilities as future values: $\pi_{i+1}(s) = \arg\max_a \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V^{\pi_i}(s^\prime)\big]$
  • Repeat until the policy converges (a sketch follows the list)
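A sketch of policy iteration that alternates the two steps above, reusing the `evaluate_policy` and `extract_policy` sketches from earlier (all names are assumptions for illustration):

```python
def policy_iteration(states, actions, T, reward, gamma=0.9):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        V = evaluate_policy(states, pi, T, reward, gamma)               # Step 1: evaluation
        new_pi = extract_policy(states, actions, T, reward, V, gamma)   # Step 2: improvement
        if new_pi == pi:
            return pi, V
        pi = new_pi
```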

Summary

  • Value iteration allows us to approximate the Bellman equations by incrementally solving for state values. We work up from the bottom, $V_0(s) = 0$, and iteratively plug these values in until our policy converges.
    • Note that while value iteration does not concern itself with tracking a policy explicitly, we are implicitly recomputing a policy each iteration by taking the maximum over the actions. As we update our state values, an (implicit) policy updates with it.
  • Policy extraction extracts an optimal policy from a given set of state values $V^*(s)$. This method just defines the way that we would go about pulling out the best action by using a depth-one sum on our pre-computed state values.
    • This could in theory be coupled with the value iteration technique to pull out an explicit policy from the converged state value estimates.
  • Policy evaluation evaluates a fixed policy by simply following that policy through the Bellman equations. The policy mapping from states to a single action greatly reduces the complexity of the computation since there is only one branch to follow.
  • Policy iteration explicitly computes an optimal policy by computing state values under a fixed policy using policy evaluation, and then uses the estimated values to explicitly recompute the optimal actions for the policy.
  • Solving MDPs is entirely an offline planning task; you need all the MDP details, and you compute all the values and/or optimal policies before actually acting in the environment or playing the game. Reinforcement learning deals with learning how to solve MDPs by actually acting in an environment when we don’t have all the underlying MDP details beforehand.