Markov decision process

Example: Grid World

Example environment that will be used frequently: 80% of the time the agent takes the desired action, 10% of the time it goes left of the desired action, and 10% of the time it goes right. The agent receives a small “living” reward for each step, and a big reward (positive or negative) at the end of the game.
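A minimal sketch of the noisy action model described above (the 80/10/10 split is from the notes; the direction encoding and function names are assumptions for illustration):

```python
import random

# Grid directions ordered clockwise: N, E, S, W (encoding assumed for illustration).
DIRECTIONS = ["N", "E", "S", "W"]

def noisy_action(desired: str) -> str:
    """With probability 0.8 take the desired action; with probability 0.1 each,
    slip to the direction 90 degrees left or right of the desired one."""
    i = DIRECTIONS.index(desired)
    left = DIRECTIONS[(i - 1) % 4]
    right = DIRECTIONS[(i + 1) % 4]
    return random.choices([desired, left, right], weights=[0.8, 0.1, 0.1])[0]
```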

Markov Decision Processes

A Markov decision process (MDP) is a 4-tuple $(\mathcal{S}, \mathcal{A}, P_a, R_a)$ where

  • $\mathcal{S}$: finite set of possible states, known as the state space

  • $\mathcal{A}$: finite set of possible actions, known as the action space; if actions are conditional on state, the notation $\mathcal{A}_s$ indicates the set of actions available from state $s$

  • $P_a$: probabilistic state transition function $P_a: \mathcal{S} \times \mathcal{S} \mapsto [0,1]$, giving the probability that taking action $a$ in state $s$ at time $t$ will result in state $s^{\prime}$ at time $t+1$:

    $P_a(s,s^{\prime}) = P(s_{t+1}=s^{\prime} \mid s_t=s, a_t=a)$

    Note that $P_a$ is a family of functions indexed by the actions $a \in \mathcal{A}$ one might want to reason about; in its general form $P_a$ is defined without a specific $a$, and it only becomes a concrete function once a value for $a$ is fixed. Similarly, plotting $P_a$ on its own is not meaningful (as discussed previously when interpreting conditional probability densities); you would need to combine it with a prior over actions in order to recover the joint density. Another common formulation of the transition function is $T(s,a,s^\prime) = P(s^\prime \mid s,a)$, which encodes the action $a$ explicitly in the function arguments.

  • $R_a$: immediate reward received for transitioning from state $s$ to $s^{\prime}$ by action $a$; commonly defined as a function $R_a: \mathcal{S} \times \mathcal{S} \mapsto \mathbb{R}$

  • A start state $s_0 \in \mathcal{S}$ and possibly a terminal state (a minimal sketch of a concrete MDP follows the list)
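As a concrete illustration, here is one way the tuple above might be represented in code. This is a tiny two-state toy MDP; all names and numbers are illustrative assumptions, not from the notes. The later sketches in these notes assume this same dictionary-based representation.

```python
# A tiny two-state MDP represented as plain Python data structures.
states = ["s0", "s1"]           # state space S
actions = ["stay", "move"]      # action space A

# Transition function T(s, a, s') = P(s' | s, a), stored as nested dicts.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

# Reward function R(s, a, s'): a big reward for one transition,
# a small "living" reward everywhere else (values assumed).
R = {("s0", "move", "s1"): 1.0}

def reward(s, a, s_next):
    return R.get((s, a, s_next), -0.05)

start_state = "s0"
```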

Markov Property

MDPs have the Markov property, which says that the current state tells you everything that can be known about the future from the past. That is, given the current state, the future and the past are independent. We can typically make this hold by pushing enough information into the state so that it encodes all relevant information from the past.

Policies

An agent implements a decision-making process according to a policy function $\pi$, which maps (action $a$, state $s$) pairs to the probability that the agent takes action $a$ when in state $s$. That is, $\pi: \mathcal{A} \times \mathcal{S} \mapsto [0,1]$, where

$\pi(a,s) = P(a_t = a \mid s_t = s)$

This function completely describes the agent’s beliefs about the actions it should take in any state. Note that $p(a \mid s_t=s)$ is a distribution over the action space $\mathcal{A}$; when actually making a decision in state $s$, the agent samples an action from this distribution. This formulation is therefore extremely general, allowing both stochastic policies and deterministic ones (by allocating all probability mass to one action). A small sampling sketch follows.
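A minimal sketch of sampling from a stochastic policy stored per state (the dictionary representation and state/action names are assumptions for illustration):

```python
import random

# A stochastic policy pi(a, s): for each state, a distribution over actions.
policy = {
    "s0": {"stay": 0.1, "move": 0.9},
    "s1": {"stay": 1.0},              # deterministic: all mass on one action
}

def act(state: str) -> str:
    """Sample an action from pi(. | s)."""
    dist = policy[state]
    acts, probs = zip(*dist.items())
    return random.choices(acts, weights=probs)[0]
```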

Quantities

  • Policies can also be seen as a map from states to actions
  • We want to maximize the sum of rewards while prioritizing rewards available now over those available later (there is greater uncertainty about the future)
  • One way to encode this is to define the utility as the sum of future rewards, discounted by a rate $\gamma$ for each step into the future (written out below the list)
  • Values are the expected future utility from a state (a max node)
  • Q-values are the expected future utility from a Q-state (a chance node)
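Written out, the discounted utility of a reward sequence $r_0, r_1, r_2, \dots$ is (a standard formulation stated here for concreteness; the notation is an assumption, not from the original notes):

$U = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^{\infty} \gamma^t r_t, \quad 0 \le \gamma \le 1$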

Optimal Quantities

  • The value of a state is $V^*(s) =$ expected utility starting from state $s$ and acting optimally from that point forward
  • The value of a Q-state is $Q^*(s,a) =$ expected utility starting out taking action $a$ from state $s$ and acting optimally from that point forward (like the regular value, but it factors in an action $a$)
  • The optimal policy $\pi^*(s) =$ the optimal action from state $s$

Value of states

Recursive definition of value, giving the expected utility under optimal action. This is what expectimax computed, except that it wasted a lot of computation by not reusing previously computed values. Note also that so long as $\gamma < 1$, deep parts of the tree will eventually not matter, giving a “natural” depth limit:

$V^*(s) = \max_a Q^*(s,a)$

$Q^*(s,a) = \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V^*(s^\prime)\big]$

$V^*(s) = \max_a \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V^*(s^\prime)\big]$

Time Limited Values

Define $V_k(s)$ to be the optimal value of state $s$ if the game ends in $k$ additional time steps. This is equivalent to a depth-$k$ expectimax computation.

Bellman Equations

The definition of the “optimal utility” under the expectimax recurrence provides a one-step lookahead relationship between optimal utility values. These relationships are described by the Bellman equations, written out above. The simple one-step lookahead is all that’s needed to formulate a recursive computation.

Value Iteration

Value iteration is a fixed point solution method that just computes the optimal values directly according to the Bellman equations.

  • Start with $V_0(s) = 0$: with no time steps left, the expected reward sum is zero
  • Then do one ply of expectimax from each state, given the vector of $V_k(s)$ values: $V_{k+1}(s) \leftarrow \max_a \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V_k(s^\prime)\big]$
  • This is repeated until convergence; each iteration has complexity $O(S^2 A)$
  • These values will converge to the unique optimal values as we increase $k$; real rewards at the bottom of the tree are far out and discounted increasingly heavily as $k$ gets larger
  • This method is quite intuitive: it defines a base case at $V_0$ and works up from there, plugging the values into the equation for each additional step (a sketch follows the list)
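A minimal sketch of value iteration under the dictionary-based MDP representation sketched earlier (the function signature, fixed iteration count, and variable names are assumptions for illustration; in practice you would iterate until the values stop changing):

```python
def value_iteration(states, actions, T, reward, gamma=0.9, iterations=100):
    """Compute V*(s) by repeatedly applying the Bellman update."""
    V = {s: 0.0 for s in states}  # V_0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in states:
            # One ply of expectimax: max over actions of the expected discounted value.
            V_new[s] = max(
                sum(p * (reward(s, a, s2) + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions
            )
        V = V_new
    return V
```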

Policy Methods

  • Fixed policies: a function $\pi(s)$ mapping from states to actions. Under a fixed policy you just select the action returned by the policy, which simplifies the game tree: there is only one branch from each max node, but when the environment is stochastic that action can still take you to a number of different states.
    • The utility of a state $s$ under a fixed policy $\pi$ is $V^\pi(s) = \sum_{s^\prime} T(s,\pi(s),s^\prime)\big[R(s,\pi(s),s^\prime) + \gamma V^\pi(s^\prime)\big]$. Note that there is no longer a max over the actions; the fixed policy gives that action.

Policy Evaluation

  • We can calculate the values for a fixed policy $\pi$ by turning the recursion into updates, just like value iteration: start with $V_0^\pi(s) = 0$, then plug in successive $V_k^\pi$’s and increase $k$.
  • We can also just solve the linear system of equations directly now that the maxes are gone (a sketch follows the list).
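A sketch of the direct linear-solve approach: with no max, the fixed-policy values satisfy $V^\pi = R^\pi + \gamma T^\pi V^\pi$, which is linear in $V^\pi$. The numpy usage and names below are illustrative assumptions, matching the earlier dictionary-based representation and a deterministic policy mapping each state to one action.

```python
import numpy as np

def evaluate_policy(states, policy, T, reward, gamma=0.9):
    """Solve (I - gamma * T_pi) V = R_pi for the fixed-policy values V^pi."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    T_pi = np.zeros((n, n))   # transition matrix under the fixed policy
    R_pi = np.zeros(n)        # expected immediate reward under the fixed policy
    for s in states:
        a = policy[s]         # deterministic fixed policy: state -> action
        for s2, p in T[(s, a)].items():
            T_pi[idx[s], idx[s2]] = p
            R_pi[idx[s]] += p * reward(s, a, s2)
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: V[idx[s]] for s in states}
```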

Policy Extraction

  • If we have the optimal values $V^*(s)$, we need to perform a one-step expectimax in order to extract the policy implied by those values: $\pi^*(s) = \arg\max_a \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V^*(s^\prime)\big]$
    • That is, to find the best action to take in the current state, we compute the above sum to get the expected sum of discounted rewards for each action, and then choose the action that yields the maximum.
  • If we have the optimal Q-values $Q^*(s,a)$, it’s very easy to recover the optimal action: $\pi^*(s) = \arg\max_a Q^*(s,a)$. The Q-values already hold the expected utility of each action separately (no maximization over actions has been taken yet), so all that remains is selecting the action with the maximum Q-value (see the sketch after this list).
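A sketch of policy extraction from converged values $V^*(s)$ via a one-step lookahead (names match the earlier sketches and are assumptions for illustration):

```python
def extract_policy(states, actions, T, reward, V, gamma=0.9):
    """pi*(s) = argmax_a sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')]."""
    policy = {}
    for s in states:
        # Depth-one expectimax: score each action by its expected discounted value.
        policy[s] = max(
            actions,
            key=lambda a: sum(p * (reward(s, a, s2) + gamma * V[s2])
                              for s2, p in T[(s, a)].items()),
        )
    return policy
```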

Policy Iteration

Policy iteration uses what we saw for fixed policies in policy evaluation, and uses those estimated values to update the policy itself. This takes place according to a two-step process: we first assume a fixed policy and evaluate it, and then use those estimates to reassign optimal actions in the policy.

  • Step 1: Policy evaluation - calculate utilities for some fixed policy until convergence (note these aren’t optimal utilities): $V_{k+1}^{\pi_i}(s) \leftarrow \sum_{s^\prime} T(s,\pi_i(s),s^\prime)\big[R(s,\pi_i(s),s^\prime) + \gamma V_k^{\pi_i}(s^\prime)\big]$
  • Step 2: Policy improvement - update the policy using a one-step lookahead with the converged utilities as future values: $\pi_{i+1}(s) = \arg\max_a \sum_{s^\prime} T(s,a,s^\prime)\big[R(s,a,s^\prime) + \gamma V^{\pi_i}(s^\prime)\big]$
  • Repeat until the policy converges (a sketch follows the list)
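A sketch of policy iteration that alternates the two steps above, reusing the `evaluate_policy` and `extract_policy` sketches from earlier (all names are assumptions for illustration):

```python
def policy_iteration(states, actions, T, reward, gamma=0.9):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    pi = {s: actions[0] for s in states}  # arbitrary initial policy
    while True:
        V = evaluate_policy(states, pi, T, reward, gamma)               # Step 1: evaluation
        new_pi = extract_policy(states, actions, T, reward, V, gamma)   # Step 2: improvement
        if new_pi == pi:
            return pi, V
        pi = new_pi
```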

Summary

  • Value iteration allows us to approximate the Bellman equations by incrementally solving for state values. We work up from the bottom, $V_0(s) = 0$, and iteratively plug these values in until our policy converges.
    • Note that while value iteration does not concern itself with tracking a policy explicitly, we are implicitly recomputing a policy each iteration by taking the maximum over the actions. As we update our state values, an (implicit) policy updates with it.
  • Policy extraction extracts an optimal policy from a given set of state values $V^*(s)$. This method just defines the way that we would go about pulling out the best action by using a depth-one sum on our pre-computed state values.
    • This could in theory be coupled with the value iteration technique to pull out an explicit policy from the converged state value estimates.
  • Policy evaluation evaluates a fixed policy by simply following that policy through the Bellman equations. The policy mapping from states to a single action greatly reduces the complexity of the computation since there is only one branch to follow.
  • Policy iteration explicitly computes an optimal policy by computing state values under a fixed policy using policy evaluation, and then uses the estimated values to explicitly recompute the optimal actions for the policy.
  • Solving MDPs is entirely an offline planning task; you need all the MDP details, and you compute all the values and/or optimal policies before actually acting in the environment or playing the game. Reinforcement learning deals with learning how to solve MDPs by actually acting in an environment when we don’t have all the underlying MDP details beforehand.