Example 3.8: Gridworld


Example 3.8: Gridworld. Figure (a) uses a rectangular grid to illustrate value functions for a simple finite MDP. The cells of the grid correspond to the states of the environment. At each cell, four actions are possible: north, south, east, and west, which deterministically cause the agent to move one cell in the respective direction on the grid. Actions that would take the agent off the grid leave its location unchanged, but also result in a reward of −1. Other actions result in a reward of 0, except those that move the agent out of the special states A and B. From state A, all four actions yield a reward of +10 and take the agent to A'. From state B, all actions yield a reward of +5 and take the agent to B'.

Suppose the agent selects all four actions with equal probability in all states. Figure (b) shows the value function, $V^\pi$, for this policy, for the discounted reward case with γ = 0.9. This value function was computed by solving the system of equations (3.10).

In short: as shown in Figure (a), there is a 5×5 grid. An agent moves through these cells one step at a time in one of four directions: north, south, east, or west, each chosen with equal probability 0.25. If the agent is on an edge and tries to step off the grid, it is bounced back to its original cell and receives a reward of −1; every other move from one cell to another gives a reward of 0. There are also two special states, A and B: from A the agent always moves to A' and receives a reward of +10, and from B it always moves to B' and receives a reward of +5. The discount factor is γ = 0.9.

The value function is computed from equation (3.10):

$$V^\pi(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^\pi(s') \right]$$

In this problem, $\pi(s,a)$ is the probability of choosing action a in a given cell, which is 0.25 for every action, and $P^{a}_{ss'}$ is the probability of reaching state s' from state s under action a. Since each action moves the agent to its successor state deterministically, $P^{a}_{ss'} = 1$ and the equation simplifies to:

$$V^\pi(s) = \sum_{a} \pi(s,a) \left( R^{a}_{ss'} + \gamma V^\pi(s') \right)$$
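Since the policy is fixed, these are 25 linear equations in the 25 unknowns $V^\pi(s)$, one per cell, so they can also be solved exactly rather than by iteration, which is how the book obtains Figure (b). The sketch below is my own illustration of that direct approach, not code from the original post; the names N, GAMMA, idx, A_PRIME, B_PRIME and the row-major ordering of the states are arbitrary choices.

import numpy as np

N, GAMMA = 5, 0.9
A, A_PRIME = (0, 1), (4, 1)   # special states as (row, col), row 0 at the top
B, B_PRIME = (0, 3), (2, 3)

def idx(row, col):
    # Flatten a (row, col) cell into a state index 0..24.
    return row * N + col

# Bellman equations for the equiprobable random policy: v = r + GAMMA * P v,
# i.e. solve the linear system (I - GAMMA * P) v = r.
P = np.zeros((N * N, N * N))
r = np.zeros(N * N)
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # north, south, west, east

for row in range(N):
    for col in range(N):
        s = idx(row, col)
        for dr, dc in moves:
            if (row, col) == A:
                nr, nc, reward = *A_PRIME, 10.0   # every action: +10, go to A'
            elif (row, col) == B:
                nr, nc, reward = *B_PRIME, 5.0    # every action: +5, go to B'
            elif 0 <= row + dr < N and 0 <= col + dc < N:
                nr, nc, reward = row + dr, col + dc, 0.0
            else:
                nr, nc, reward = row, col, -1.0   # bounced off the wall
            P[s, idx(nr, nc)] += 0.25
            r[s] += 0.25 * reward

v = np.linalg.solve(np.eye(N * N) - GAMMA * P, r)
print(v.reshape(N, N).round(1))

The printed 5×5 array should match the values in Figure (b) to one decimal place.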

The code is as follows:

import numpy as np

# 7x7 array: the outer ring of -1 cells stands for "off the grid";
# the 5x5 gridworld itself lives in arr[1:6, 1:6].
arr = np.zeros([7, 7])
for i in range(7):
    arr[i][0] = -1
    arr[0][i] = -1
    arr[i][6] = -1
    arr[6][i] = -1

def calculatev(i, j):
    # Special state A at (1,2): every action gives +10 and moves to A' at (5,2).
    if i == 1 and j == 2:
        arr[i][j] = 10 + 0.9 * arr[5][2]
        return arr[i][j]
    # Special state B at (1,4): every action gives +5 and moves to B' at (3,4).
    if i == 1 and j == 4:
        arr[i][j] = 5 + 0.9 * arr[3][4]
        return arr[i][j]
    # Average over the four neighbours (border cells contribute their -1 value).
    z = 0.25 * 0.9 * (arr[i-1][j] + arr[i+1][j] + arr[i][j-1] + arr[i][j+1])
    # On an edge the agent also stays in place for each blocked action.
    if i == 1 or i == 5:
        z = z + 0.25 * 0.9 * arr[i][j]
    if j == 1 or j == 5:
        z = z + 0.25 * 0.9 * arr[i][j]
    arr[i][j] = z
    return arr[i][j]

# Sweep the grid repeatedly until the values (approximately) converge.
for k in range(100):
    for i in range(1, 6):
        for j in range(1, 6):
            arr[i][j] = calculatev(i, j)

The result:

arr[1:6,1:6]:
array([[ 3.5618217 ,  8.98001683,  4.58600804,  5.44132861,  1.70739673],
       [ 1.72665845,  3.14095383,  2.37490449,  2.03375386,  0.73230784],
       [ 0.24460358,  0.87821502,  0.79441533,  0.49036512, -0.21875692],
       [-0.76235002, -0.27679488, -0.21274981, -0.43001174, -0.97616901],
       [-1.593681  , -1.13331464, -1.03316344, -1.21260961, -1.71359125]])

This is close to Figure (b), but not identical. The reason is the border trick in the code above: when the agent bumps into the wall, the update picks up 0.9 × (−1) = −0.9 from the padded border cell instead of the true reward of −1, so every value comes out slightly higher than in the figure.
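If you want the iterative sweep above to reproduce Figure (b) exactly, the wall-bounce case has to contribute the true reward of −1 rather than the −0.9 that comes from discounting the padded border cell. One way to do this, sketched in the spirit of the code above, is to replace everything in calculatev after the two special-state checks with an explicit sum over the four actions:

# Replaces the body of calculatev after the A and B checks.
z = 0.0
for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:   # north, south, west, east
    ni, nj = i + di, j + dj
    if 1 <= ni <= 5 and 1 <= nj <= 5:
        z += 0.25 * (0 + 0.9 * arr[ni][nj])    # ordinary move, reward 0
    else:
        z += 0.25 * (-1 + 0.9 * arr[i][j])     # bounced off the wall, reward -1
arr[i][j] = z
return arr[i][j]

With this change the −1 border cells are never read, and repeated sweeps should converge to the values shown in Figure (b).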

Notice the negative values near the lower edge; these are the result of the high probability of hitting the edge of the grid there under the random policy. State A is the best state to be in under this policy, but its expected return is less than 10, its immediate reward, because from A the agent is taken to A', from which it is likely to run into the edge of the grid. State B, on the other hand, is valued more than 5, its immediate reward, because from B the agent is taken to B', which has a positive value. From B' the expected penalty (negative reward) for possibly running into an edge is more than compensated for by the expected gain for possibly stumbling onto A or B.

