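To form control methods, we need to couple such action-value prediction methods with techniques for policy improvement and action selection. Policy improvement is done by changing the estimation policy to a soft approximation of the greedy policy, such as the $\epsilon$-greedy policy. Actions are selected according to this same policy.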
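As a minimal sketch of $\epsilon$-greedy action selection, assuming the action values for the current state are already available as a plain list (the function name `epsilon_greedy` and its parameters are illustrative, not taken from the notes):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of action-value estimates.

    With probability epsilon a uniformly random action is chosen;
    otherwise a greedy action is chosen, breaking ties at random.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    greedy_actions = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(greedy_actions)
```

With `epsilon=0` the selection is purely greedy, and with `epsilon=1` it is uniformly random; the same function serves as both the policy being improved and the behavior policy.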
Example 10.1: Mountain Car Task
The difficulty of this problem is that gravity is stronger than the car's engine, so even at full throttle the car cannot drive up the steep slope. The only solution is to first move away from the goal and up the opposite slope on the left. Then, by applying full throttle the car can build up enough inertia to carry it up the steep slope even though it is slowing down the whole way. In a sense, things have to get worse (farther from the goal) before they can get better.
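The claim that full throttle alone cannot reach the goal can be checked with a small simulation. The sketch below assumes the standard Mountain Car dynamics (velocity changed by `0.001 * action - 0.0025 * cos(3x)`, velocity bounded to ±0.07, position bounded to [-1.2, 0.5], velocity reset to zero at the left bound) and a start at position -0.5 with zero velocity. These constants are not given in the notes, and the two hand-written policies are illustrative only.

```python
import math

GOAL_X = 0.5  # assumed goal position at the top of the right hill

def step(x, v, action):
    """One step of the assumed dynamics; action is -1 (reverse), 0, or +1 (forward)."""
    v = v + 0.001 * action - 0.0025 * math.cos(3 * x)
    v = max(-0.07, min(0.07, v))
    x = x + v
    if x <= -1.2:          # hitting the left bound stops the car
        x, v = -1.2, 0.0
    return x, v

def steps_to_goal(policy, x=-0.5, v=0.0, max_steps=1000):
    """Return the number of steps needed to reach the goal, or None if it is not reached."""
    for t in range(1, max_steps + 1):
        x, v = step(x, v, policy(x, v))
        if x >= GOAL_X:
            return t
    return None

def full_throttle(x, v):
    return 1                      # always push toward the goal

def rock_back_and_forth(x, v):
    return 1 if v >= 0 else -1    # push in the current direction of motion

print(steps_to_goal(full_throttle))        # goal not reached within max_steps
print(steps_to_goal(rock_back_and_forth))  # reaches the goal by first backing up
```

The second policy only builds momentum by moving away from the goal first, which is the "worse before better" behavior described above.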
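- All true values in this task are negative, while the initial action values are all zero, which is optimistic. This causes extensive exploration even when the exploration parameter $\epsilon$ is 0.
- We do not use Monte Carlo methods here because an episode in this problem can be very long, or may never terminate, in which case learning could not proceed.
- In Figure 10.4, larger $n$ shows a higher standard deviation than smaller $n$. The reason is that an $n$-step method must wait $n$ steps before it can update, and each update reaches more states. Early on the agent takes some very poor actions; the larger $n$ is, the more states these poor estimates propagate to, so the early estimates for some good states end up far from their true values, which increases the variance.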
The average-reward setting is one of the major settings commonly considered in the classical theory of dynamic programming and less-commonly in reinforcement learning. The discounted setting is problematic with function approximation, and thus the average-reward setting is needed to replace it.
In particular, we consider all policies that attain the maximal value of $r(\pi)$ to be optimal.
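For reference, $r(\pi)$ is the average rate of reward while following policy $\pi$. A standard way to write it, assuming a steady-state distribution $\mu_\pi$ exists and using the usual dynamics function $p$ (neither symbol is defined in these notes), is:

$$
r(\pi) \doteq \lim_{h\to\infty} \frac{1}{h}\sum_{t=1}^{h} \mathbb{E}\left[R_t \mid S_0, A_{0:t-1}\sim\pi\right] = \sum_s \mu_\pi(s)\sum_a \pi(a\mid s)\sum_{s',r} p(s',r\mid s,a)\, r
$$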
Example 10.2: Access-Control Queuing Task
The continuing, discounted problem formulation has been very useful in the tabular case, in which the returns from each state can be separately identified and averaged. But in the approximate case it is questionable whether one should ever use this problem formulation.
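To see why, consider an infinitely long sequence of rewards with no beginning or end, and no clearly identified states. The states might be represented only by feature vectors, which may do little to distinguish the states from each other. As a special case, all of the feature vectors may be the same. Thus one really has only the reward sequence (and the actions), and performance has to be assessed purely from these. How could it be done? One way is to average the rewards over a long interval, which is the idea behind the average-reward setting. How could discounting be used? For each time step we could compute the discounted return. Some returns would be large and some small, so we would again have to average them over a sufficiently long interval. In a continuing problem there are no starts and ends, and no special time steps. In fact, for policy $\pi$, the average of the discounted returns is always $r(\pi)/(1-\gamma)$; that is, it is essentially the average reward. In particular, the ordering of all policies in the average discounted return setting would be exactly the same as in the average-reward setting.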
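The relation between the averaged discounted return and $r(\pi)/(1-\gamma)$ can be checked numerically. The sketch below makes a simplifying assumption that is not in the notes: the endless reward sequence is modeled as a fixed cycle that repeats forever, so the average over all time steps equals the average over one cycle. The function name `discounted_returns` is illustrative.

```python
def discounted_returns(rewards, gamma):
    """Discounted return from each position of an endlessly repeating reward cycle."""
    period = len(rewards)
    # Return from position 0: one discounted pass over the cycle,
    # scaled by the geometric series over all later repetitions.
    one_pass = sum(gamma**k * r for k, r in enumerate(rewards))
    g_next = one_pass / (1.0 - gamma**period)
    returns = [0.0] * period
    # Work backwards with G_t = R_t + gamma * G_{t+1}; position `period` wraps to 0.
    for t in reversed(range(period)):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns

rewards = [0.0, 0.0, 1.0, 5.0, -2.0]   # arbitrary illustrative reward cycle
gamma = 0.9
returns = discounted_returns(rewards, gamma)

average_return = sum(returns) / len(returns)
average_reward = sum(rewards) / len(rewards)
print(average_return, average_reward / (1.0 - gamma))  # the two values agree up to rounding
```

The individual returns differ from position to position, but their average depends on the rewards only through the average reward, so $\gamma$ rescales every policy's score by the same factor and cannot change the ordering.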
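This shows that discounting plays no role in the definition of the problem for continuing tasks. We can still use discounting in solution methods, however. Unfortunately, discounting algorithms with function approximation do not optimize discounted value over the on-policy distribution, and thus are not guaranteed to optimize average reward (in the tabular case they are).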
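In fact, the lack of a policy improvement theorem is also a theoretical gap in the episodic and average-reward settings. Once function approximation is introduced, we can no longer guarantee policy improvement in any setting. In Chapter 13 we introduce an alternative class of reinforcement learning algorithms based on parameterized policies, and there we have a theoretical guarantee called the “policy-gradient theorem” which plays a similar role as the policy improvement theorem. But for methods that learn action values, there is currently no local improvement guarantee (possibly the approach taken by Perkins and Precup (2003) may provide a part of the answer). What we do know is that $\epsilon$-greedification can sometimes result in an inferior policy, because the algorithm may chatter among several good policies rather than converge to one of them (Gordon, 1996).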
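In the algorithm above, the step-size parameter $\beta$ on the average reward needs to be quite small so that $\bar{R}$ becomes a good long-term estimate of the average reward. Unfortunately, $\bar{R}$ will then be biased by its initial value for many steps, which may make learning inefficient. Alternatively, one could use a sample average of the observed rewards for $\bar{R}$. That would adapt rapidly at first but would also adapt slowly in the long run. As the policy slowly changes, $\bar{R}$ changes as well, and this potential for long-term nonstationarity makes sample-average methods ill-suited. In fact, the step-size parameter on the average reward is a perfect place to use the unbiased constant-step-size trick from Exercise 2.7.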
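A minimal sketch of the unbiased constant-step-size trick follows. It uses the step size $\beta_n = \beta / \bar{o}_n$ with $\bar{o}_n = \bar{o}_{n-1} + \beta(1 - \bar{o}_{n-1})$ and $\bar{o}_0 = 0$. For simplicity it assumes $\bar{R}$ tracks the observed rewards directly rather than being updated from a TD error, and the names `UnbiasedStepSize` and `track_average_reward` are illustrative.

```python
class UnbiasedStepSize:
    """Generates step sizes beta / o_n, where o_n is a trace that starts at 0 and tends to 1."""

    def __init__(self, beta):
        self.beta = beta
        self.trace = 0.0

    def next(self):
        self.trace += self.beta * (1.0 - self.trace)
        return self.beta / self.trace

def track_average_reward(rewards, beta, initial, unbiased):
    """Update an average-reward estimate with a plain or an unbiased constant step size."""
    estimate = initial
    stepper = UnbiasedStepSize(beta)
    for reward in rewards:
        step = stepper.next() if unbiased else beta
        estimate += step * (reward - estimate)
    return estimate

rewards = [2.0] * 50   # illustrative constant reward stream
print(track_average_reward(rewards, beta=0.01, initial=100.0, unbiased=False))
print(track_average_reward(rewards, beta=0.01, initial=100.0, unbiased=True))
```

The first step size equals 1, so the arbitrary initial value is overwritten by the first observation, while later step sizes approach $\beta$ and the estimate can still follow a slowly changing policy.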