蒙特卡洛树学习笔记
1. 强化学习(RL)
概念
强化学习是机器学习中的一个领域,强调如何基于环境而行动,以取得最大化的预期利益。其灵感来源于心理学中的行为主义理论,即有机体如何在环境给予的奖励或惩罚的刺激下,逐步形成对刺激的预期,产生能获得最大利益的习惯性行为。这个方法具有普适性,因此在其他许多领域都有研究,例如博弈论、控制论、运筹学、信息论、仿真优化、多主体系统学习、群体智能、统计学以及遗传算法。在运筹学和控制理论研究的语境下,强化学习被称作“近似动态规划”(approximate dynamic programming,ADP)。在最优控制理论中也有研究这个问题,虽然大部分的研究是关于最优解的存在和特性,并非是学习或者近似方面。在经济学和博弈论中,强化学习被用来解释在有限理性的条件下如何出现平衡。
在机器学习问题中,环境通常被规范为马尔可夫决策过程(MDP),所以许多强化学习算法在这种情况下使用动态规划技巧。传统的技术和强化学习算法的主要区别是,后者不需要关于MDP的知识,而且针对无法找到确切方法的大规模MDP。
强化学习和标准的监督式学习之间的区别在于,它并不需要出现正确的输入/输出对,也不需要精确校正次优化的行为。强化学习更加专注于在线规划,需要在探索(在未知的领域)和遵从(现有知识)之间找到平衡。强化学习中的“探索-遵从”的交换,在多臂老虎机(英语:multi-armed bandit)问题和有限MDP中研究得最多。
强化学习的基本组件
- 
环境/状态(标准的为静态stationary,对应的non-stationary) 
- 
agent(与环境交互的对象) 
- 
动作(action space,环境下可行的动作集合,离散/连续) 
- 
反馈(回报,reward,正是有了反馈,RL才能迭代,才会学习到策略链) 
2. 马尔可夫决策过程(MDP)
马尔可夫过程
在概率论及统计学中,马尔可夫过程(Markov process)又叫马尔可夫链(Markov Chain),是一个具备了马尔可夫性质的随机过程,因为俄国数学家安德雷·马尔可夫得名。马尔可夫过程是不具备记忆特质的(memorylessness)。换言之,马尔可夫过程的条件概率仅仅与系统的当前状态相关,而与它的过去历史或未来状态,都是独立、不相关的。马尔可夫过程可以用一个元组<S,P>表示,其中S是有限数量的状态集,P是状态转移概率矩阵。
马尔可夫奖励过程
马尔可夫奖励过程(Markov Reward Process)在马尔可夫过程的基础上增加了奖励R和衰减系数γ:<S,P,R,γ>。R是一个奖励函数。S状态下的奖励是某一时刻(t)处在状态s下在下一个时刻(t+1)能获得的奖励期望(当进入某个状态会获得相应的奖励)。
马尔可夫决策过程
相较于马尔可夫奖励过程,马尔可夫决策过程(Markov Decision Process)多了一个行为集合A,它是这样的一个元组: <S, A, P, R, γ>。看起来很类似马尔可夫奖励过程,但这里的P和R都与具体的行为a对应,而不像马尔可夫奖励过程那样仅对应于某个状态,A表示的是有限的行为的集合。
RL与MDP
在强化学习中,马尔可夫决策过程是对完全可观测的环境进行描述的,也就是说观测到的状态内容完整地决定了决策的需要的特征。几乎所有的强化学习问题都可以转化为MDP。
引用自:https://zh.wikipedia.org/zh-hans/马可夫过程 | https://zhuanlan.zhihu.com/p/28084942
3. 蒙特卡洛方法(MCM)
MCM简介
蒙特卡罗方法(Monte Carlo Method),也称统计模拟方法,是1940年代中期由于科学技术的发展和电子计算机的发明,而提出的一种以概率统计理论为指导的数值计算方法。是指使用随机数(或更常见的伪随机数)来解决很多计算问题的方法。
20世纪40年代,在冯·诺伊曼,斯塔尼斯拉夫·乌拉姆和尼古拉斯·梅特罗波利斯在洛斯阿拉莫斯国家实验室为核武器计划工作时,发明了蒙特卡罗方法。因为乌拉姆的叔叔经常在摩纳哥的蒙特卡洛赌场输钱得名,而蒙特卡罗方法正是以概率为基础的方法。
通常蒙特卡罗方法可以粗略地分成两类:一类是所求解的问题本身具有内在的随机性,借助计算机的运算能力可以直接模拟这种随机的过程。例如在核物理研究中,分析中子在反应堆中的传输过程。中子与原子核作用受到量子力学规律的制约,人们只能知道它们相互作用发生的概率,却无法准确获得中子与原子核作用时的位置以及裂变产生的新中子的行进速率和方向。科学家依据其概率进行随机抽样得到裂变位置、速度和方向,这样模拟大量中子的行为后,经过统计就能获得中子传输的范围,作为反应堆设计的依据。
另一种类型是所求解问题可以转化为某种随机分布的特征数,比如随机事件出现的概率,或者随机变量的期望值。通过随机抽样的方法,以随机事件出现的频率估计其概率,或者以抽样的数字特征估算随机变量的数字特征,并将其作为问题的解。这种方法多用于求解复杂的多维积分问题。
一个例子
使用蒙特卡罗方法估算π值。放置30000个随机点后,π的估算值与真实值相差0.07%。

前景
就单纯的用蒙特卡洛方法来下棋(最早在1993年被提出,后在2001被再次提出),我们可以简单的用随机比赛的方式来评价某一步落子。从需要评价的那一步开始,双方随机落子,直到一局比赛结束。为了保证结果的准确性,这样的随机对局通常需要进行上万盘,记录下每一盘的结果,最后取这些结果的平均,就能得到某一步棋的评价。最后要做的就是取评价最高的一步落子作为接下来的落子。也就是说为了决定一步落子就需要程序自己进行上万局的随机对局,这对随机对局的速度也提出了一定的要求。和使用了大量围棋知识的传统方法相比,这种方法的好处显而易见,就是几乎不需要围棋的专业知识,只需通过大量的随机对局就能估计出一步棋的价值。再加上一些优化方法,基于纯蒙特卡洛方法的围棋程序已经能够匹敌最强的传统围棋程序。
既然蒙特卡洛的路似乎充满着光明,我们就应该沿着这条路继续前行。MCTS也就是将以上想法融入到树搜索中,利用树结构来更加高效的进行节点值的更新和选择。
引用自:https://blog.csdn.net/natsu1211/article/details/50986810
4. 蒙特卡洛树搜索(MCTS)
MCTS简介
蒙特卡洛树搜索(Monte Carlo tree search;MCTS)是一种用于某些决策过程的启发式搜索算法,最引人注目的是在游戏中的使用。一个主要例子是电脑围棋程序,它也用于其他棋盘游戏、即时电子游戏以及不确定性游戏。
搜索步骤
解释一
- 
选举(selection)是根据当前获得所有子步骤的统计结果,选择一个最优的子步骤。 
- 
扩展(expansion)在当前获得的统计结果不足以计算出下一个步骤时,随机选择一个子步骤。 
- 
模拟(simulation)模拟游戏,进入下一步。 
- 
反向传播(Back-Propagation)根据游戏结束的结果,计算对应路径上统计记录的值。 
解释二
- 
选择(Selection):从根结点R开始,选择连续的子结点向下至叶子结点L。后面给出了一种选择子结点的方法,让游戏树向最优的方向扩展,这是蒙特卡洛树搜索的精要所在。 
- 
扩展(Expansion):除非任意一方的输赢使得游戏在L结束,否则创建一个或多个子结点并选取其中一个结点C。 
- 
仿真(Simulation):在从结点C开始,用随机策略进行游戏,又称为playout或者rollout。 
- 
反向传播(Backpropagation):使用随机游戏的结果,更新从C到R的路径上的结点信息。 
图解
详细算法
在开始阶段,搜索树只有一个节点,也就是我们需要决策的局面。
搜索树中的每一个节点包含了三个基本信息:代表的局面,被访问的次数,累计评分。
- 
选择(Selection) 在选择阶段,需要从根节点,也就是要做决策的局面R出发向下选择出一个最急迫需要被拓展的节点N,局面R是是每一次迭代中第一个被检查的节点; 对于被检查的局面而言,他可能有三种可能: - 该节点所有可行动作都已经被拓展过
- 该节点有可行动作还未被拓展过
- 这个节点游戏已经结束了(例如已经连成五子的五子棋局面)
 对于这三种可能: - 如果所有可行动作都已经被拓展过了,那么我们将使用UCB公式计算该节点所有子节点的UCB值,并找到值最大的一个子节点继续检查。反复向下迭代。
- 如果被检查的局面依然存在没有被拓展的子节点(例如说某节点有20个可行动作,但是在搜索树中才创建了19个子节点),那么我们认为这个节点就是本次迭代的的目标节点N,并找出N还未被拓展的动作A。执行步骤[2]
- 如果被检查到的节点是一个游戏已经结束的节点。那么从该节点直接执行步骤[4]。
 每一个被检查的节点的被访问次数在这个阶段都会自增。 在反复的迭代之后,我们将在搜索树的底端找到一个节点,来继续后面的步骤。 
- 
拓展(Expansion) 在选择阶段结束时候,我们查找到了一个最迫切被拓展的节点N,以及他一个尚未拓展的动作A。在搜索树中创建一个新的节点$N_n$作为N的一个新子节点。$N_n$的局面就是节点N在执行了动作A之后的局面。 
- 
模拟(Simulation) 为了让$N_n$得到一个初始的评分。我们从$N_n$开始,让游戏随机进行,直到得到一个游戏结局,这个结局将作为$N_n$的初始评分。一般使用胜利/失败来作为评分,只有1或者0。 
- 
反向传播(Backpropagation) 在$N_n$的模拟结束之后,它的父节点N以及从根节点到N的路径上的所有节点都会根据本次模拟的结果来添加自己的累计评分。如果在[1]的选择中直接发现了一个游戏结局的话,根据该结局来更新评分。 每一次迭代都会拓展搜索树,随着迭代次数的增加,搜索树的规模也不断增加。当到了一定的迭代次数或者时间之后结束,选择根节点下最好的子节点作为本次决策的结果。 
引用自:https://www.zhihu.com/question/39916945/answer/83799720
另可参考:https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
算法伪代码


引用自:https://blog.csdn.net/u014397729/article/details/27366363
5. 上限置信区间算法(UCT)
UCB1
$$\frac{w_i}{n_i}+c \sqrt{ \frac{ln\ t}{n_i} }$$在该式中:
- $w_{i}$ 代表第 $i$ 次移动后取胜的次数;
- $n_i$ 代表第 $i$ 次移动后仿真的次数;
- $c$ 为探索参数—理论上等于 $\sqrt {2}$;在实际中通常可凭经验选择;
- $t$ 代表仿真总次数,等于所有 $n_i$ 的和。
其中,C越大,就会越照顾访问次数相对较少的子节点。
UCT介绍
UCT 算法(Upper Confidence Bound Apply to Tree)即上限置信区间算法,是一种博弈树搜索算法,该算法将蒙特卡洛树搜索方法与UCB公式结合,在超大规模博弈树的搜索过程中相对于传统的搜索算法有着时间和空间方面的优势。
即:MCTS + UCB1 = UCT
算法中的 UCB 公式可替换为:UCB1-tuned 等
引用自:https://blog.csdn.net/xbinworld/article/details/79372777
优点
MCTS 提供了比传统树搜索更好的方法。
- 
Aheuristic 启发式 MCTS 不要求任何关于给定的领域策略或者具体实践知识来做出合理的决策。这个算法可以在没有任何关于博弈游戏除基本规则外的知识的情况下进行有效工作;这意味着一个简单的MCTS 实现可以重用在很多的博弈游戏中,只需要进行微小的调整,所以这也使得 MCTS 是对于一般的博弈游戏的很好的方法。 
- 
Asymmetric 非对称 MCTS 执行一种非对称的树的适应搜索空间拓扑结构的增长。这个算法会更频繁地访问更加有趣的节点,并聚焦其搜索时间在更加相关的树的部分。这使得 MCTS 更加适合那些有着更大的分支因子的博弈游戏,比如说 19X19 的围棋。这么大的组合空间会给标准的基于深度或者宽度的搜索方法带来问题,所以MCTS 的适应性说明它(最终)可以找到那些更加优化的行动,并将搜索的工作聚焦在这些部分。 
- 
任何时间 算法可以在任何时间终止,并返回当前最有的估计。当前构造出来的搜索树可以被丢弃或者供后续重用。(对比dfs暴力搜索) 
- 
简洁 算法实现非常方便( http://mcts.ai/code/python.html ) 
缺点
MCTS 有缺点很少,但这些缺点也可能是非常关键的影响因素。
- 
行为能力 MCTS 算法,根据其基本形式,在某些甚至不是很大的博弈游戏中在可承受的时间内也不能够找到最好的行动方式。这基本上是由于组合步的空间的全部大小所致,关键节点并不能够访问足够多的次数来给出合理的估计。 
- 
速度 MCTS 搜索可能需要足够多的迭代才能收敛到一个很好的解上,这也是更加一般的难以优化的应用上的问题。例如,最佳的围棋程序可能需要百万次的交战和领域最佳和强化才能得到专家级的行动方案,而最有的GGP 实现对更加复杂的博弈游戏可能也就只要每秒钟数十次(领域无关的)交战。对可承受的行动时间,这样的GGP 可能很少有时间访问到每个合理的行动,所以这样的情形也不大可能出现表现非常好的搜索。 
6. 详细示例代码
  1# This is a very simple implementation of the UCT Monte Carlo Tree Search algorithm in Python 2.7 (convert to Python3).
  2# The function UCT(rootstate, itermax, verbose = False) is towards the bottom of the code.
  3# It aims to have the clearest and simplest possible code, and for the sake of clarity, the code
  4# is orders of magnitude less efficient than it could be made, particularly by using a 
  5# state.GetRandomMove() or state.DoRandomRollout() function.
  6# 
  7# Example GameState classes for Nim, OXO and Othello are included to give some idea of how you
  8# can write your own GameState use UCT in your 2-player game. Change the game to be played in 
  9# the UCTPlayGame() function at the bottom of the code.
 10# 
 11# Written by Peter Cowling, Ed Powley, Daniel Whitehouse (University of York, UK) September 2012.
 12# 
 13# Licence is granted to freely use and distribute for any sensible/legal purpose so long as this comment
 14# remains in any distributed code.
 15# 
 16# For more information about Monte Carlo Tree Search check out our web site at www.mcts.ai
 17
 18from math import *
 19import random
 20
 21
 22class GameState:
 23    """ A state of the game, i.e. the game board. These are the only functions which are
 24        absolutely necessary to implement UCT in any 2-player complete information deterministic 
 25        zero-sum game, although they can be enhanced and made quicker, for example by using a 
 26        GetRandomMove() function to generate a random move during rollout.
 27        By convention the players are numbered 1 and 2.
 28    """
 29
 30    def __init__(self):
 31        self.playerJustMoved = 2  # At the root pretend the player just moved is player 2 - player 1 has the first move
 32
 33    def Clone(self):
 34        """ Create a deep clone of this game state.
 35        """
 36        st = GameState()
 37        st.playerJustMoved = self.playerJustMoved
 38        return st
 39
 40    def DoMove(self, move):
 41        """ Update a state by carrying out the given move.
 42            Must update playerJustMoved.
 43        """
 44        self.playerJustMoved = 3 - self.playerJustMoved
 45
 46    def GetMoves(self):
 47        """ Get all possible moves from this state.
 48        """
 49
 50    def GetResult(self, playerjm):
 51        """ Get the game result from the viewpoint of playerjm. 
 52        """
 53
 54    def __repr__(self):
 55        """ Don't need this - but good style.
 56        """
 57        pass
 58
 59
 60class NimState:
 61    """ A state of the game Nim. In Nim, players alternately take 1,2 or 3 chips with the 
 62        winner being the player to take the last chip. 
 63        In Nim any initial state of the form 4n+k for k = 1,2,3 is a win for player 1
 64        (by choosing k) chips.
 65        Any initial state of the form 4n is a win for player 2.
 66    """
 67
 68    def __init__(self, ch):
 69        self.playerJustMoved = 2  # At the root pretend the player just moved is p2 - p1 has the first move
 70        self.chips = ch
 71
 72    def Clone(self):
 73        """ Create a deep clone of this game state.
 74        """
 75        st = NimState(self.chips)
 76        st.playerJustMoved = self.playerJustMoved
 77        return st
 78
 79    def DoMove(self, move):
 80        """ Update a state by carrying out the given move.
 81            Must update playerJustMoved.
 82        """
 83        assert move >= 1 and move <= 3 and move == int(move)
 84        self.chips -= move
 85        self.playerJustMoved = 3 - self.playerJustMoved
 86
 87    def GetMoves(self):
 88        """ Get all possible moves from this state.
 89        """
 90        return list(range(1, min([4, self.chips + 1])))
 91
 92    def GetResult(self, playerjm):
 93        """ Get the game result from the viewpoint of playerjm. 
 94        """
 95        assert self.chips == 0
 96        if self.playerJustMoved == playerjm:
 97            return 1.0  # playerjm took the last chip and has won
 98        else:
 99            return 0.0  # playerjm's opponent took the last chip and has won
100
101    def __repr__(self):
102        s = "Chips:" + str(self.chips) + " JustPlayed:" + str(self.playerJustMoved)
103        return s
104
105
106class OXOState:
107    """ A state of the game, i.e. the game board.
108        Squares in the board are in this arrangement
109        012
110        345
111        678
112        where 0 = empty, 1 = player 1 (X), 2 = player 2 (O)
113    """
114
115    def __init__(self):
116        self.playerJustMoved = 2  # At the root pretend the player just moved is p2 - p1 has the first move
117        self.board = [0, 0, 0, 0, 0, 0, 0, 0, 0]  # 0 = empty, 1 = player 1, 2 = player 2
118
119    def Clone(self):
120        """ Create a deep clone of this game state.
121        """
122        st = OXOState()
123        st.playerJustMoved = self.playerJustMoved
124        st.board = self.board[:]
125        return st
126
127    def DoMove(self, move):
128        """ Update a state by carrying out the given move.
129            Must update playerToMove.
130        """
131        assert move >= 0 and move <= 8 and move == int(move) and self.board[move] == 0
132        self.playerJustMoved = 3 - self.playerJustMoved
133        self.board[move] = self.playerJustMoved
134
135    def GetMoves(self):
136        """ Get all possible moves from this state.
137        """
138        return [i for i in range(9) if self.board[i] == 0]
139
140    def GetResult(self, playerjm):
141        """ Get the game result from the viewpoint of playerjm. 
142        """
143        for (x, y, z) in [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6), (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]:
144            if self.board[x] == self.board[y] == self.board[z]:
145                if self.board[x] == playerjm:
146                    return 1.0
147                else:
148                    return 0.0
149        if self.GetMoves() == []: return 0.5  # draw
150        assert False  # Should not be possible to get here
151
152    def __repr__(self):
153        s = ""
154        for i in range(9):
155            s += ".XO"[self.board[i]]
156            if i % 3 == 2: s += "\n"
157        return s
158
159
160class OthelloState:
161    """ A state of the game of Othello, i.e. the game board.
162        The board is a 2D array where 0 = empty (.), 1 = player 1 (X), 2 = player 2 (O).
163        In Othello players alternately place pieces on a square board - each piece played
164        has to sandwich opponent pieces between the piece played and pieces already on the 
165        board. Sandwiched pieces are flipped.
166        This implementation modifies the rules to allow variable sized square boards and
167        terminates the game as soon as the player about to move cannot make a move (whereas
168        the standard game allows for a pass move). 
169    """
170
171    def __init__(self, sz=8):
172        self.playerJustMoved = 2  # At the root pretend the player just moved is p2 - p1 has the first move
173        self.board = []  # 0 = empty, 1 = player 1, 2 = player 2
174        self.size = sz
175        assert sz == int(sz) and sz % 2 == 0  # size must be integral and even
176        for y in range(sz):
177            self.board.append([0] * sz)
178        self.board[sz / 2][sz / 2] = self.board[sz / 2 - 1][sz / 2 - 1] = 1
179        self.board[sz / 2][sz / 2 - 1] = self.board[sz / 2 - 1][sz / 2] = 2
180
181    def Clone(self):
182        """ Create a deep clone of this game state.
183        """
184        st = OthelloState()
185        st.playerJustMoved = self.playerJustMoved
186        st.board = [self.board[i][:] for i in range(self.size)]
187        st.size = self.size
188        return st
189
190    def DoMove(self, move):
191        """ Update a state by carrying out the given move.
192            Must update playerToMove.
193        """
194        (x, y) = (move[0], move[1])
195        assert x == int(x) and y == int(y) and self.IsOnBoard(x, y) and self.board[x][y] == 0
196        m = self.GetAllSandwichedCounters(x, y)
197        self.playerJustMoved = 3 - self.playerJustMoved
198        self.board[x][y] = self.playerJustMoved
199        for (a, b) in m:
200            self.board[a][b] = self.playerJustMoved
201
202    def GetMoves(self):
203        """ Get all possible moves from this state.
204        """
205        return [(x, y) for x in range(self.size) for y in range(self.size) if
206                self.board[x][y] == 0 and self.ExistsSandwichedCounter(x, y)]
207
208    def AdjacentToEnemy(self, x, y):
209        """ Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
210        """
211        for (dx, dy) in [(0, +1), (+1, +1), (+1, 0), (+1, -1), (0, -1), (-1, -1), (-1, 0), (-1, +1)]:
212            if self.IsOnBoard(x + dx, y + dy) and self.board[x + dx][y + dy] == self.playerJustMoved:
213                return True
214        return False
215
216    def AdjacentEnemyDirections(self, x, y):
217        """ Speeds up GetMoves by only considering squares which are adjacent to an enemy-occupied square.
218        """
219        es = []
220        for (dx, dy) in [(0, +1), (+1, +1), (+1, 0), (+1, -1), (0, -1), (-1, -1), (-1, 0), (-1, +1)]:
221            if self.IsOnBoard(x + dx, y + dy) and self.board[x + dx][y + dy] == self.playerJustMoved:
222                es.append((dx, dy))
223        return es
224
225    def ExistsSandwichedCounter(self, x, y):
226        """ Does there exist at least one counter which would be flipped if my counter was placed at (x,y)?
227        """
228        for (dx, dy) in self.AdjacentEnemyDirections(x, y):
229            if len(self.SandwichedCounters(x, y, dx, dy)) > 0:
230                return True
231        return False
232
233    def GetAllSandwichedCounters(self, x, y):
234        """ Is (x,y) a possible move (i.e. opponent counters are sandwiched between (x,y) and my counter in some direction)?
235        """
236        sandwiched = []
237        for (dx, dy) in self.AdjacentEnemyDirections(x, y):
238            sandwiched.extend(self.SandwichedCounters(x, y, dx, dy))
239        return sandwiched
240
241    def SandwichedCounters(self, x, y, dx, dy):
242        """ Return the coordinates of all opponent counters sandwiched between (x,y) and my counter.
243        """
244        x += dx
245        y += dy
246        sandwiched = []
247        while self.IsOnBoard(x, y) and self.board[x][y] == self.playerJustMoved:
248            sandwiched.append((x, y))
249            x += dx
250            y += dy
251        if self.IsOnBoard(x, y) and self.board[x][y] == 3 - self.playerJustMoved:
252            return sandwiched
253        else:
254            return []  # nothing sandwiched
255
256    def IsOnBoard(self, x, y):
257        return x >= 0 and x < self.size and y >= 0 and y < self.size
258
259    def GetResult(self, playerjm):
260        """ Get the game result from the viewpoint of playerjm. 
261        """
262        jmcount = len([(x, y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == playerjm])
263        notjmcount = len(
264            [(x, y) for x in range(self.size) for y in range(self.size) if self.board[x][y] == 3 - playerjm])
265        if jmcount > notjmcount:
266            return 1.0
267        elif notjmcount > jmcount:
268            return 0.0
269        else:
270            return 0.5  # draw
271
272    def __repr__(self):
273        s = ""
274        for y in range(self.size - 1, -1, -1):
275            for x in range(self.size):
276                s += ".XO"[self.board[x][y]]
277            s += "\n"
278        return s
279
280
281class Node:
282    """ A node in the game tree. Note wins is always from the viewpoint of playerJustMoved.
283        Crashes if state not specified.
284    """
285
286    def __init__(self, move=None, parent=None, state=None):
287        self.move = move  # the move that got us to this node - "None" for the root node
288        self.parentNode = parent  # "None" for the root node
289        self.childNodes = []
290        self.wins = 0
291        self.visits = 0
292        self.untriedMoves = state.GetMoves()  # future child nodes
293        self.playerJustMoved = state.playerJustMoved  # the only part of the state that the Node needs later
294
295    def UCTSelectChild(self):
296        """ Use the UCB1 formula to select a child node. Often a constant UCTK is applied so we have
297            lambda c: c.wins/c.visits + UCTK * sqrt(2*log(self.visits)/c.visits to vary the amount of
298            exploration versus exploitation.
299        """
300        s = sorted(self.childNodes, key=lambda c: c.wins / c.visits + sqrt(2 * log(self.visits) / c.visits))[-1]
301        return s
302
303    def AddChild(self, m, s):
304        """ Remove m from untriedMoves and add a new child node for this move.
305            Return the added child node
306        """
307        n = Node(move=m, parent=self, state=s)
308        self.untriedMoves.remove(m)
309        self.childNodes.append(n)
310        return n
311
312    def Update(self, result):
313        """ Update this node - one additional visit and result additional wins. result must be from the viewpoint of playerJustmoved.
314        """
315        self.visits += 1
316        self.wins += result
317
318    def __repr__(self):
319        return "[M:" + str(self.move) + " W/V:" + str(self.wins) + "/" + str(self.visits) + " U:" + str(
320            self.untriedMoves) + "]"
321
322    def TreeToString(self, indent):
323        s = self.IndentString(indent) + str(self)
324        for c in self.childNodes:
325            s += c.TreeToString(indent + 1)
326        return s
327
328    def IndentString(self, indent):
329        s = "\n"
330        for i in range(1, indent + 1):
331            s += "| "
332        return s
333
334    def ChildrenToString(self):
335        s = ""
336        for c in self.childNodes:
337            s += str(c) + "\n"
338        return s
339
340
341def UCT(rootstate, itermax, verbose=False):
342    """ Conduct a UCT search for itermax iterations starting from rootstate.
343        Return the best move from the rootstate.
344        Assumes 2 alternating players (player 1 starts), with game results in the range [0.0, 1.0]."""
345
346    rootnode = Node(state=rootstate)
347
348    for i in range(itermax):
349        node = rootnode
350        state = rootstate.Clone()
351
352        # Select
353        while node.untriedMoves == [] and node.childNodes != []:  # node is fully expanded and non-terminal
354            node = node.UCTSelectChild()
355            state.DoMove(node.move)
356
357        # Expand
358        if node.untriedMoves != []:  # if we can expand (i.e. state/node is non-terminal)
359            m = random.choice(node.untriedMoves)
360            state.DoMove(m)
361            node = node.AddChild(m, state)  # add child and descend tree
362
363        # Rollout - this can often be made orders of magnitude quicker using a state.GetRandomMove() function
364        while state.GetMoves() != []:  # while state is non-terminal
365            state.DoMove(random.choice(state.GetMoves()))
366
367        # Backpropagate
368        while node != None:  # backpropagate from the expanded node and work back to the root node
369            node.Update(state.GetResult(
370                node.playerJustMoved))  # state is terminal. Update node with result from POV of node.playerJustMoved
371            node = node.parentNode
372
373    # Output some information about the tree - can be omitted
374    if (verbose):
375        print(rootnode.TreeToString(0))
376    else:
377        print(rootnode.ChildrenToString())
378
379    return sorted(rootnode.childNodes, key=lambda c: c.visits)[-1].move  # return the move that was most visited
380
381
382def UCTPlayGame():
383    """ Play a sample game between two UCT players where each player gets a different number 
384        of UCT iterations (= simulations = tree nodes).
385    """
386    # state = OthelloState(4) # uncomment to play Othello on a square board of the given size
387    state = OXOState() # uncomment to play OXO
388    # state = NimState(15)  # uncomment to play Nim with the given number of starting chips
389    while (state.GetMoves() != []):
390        print(str(state))
391        if state.playerJustMoved == 1:
392            m = UCT(rootstate=state, itermax=1000, verbose=False)  # play with values for itermax and verbose = True
393        else:
394            m = UCT(rootstate=state, itermax=100, verbose=False)
395        print("Best Move: " + str(m) + "\n")
396        state.DoMove(m)
397    if state.GetResult(state.playerJustMoved) == 1.0:
398        print("Player " + str(state.playerJustMoved) + " wins!")
399    elif state.GetResult(state.playerJustMoved) == 0.0:
400        print("Player " + str(3 - state.playerJustMoved) + " wins!")
401    else:
402        print("Nobody wins!")
403
404
405if __name__ == "__main__":
406    """ Play a single game to the end using UCT for both players. 
407    """
408    UCTPlayGame()
    
      #tech notes
    
  8353 words