Acta Polytechnica 54(5):333–340, 2014
doi:10.14311/AP.2014.54.0333
© Czech Technical University in Prague, 2014, available online at http://ojs.cvut.cz/ojs/index.php/ap

ALTERNATIVE SELECTION FUNCTIONS FOR INFORMATION SET MONTE CARLO TREE SEARCH

Viliam Lisý
Agent Technology Center, Department of Computer Science, FEE, Czech Technical University in Prague
correspondence: viliam.lisy@agents.fel.cvut.cz

Abstract. We evaluate the performance of various selection methods for the Monte Carlo Tree Search algorithm in two-player zero-sum extensive-form games with imperfect information. We compare the standard Upper Confidence Bounds applied to Trees (UCT) along with the less common Exponential Weights for Exploration and Exploitation (Exp3) and the novel Regret Matching (RM) selection in two distinct imperfect information games: Imperfect Information Goofspiel and Phantom Tic-Tac-Toe. We show that after an initial period of fast convergence towards a Nash equilibrium, UCT computes increasingly worse strategies. This is not the case with Exp3 and RM, which also show superior performance in head-to-head matches.

Keywords: Monte Carlo Tree Search, imperfect information game, selection function, Regret matching.

1. Introduction

Monte Carlo Tree Search (MCTS) is a family of sample-based tree search algorithms that has recently led to a significant improvement in the quality of state-of-the-art solvers for perfect information problems, such as the game of Go [1] or domain independent planning [2]. The main idea of Monte Carlo Tree Search is to run a large number of randomized simulations of the problem and learn the best actions to choose from this experience. The earlier simulations are generally used to create statistics that help to guide the later simulations to more important parts of the search space and to decide on the best action to take in the current state of the game. The core component of the algorithm, which determines which action to choose next and what statistics to collect, is called a selection function.

Inspired by the success of MCTS in perfect information games, the algorithm has recently also been adapted for imperfect information games [3–5]. Games with imperfect information are fundamentally more complicated than perfect information games, for several reasons. The most important complication is that the optimal strategies in imperfect information games may require the players to make randomized decisions. For example, in the well-known game of Rock-Paper-Scissors, none of the available actions can be considered optimal. Playing any single action all the time can always be exploited by the opponent, and the optimal strategy against a rational opponent is to play each action with the same probability. Another important complication is the strong inter-dependency between the strategies in different parts of the game. A player often does not know the exact state of the game during the match, and the probability of the game being in each of the possible states depends on the opponent's strategy in previous decisions in the game, which can in turn depend on the optimal strategy in any other decision in the game. Game theory provides means to deal with all these complications, but previous attempts to adapt MCTS to imperfect information games generally did not evaluate the properties of the resulting strategies from the perspective of game theory. The algorithms were developed mainly on the basis of heuristics and analogies with the perfect information case.
In this paper, we analyze various selection functions in Information Set Monte Carlo Tree Search (IS-MCTS) [3]. We show that the most commonly used selection function, UCT [6], does not allow the algorithm to converge to good strategies, and even causes the strategies to get worse with more computation time. We further evaluate two additional selection functions: Exp3 [7], which has previously been used in MCTS mainly to handle simultaneous moves [3, 8, 9], and Regret Matching [10], previously evaluated in the context of simultaneous move games [9], but never used for MCTS in generic imperfect information games. In an imperfect information variant of the game of Goofspiel and in Phantom Tic-Tac-Toe, we show that these alternative selection strategies allow IS-MCTS to converge closer to the Nash equilibrium strategy and to perform better in mutual matches.

The following section introduces the basic game-theoretic concepts necessary to describe the IS-MCTS algorithm along with the existing and novel selection functions in Section 3. Afterwards, we continue with an experimental evaluation in Section 4, and then we conclude the paper.

2. Definitions and Background

We focus on domains that can be modelled as two-player zero-sum extensive-form imperfect-information games.

2.1. Extensive form games

We adapt the definition of the Extensive-form game (EFG) from [11].

Figure 1. Example of a zero-sum imperfect-information extensive-form game.

Definition 1 (Extensive-form game). An extensive-form game with imperfect information consists of:
• N, the set of players, including the nature player (c) that represents the dynamics of the game;
• for each i ∈ N, the set A_i of actions available to player i;
• H, the set of possible states of the game, each corresponding to a history of actions performed by all players;
• Z ⊆ H, the set of terminal game states or histories;
• P : H \ Z → N, a function assigning to each non-terminal state a player that selects an action in the state;
• A : H \ Z → 2^{A_i}, a function assigning to each non-terminal game state the actions applicable by the acting player;
• T : H × A_i → H ∪ Z, the transition function realizing one move of the game;
• u_i : Z → R for each i ∈ N \ {c}, the utility functions of the players, defined only on the terminal states of the game;
• I_i, representing player i's imperfect information about the game. It is a partition of H_i = {h ∈ H : P(h) = i} whose elements are termed information sets. Each information set I ∈ I_i represents a set of histories that are indistinguishable for player i. Therefore, we naturally extend A(I) = A(h) for some h ∈ I.

The game starts with the empty history ∅. In each history h, player P(h) selects an action a ∈ A(h). After the action is performed, the game proceeds to history h′ = T(h, a), often also denoted ha. The game ends when it reaches a terminal history h ∈ Z. Each player i ∈ N receives the payoff u_i(h). A zero-sum game is a two-player game (|N \ {c}| = 2) such that for each terminal history h ∈ Z, the reward for one player is a loss for the other player (u_i(h) = −u_{−i}(h)).

Extensive-form games can be represented by a game tree, such as the one in Figure 1. The set of histories H are the nodes in the tree and Z are the leaves.
We denote the nodes where the first player selects an action (H_1) by △, and the nodes of the second player (minimizing the utility of the first player) by ▽. In this example, the action sets of the players are A_1 = {a, b, . . . , h} and A_2 = {A, B, . . . , J}. We denote the information sets by the ellipses around the tree nodes. After the history h = aA, player △ decides about the next move from A(h) = {c, d}, but in the game he has exactly the same information as if h′ = aB were the current state of the game. The numbers in the leaves denote the utility of the first player.

A pure strategy in an extensive-form game is a function that assigns to each information set I ∈ I_i of player i an action from A(I). Each pair of pure strategies then naturally defines the utility function as the expected value of playing this pair of strategies. The expectation is taken over the actions in the chance nodes, in which the action is selected based on a commonly known probability distribution. The set of mixed strategies of an extensive-form game is defined as the set of all probability distributions over the pure strategies. In games where players do not forget the actions they performed, a theorem by Kuhn [12] allows us to use the following equivalent, but much more compact, representation of the mixed strategies. A behavioral strategy assigns to each information set a probability distribution over the available actions. Unless specified otherwise, by a strategy in an extensive-form game we mean a behavioral strategy.

We denote the set of all mixed strategies of player i ∈ N by Σ_i. We term a vector of one strategy for each player σ ∈ Π_{i∈N} Σ_i a strategy profile, and we denote by σ_i the strategy of player i and by σ_{−i} the strategy of the opponent of i. ∆(A_i) is the set of all probability distributions over A_i. u_i can be naturally extended to mixed strategies as the expectation over the pure strategies.

Definition 2 (Best response). For a strategy profile σ, we define a best response of player i to the opponent's strategy σ_{−i} as a strategy br(σ_{−i}) ∈ arg max_{σ′_i ∈ ∆(A_i)} u_i(σ′_i, σ_{−i}).

One of the most fundamental solution concepts in game theory is the Nash equilibrium [11].

Definition 3 (Nash equilibrium). A strategy profile σ* ∈ Σ is a Nash equilibrium if u_i(σ_i, σ*_{−i}) ≤ u_i(σ*) for all i ∈ N and all σ_i ∈ Σ_i.

In words, in a Nash equilibrium, each player plays a best response to the strategies of the other players. In zero-sum games, a Nash equilibrium is a very appealing strategy to play. It prescribes a (generally randomized) strategy that is optimal in several aspects. It is a strategy that gains the highest expected reward against its worst-case opponent, even if the opponent knows the strategy played by the player in advance. Moreover, in the zero-sum setting, even if the opponent does not play rationally, the strategy still guarantees at least the reward it would have gained against the rational opponent. This guaranteed reward is termed the value of the game.

The distance of a strategy from a Nash equilibrium can be measured in terms of exploitability (e.g., [13]).

Definition 4 (Exploitability). The exploitability of strategy σ_i is expl(σ_i) = v_i − u_i(σ_i, br(σ_i)), where v_i is the value of the game for player i. The exploitability of strategy profile σ is expl(σ) = u_i(σ_i, br(σ_i)) − u_i(br(σ_{−i}), σ_i).

In this paper, we compute the exploitability using the best response function described in [14].
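To make Definition 4 concrete, the following minimal Python sketch (illustrative only, not part of the paper's codebase [20] or of the best response function of [14]) computes the exploitability of the row player's mixed strategy in Rock-Paper-Scissors, the game mentioned in the Introduction; the value of the game is 0, so any pure strategy has exploitability 1 and the uniform equilibrium strategy has exploitability 0.

import numpy as np

# Payoff matrix of Rock-Paper-Scissors for the row player (zero-sum game):
# rows = row player's action, columns = opponent's action.
U = np.array([
    [ 0, -1,  1],   # Rock   vs. Rock, Paper, Scissors
    [ 1,  0, -1],   # Paper
    [-1,  1,  0],   # Scissors
])

def exploitability(sigma_row, game_value=0.0):
    """expl(sigma_i) = v_i - u_i(sigma_i, br(sigma_i)) from Definition 4.

    The best-responding opponent picks the column that minimizes the row
    player's expected payoff against the fixed mixed strategy sigma_row.
    """
    expected_payoffs = sigma_row @ U              # row player's payoff for each opponent action
    value_against_best_response = expected_payoffs.min()
    return game_value - value_against_best_response

print(exploitability(np.array([1.0, 0.0, 0.0])))  # always Rock -> 1.0
print(exploitability(np.array([1/3, 1/3, 1/3])))  # equilibrium -> 0.0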
3. Information Set Monte Carlo Tree Search

Information Set Monte Carlo Tree Search (IS-MCTS) is a Monte Carlo tree search variant for imperfect information games. Fundamentally very similar algorithms have previously been formulated in two different ways. The formulation that is easier to understand is presented in [4]. The MCTS iterations are performed on the complete game tree of the extensive-form game (i.e., all possible states of the game). However, the statistics for the selection algorithm, such as UCT, are collected for the whole information set. If any of the nodes in the information set is reached, the selection algorithm is used to select the next move and, subsequently, the statistics stored by this algorithm are updated by the result of the simulation. In the tree expansion step, the authors in [4] suggest adding a single node to the extensive-form tree of the game. A very similar algorithm is presented in [15], in [3] as "Multiple-Observer Information Set Monte Carlo tree search", and in [5] as "Multiple Monte Carlo Tree Search".

IS-MCTS(h)
1: if h is terminal (h ∈ Z) then return u_1(h)
2: if h is a chance node (P(h) = c) then
3:    a = random chance action from A_c(h)
4:    return IS-MCTS(T(h, a))
5: if h not in memory then add h to memory
6: IS = information set for state h
7: if IS is not null then
8:    a = select(IS)
9:    v = IS-MCTS(T(h, a))
10: else
11:    add new IS for h to memory
12:    a = select(IS)
13:    v = simulate(T(h, a))
14: update(IS, a, v)
15: return v
Algorithm 1. Information Set Monte Carlo Tree Search.

The pseudo-code for a single iteration of the algorithm is presented in Algorithm 1. It starts in the root node of the game tree and descends the game tree towards a terminal state. The tree nodes are stored in memory and common statistics are stored for all nodes in each information set. If the function is called with a terminal state, it just returns its utility for the first player (line 1). If nature selects an action in the current node (line 2), the action is selected from the commonly known nature distribution and the algorithm is called recursively on the resulting state (line 4). Otherwise, the node is added to memory if it is not already present (line 5). The statistics for the information set it belongs to are accessed (line 6) and used to make the decision about the action to select (line 8). This action is then executed on the game state, producing the following game state, which is used in a recursive call of the function (line 9). This process continues until a state belonging to a new information set is reached (line 10). This is the end of the selection stage, and the expansion consists of creating new (empty) statistics for the information set (line 11). The specific data structures depend on the selection function. At this point, an action is selected by the selection function and a simulation is executed to estimate the quality of the following position (lines 12–13). The information sets accessed during the iteration are updated by the result of the simulation (line 14) when returning from the recursion, and the next iteration can start. Iterations are repeated until a given time budget is spent.

The functions select and update in the pseudo-code can be implemented by a suitable selection function, such as UCT (with the negative of the received value for the opponent), and simulate can be either completely random, or a domain-dependent simulation, as in perfect information MCTS.
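To illustrate Algorithm 1, the following Python sketch shows one possible shape of the iteration. The game-state interface (is_terminal, current_player, chance_action, actions, apply, information_set_key, utility) and the selector interface (select, update) are hypothetical placeholders for this illustration, not the API of the codebase used in the paper [20].

import random

class ISMCTS:
    """Minimal sketch of Algorithm 1 with a pluggable selection function.

    Assumes a hypothetical game-state object h and a selector_factory that
    builds the per-information-set statistics (UCT, Exp3 or RM; Section 3.1).
    """

    def __init__(self, selector_factory):
        self.selector_factory = selector_factory
        self.infosets = {}                      # information-set key -> selector statistics

    def iteration(self, h):
        if h.is_terminal():                     # line 1: utility of the first player
            return h.utility(player=0)
        if h.current_player() == "chance":      # lines 2-4: sample nature's action
            return self.iteration(h.apply(h.chance_action()))

        key = h.information_set_key()           # line 6: statistics shared by the whole information set
        selector = self.infosets.get(key)
        if selector is not None:                # lines 7-9: selection stage
            a = selector.select()
            v = self.iteration(h.apply(a))
        else:                                   # lines 10-13: expansion followed by a simulation
            selector = self.selector_factory(h.actions())
            self.infosets[key] = selector
            a = selector.select()
            v = self.simulate(h.apply(a))
        # line 14: back-propagate the sampled value; a selector belonging to the
        # minimizing player would be updated with -v (omitted here for brevity).
        selector.update(a, v)
        return v

    def simulate(self, h):
        """Completely random play-out to a terminal state."""
        while not h.is_terminal():
            h = h.apply(random.choice(h.actions()))
        return h.utility(player=0)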
In the following subsection, we discuss possible selection functions to be used with the algorithm.

During an actual match, the player using this algorithm does not always start its iterations from the root state of the game; rather, it maintains the current information set in the form of a collection of all states that can be the current state of the game. After the player using IS-MCTS selects an action, it applies the action to all of these states to determine the possible following states. When the opponent executes some action in the game, the player executes all of the opponent's applicable actions in the current information set and keeps only the resulting states that generate the same observations for the player. In the search from an inner information set during a match, the player chooses for each iteration the initial state uniformly at random from all states in the current information set [3].

3.1. Selection Functions

3.1.1. Upper Confidence Bounds

The selection function that was suggested for IS-MCTS in both [4] and [3] is the same modification of UCB1 [16] that was successful in perfect information MCTS [6]. We present the algorithm in Algorithm 2 and further refer to it as UCT.

Input: K – number of actions; C – exploration parameter
1: ∀i: n_i = 0, x̄_i = 0
2: for n = 1, 2, . . . do
3:    i = arg max_i ( x̄_i + C · sqrt(2 ln n / n_i) )
4:    Use action i and receive reward r
5:    x̄_i = (n_i · x̄_i + r) / (n_i + 1)
6:    n_i = n_i + 1
Algorithm 2. UCT: Upper Confidence Bounds algorithm for selection in Monte Carlo Tree Search.

The algorithm maintains the mean of the rewards received for each action, x̄_i, and the number of times each action has been used, n_i. It first uses each of the actions once (the term with zero in the denominator is defined as ∞) and then decides which action to use based on the size of the one-sided confidence interval on the reward, computed based on the Chernoff-Hoeffding bounds (line 3). We follow the suggestion from [17] and break ties on line 3 randomly.

The strategy output as the solution for the information set after all simulations are performed is the vector of action use counts n_i normalized to sum to one.
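For concreteness, a possible rendering of Algorithm 2 as a selector compatible with the IS-MCTS sketch above; the class name, the select/update interface, and the final_strategy helper are our own illustration, with the random tie-breaking and the normalized visit counts as described in the text. The default C = 2 matches the hand-tuned value used later in Section 4.

import math
import random

class UCTSelector:
    """Sketch of Algorithm 2: UCB1-style selection with random tie-breaking."""

    def __init__(self, actions, C=2.0):
        self.actions = list(actions)
        self.C = C                                   # exploration parameter
        self.n = {a: 0 for a in self.actions}        # visit counts n_i
        self.xbar = {a: 0.0 for a in self.actions}   # mean rewards x̄_i
        self.total = 0                               # overall number of visits n

    def _ucb(self, a):
        if self.n[a] == 0:                           # unvisited action: value defined as infinity
            return float("inf")
        return self.xbar[a] + self.C * math.sqrt(2.0 * math.log(self.total) / self.n[a])

    def select(self):
        best = max(self._ucb(a) for a in self.actions)
        return random.choice([a for a in self.actions if self._ucb(a) == best])

    def update(self, a, r):
        self.xbar[a] = (self.n[a] * self.xbar[a] + r) / (self.n[a] + 1)
        self.n[a] += 1
        self.total += 1

    def final_strategy(self):
        """Visit counts normalized to sum to one (the strategy UCT outputs)."""
        return {a: self.n[a] / max(1, self.total) for a in self.actions}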
3.2. Exponential Weights for Exploration and Exploitation

UCT is a successful selection function for perfect information problems, but it has been shown to converge to an exploitable strategy in a simultaneous move game [18], which is a special case of imperfect information games. Therefore, [3, 8] and [5] propose the use of an alternative selection function, Exp3, which can be modified to guarantee convergence to the optimal solution in single-stage simultaneous move games.

Input: K – number of actions; γ – exploration parameter
1: ∀i: x̂_i = 0
2: for t = 1, 2, . . . do
3:    ∀i: p_i = exp((γ/K) · x̂_i) / Σ_{j=1}^{K} exp((γ/K) · x̂_j)
4:    p′_i = (1 − γ) · p_i + γ/K
5:    Use action I_t drawn from distribution p′ and receive reward r
6:    x̂_{I_t} = x̂_{I_t} + r / p′_{I_t}
Algorithm 3. Exp3: Exponential Weights for Exploration and Exploitation algorithm for selection in MCTS.

Exp3 stores the estimates of the cumulative reward of each action over all iterations, even for iterations in which the action was not selected. In the pseudo-code in Algorithm 3, we denote this value for action i by x̂_i. It is initially set to 0 on line 1. In each iteration, a probability distribution p is created proportionally to the exponential of these estimates. The distribution is combined with a uniform distribution with probability γ to ensure sufficient exploration of all actions (line 4). After an action is selected and the reward is received from the recursive call of IS-MCTS, the estimate for the performed action is updated using importance sampling (line 6): the reward is weighted by one over the probability of using the action, in order to reach the correct value in expectation.

The strategy output as the solution for the information set after all simulations are performed is the mean of the strategies p over all iterations. In the implementation of the algorithm, we use the numerically more stable form of the equation on line 3 proposed in [3].

3.3. Regret Matching

Regret matching is a general procedure originally developed for playing known general-sum matrix games in [10]. The algorithm computes, for each action in each step, the regret for not having played another fixed action every time the action has been played in the past. The action to be played in the next round is selected randomly with probability proportional to the positive portion of the regret for not playing that action. The regret matching procedure in [10] requires exact information about the utility values in the game matrix as well as the action selected by the opponent in each step. In [19], the authors relax these requirements: instead of computing the exact values of the regrets, the regrets are estimated in a similar way as the cumulative rewards in Exp3. As a result, the modified regret matching procedure can be used as the selection function in IS-MCTS. We present the algorithm in Algorithm 4.

Input: K – number of actions; γ_t – non-increasing sequence of real numbers
1: ∀i: R_i = 0
2: for t = 1, 2, . . . do
3:    ∀i: R⁺_i = max{0, R_i}
4:    if Σ_{j=1}^{K} R⁺_j = 0 then
5:       ∀i: p_i = 1/K
6:    else
7:       ∀i: p_i = R⁺_i / Σ_{j=1}^{K} R⁺_j
8:    p′_i = (1 − γ) · p_i + γ/K
9:    Use action I_t drawn from distribution p′ and receive reward r
10:   ∀i: R_i = R_i − r
11:   R_{I_t} = R_{I_t} + r / p′_{I_t}
Algorithm 4. RM: Regret matching variant for selection in MCTS.

The algorithm stores the estimates of the regrets for not taking action i in all time steps in the past in variables R_i. On lines 3–7, it computes the strategy for the current time step proportionally to the positive part of the regrets. The uniform strategy is added, similarly to the case of Exp3, on line 8. This ensures exploration and causes the addition on line 11 to be bounded. Lines 10 and 11 update the cumulative regrets using importance sampling.

The main computational advantage of this procedure over Exp3 is that it requires only simple division instead of computing the expensive exponential function for each action in each iteration. When used in an MCTS algorithm, this allows substantially more iterations to be performed in the same time budget. The strategy output as the solution for the information set after all simulations are performed is the mean of the strategies p over all iterations.
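A corresponding sketch of Algorithms 3 and 4 in the same hypothetical selector interface as above; the numerical-stability trick in the Exp3 policy follows the remark about line 3 in the text, and γ = 0.1 matches the hand-tuned value used in Section 4. All names and helpers are our own illustration, not the paper's implementation [20].

import math
import random

class Exp3Selector:
    """Sketch of Algorithm 3: exponentially weighted selection with importance sampling."""

    def __init__(self, actions, gamma=0.1):
        self.actions = list(actions)
        self.gamma = gamma
        self.xhat = {a: 0.0 for a in self.actions}    # cumulative reward estimates x̂_i
        self.mean_p = {a: 0.0 for a in self.actions}  # running sum of the strategies p
        self.t = 0
        self._last_p = None                           # exploration-mixed distribution p' of the last step

    def select(self):
        K, g = len(self.actions), self.gamma
        m = max(self.xhat.values())                   # subtract the maximum for numerical stability
        w = {a: math.exp((g / K) * (self.xhat[a] - m)) for a in self.actions}
        s = sum(w.values())
        p = {a: w[a] / s for a in self.actions}                              # line 3
        self._last_p = {a: (1 - g) * p[a] + g / K for a in self.actions}     # line 4
        for a in self.actions:
            self.mean_p[a] += p[a]
        self.t += 1
        return random.choices(self.actions, weights=[self._last_p[a] for a in self.actions])[0]

    def update(self, a, r):
        self.xhat[a] += r / self._last_p[a]           # line 6: importance-sampled estimate

    def final_strategy(self):
        """Mean of the strategies p over all iterations (the strategy Exp3 outputs)."""
        return {a: self.mean_p[a] / max(1, self.t) for a in self.actions}


class RMSelector:
    """Sketch of Algorithm 4: regret matching with exploration and importance sampling."""

    def __init__(self, actions, gamma=0.1):
        self.actions = list(actions)
        self.gamma = gamma
        self.R = {a: 0.0 for a in self.actions}       # estimated cumulative regrets R_i
        self.mean_p = {a: 0.0 for a in self.actions}
        self.t = 0
        self._last_p = None

    def select(self):
        K, g = len(self.actions), self.gamma
        plus = {a: max(0.0, self.R[a]) for a in self.actions}                # line 3
        s = sum(plus.values())
        p = {a: (plus[a] / s if s > 0 else 1.0 / K) for a in self.actions}   # lines 4-7
        self._last_p = {a: (1 - g) * p[a] + g / K for a in self.actions}     # line 8
        for a in self.actions:
            self.mean_p[a] += p[a]
        self.t += 1
        return random.choices(self.actions, weights=[self._last_p[a] for a in self.actions])[0]

    def update(self, a, r):
        for b in self.actions:                        # line 10: subtract the received reward from all regrets
            self.R[b] -= r
        self.R[a] += r / self._last_p[a]              # line 11: add back the importance-sampled reward

    def final_strategy(self):
        """Mean of the strategies p over all iterations (the strategy RM outputs)."""
        return {a: self.mean_p[a] / max(1, self.t) for a in self.actions}

In the IS-MCTS sketch from Section 3, such a selector would be supplied as, e.g., ISMCTS(lambda actions: RMSelector(actions, gamma=0.1)); again, this is only an illustration of how the pieces fit together.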
4. Experimental Evaluation

This section first presents the two imperfect information games that we use in the evaluation. Afterwards, we analyze how the speed of convergence of IS-MCTS and the eventual distance from a Nash equilibrium depend on the selection function used. Finally, we evaluate how these convergence properties translate to the quality of actual game playing. Throughout this section, we use hand-tuned values of the exploration parameters: C = 2 for UCT and γ = 0.1 for Exp3 and RM. All experiments were performed in a unified publicly available codebase [20].

4.1. Experimental Domains

Goofspiel. Goofspiel is a simultaneous move card game often used to evaluate AI algorithms. The game is played with three identical decks of cards. Each deck contains cards of values {0, . . . , (n−1)} and belongs to one of the players, including nature. The deck for nature is shuffled at the beginning of the game. In each round, nature reveals the top card from its deck. Each player selects any of their remaining cards and places it face down on the table so that the opponent does not see the card. Afterwards, the cards are turned face up and the player with the higher card wins the card revealed by nature. The card is discarded in case of a draw. At the end, the player with the higher sum of nature cards wins the whole game. In the results, we use utilities 1/0/−1 for win/draw/loss and count a draw as half a win and half a loss in the win rates. For the evaluation, we use the imperfect information variant of this game introduced by Lanctot [21]. It introduces two modifications. First, in each round, the players only learn who won or lost the round, but not the bid played by the opponent. Second, both players know that the cards in nature's deck are sorted in decreasing order.

Phantom Tic-Tac-Toe. Phantom Tic-Tac-Toe is a blind variant of the well-known game of Tic-Tac-Toe. The game is played on a 3 × 3 board, where two players (cross and circle) attempt to place 3 identical marks in a horizontal, vertical, or diagonal row. The player who achieves this goal first wins the game. In the blind variant, the players are unable to observe their opponent's moves; each player only knows that the opponent made a move and that it is her turn. If a player tries to place her mark on a square that is already occupied by an opponent's mark, the player learns this information and can place the mark in some other square. The uncertainty in Phantom Tic-Tac-Toe makes the game large (≈ 10^10 nodes [22]). In addition, since one player can try several squares before a move succeeds, the players do not necessarily alternate in making their moves. This rule makes the structure of the information sets rather complex, and since a player never learns how many attempts the opponent actually made, a single information set can contain nodes at different depths of the game tree.

4.2. Convergence to Nash Equilibrium

First, we focus on the speed of convergence in a small variant of Goofspiel with 6 cards (0, 1, . . . , 5). We measure the ability of the algorithms to approximate the Nash equilibrium strategies of the complete game by the sum of the exploitabilities of both players' strategies (see Section 2). We compare the IS-MCTS algorithm with three different selection functions: UCT, Exp3, and RM. The results are the means of 20 runs of each algorithm.

Due to the different selection and update functions, the algorithms differ in the number of iterations per second. RM is the fastest, with more than 5.9 × 10^4 iterations per second, while UCT, with computing the square roots and random tie breaking, has around 3.9 × 10^4 iterations per second, and Exp3, computing the exponential functions, has around 3.4 × 10^4 iterations per second. Figure 2 presents the exploitability of the algorithms run from the root state for 30 minutes. Note that the x-scale is logarithmic. The exploitability of UCT starts decreasing fairly quickly, but after approximately 20 seconds of computation it starts increasing again. The lowest error achieved by UCT is 0.39, reached after 25 seconds of computation.
The variants of the IS-MCTS algorithm with Exp3 and RM converge more slowly at the beginning, but eventually achieve a smaller error than UCT. After 500 seconds of computation, the error of Exp3 is 0.41 and the error of RM is 0.27.

Figure 2. Convergence in the game root in Imperfect Information Goofspiel with 6 cards.

The game of Phantom Tic-Tac-Toe has approximately 10^10 terminal states, which makes it difficult to compute the exploitability of the strategies in the whole game. Therefore, we initially focus on a simpler version of the game with the first move enforced to be to the square in the center of the board. The first player has only this action available, and the second player can also first play only this action, which reveals the position of the first player's mark and allows the second player to move again. The second move of the second player and all the following moves can then be any legal moves in Phantom Tic-Tac-Toe.

The number of iterations per second in this restricted game is similar to the previous game. RM makes the most iterations (9.5 × 10^4), because the game initially has a higher number of applicable actions and RM uses the simplest functions to compute the probability of selecting each action. UCT performs significantly fewer iterations (5.1 × 10^4), which is probably caused by the actions often having very similar values; UCT then needs to iterate through the actions multiple times to select an action. Exp3 is again the slowest, with 3.4 × 10^4 iterations per second.

The convergence of the algorithms run from the root of the game is presented in Figure 3. UCT first quickly reduces the exploitability of the strategy and then, after 10 seconds of computation, gradually makes the strategy more exploitable. The minimum exploitability achieved by UCT is 0.27. Shortly after 10 seconds of computation, RM, initially the worst, becomes the best algorithm and, after the full 30 minutes, it converges to an exploitability of 0.13. Exp3 is clearly the worst algorithm between the first and the hundredth second.

Figure 3. Convergence and game playing comparison of different algorithms on Phantom Tic-Tac-Toe, in which the first move of each player is forced to be to the middle square.

4.3. Head to Head Matches

After analyzing the convergence of the strategies computed by the sampling algorithms, we evaluate how this property translates to the actual performance of the algorithms in head-to-head matches. All presented results are averages of 1000 matches.
We first evaluate the small game used for the computation of the exploitability in the previous section, to see the connection to the exploitability results, and then we focus on substantially larger games to evaluate practical applicability.

0.1 s    UCT         EXP3        RM          RAND
UCT      62.9 (2.6)  55.6 (2.1)  62.7 (2.5)  84.0 (2.1)
EXP3     74.5 (2.2)  62.8 (1.6)  74.8 (2.0)  88.0 (1.9)
RM       70.6 (1.7)  60.0 (1.6)  73.1 (2.1)  87.8 (1.4)
RAND     15.7 (2.2)   8.2 (1.6)  10.3 (1.8)  49.7 (3.0)

1 s      UCT         EXP3        RM          RAND
UCT      61.5 (2.7)  54.0 (2.1)  64.5 (2.5)  84.2 (2.1)
EXP3     74.5 (2.2)  61.8 (1.5)  75.7 (2.0)  87.8 (1.9)
RM       72.8 (2.3)  59.2 (1.7)  75.2 (2.1)  88.8 (1.9)
RAND     14.9 (2.1)   8.7 (1.7)  10.6 (1.8)  48.7 (3.0)

Table 1. Win rates of the row player against different algorithms in Imperfect Information Goofspiel with 6 cards, with 0.1 (top) and 1 (bottom) second of computation per move. The number in brackets indicates the 95 % confidence interval.

Table 1 presents the win rates of the algorithms in mutual matches in Imperfect Information Goofspiel with 6 cards per deck. The table at the top presents the results with 0.1 second per move and the table at the bottom presents the results with 1 second per move; they are very similar. The first important observation is that even though the game is symmetric and the first player does not have any advantage, all IS-MCTS variants perform much better as the first player (rows). The reason is the asymmetry of the game model in the form of an EFG. Even though in reality the players choose an action simultaneously, the game models this fact as a sequential decision with hidden information. As a result, the second player has substantially larger information sets than the first player. If the search is run from a larger information set, it is generally more important to have a good approximation of the probability of the individual states being the actual state of the game. The second player is at a large disadvantage here.

IS-MCTS with Exp3 wins the most from the first position and loses the least from the second position. This is a surprising result, as Exp3 has the slowest convergence. Apparently, it quickly reaches a strategy that is exploitable, but performs well against the other opponents in the tournament.

1 s      UCT         EXP3        RM          RAND
UCT      70.0 (2.8)  67.7 (2.9)  61.0 (3.0)  90.6 (1.8)
EXP3     79.5 (2.5)  63.2 (2.8)  66.4 (2.9)  95.6 (1.2)
RM       73.8 (2.7)  68.2 (2.8)  67.5 (2.9)  92.8 (1.5)
RAND      6.2 (1.5)   4.0 (1.2)   4.9 (1.3)  49.0 (3.0)

Table 2. Win rate of different algorithms on Imperfect Information Goofspiel with 13 cards. The number in brackets indicates the 95 % confidence interval.

The game playing performance of the algorithms in the large game of Imperfect Information Goofspiel, with 13 cards in each deck and one second of computation per move, is presented in Table 2. In the large game, the performance of the IS-MCTS variants is similar to the performance on the smaller version. Exp3 wins significantly the most against UCT, but RM loses less against UCT and against itself from the second position. This makes it hard to decide which of Exp3 and RM is the better selection function, but both of them are clearly better than UCT in this game.

Since all the algorithms seem to perform very well even in the complete game of Phantom Tic-Tac-Toe, we do not present the evaluation on the smaller game with the enforced first move here. We assume that the differences between the algorithms would be even smaller there.

0.1 s    UCT         EXP3        RM          RAND
UCT      85.9 (1.7)  84.8 (1.7)  82.3 (1.8)  95.6 (1.1)
EXP3     87.2 (1.6)  88.1 (1.5)  85.3 (1.6)  96.6 (1.0)
RM       87.7 (1.6)  88.1 (1.4)  85.2 (1.5)  96.0 (1.0)
RAND     50.5 (3.0)  46.1 (3.0)  48.0 (3.0)  71.4 (2.7)

1 s      UCT         EXP3        RM          RAND
UCT      86.5 (1.8)  86.3 (1.7)  84.0 (1.8)  95.2 (1.1)
EXP3     88.5 (1.6)  88.5 (1.5)  87.0 (1.6)  96.2 (1.0)
RM       90.0 (1.4)  87.4 (1.5)  85.2 (1.5)  96.3 (1.0)
RAND     52.0 (3.0)  48.9 (3.0)  46.2 (2.9)  70.9 (2.7)

Table 3. Win rate of different algorithms on full Phantom Tic-Tac-Toe, with 0.1 (top) and 1 (bottom) second of computation per move. The number in brackets indicates the 95 % confidence interval.
The results in Table 3 show an even larger imbalance between the results achieved from the first and the second position in the game. This time, it is not caused only by the disadvantage of larger information sets, but also by the fact that the game itself is not balanced. When both players play the optimal strategy, the player who moves first wins 83 % of the games. We computed this value using the double-oracle algorithm [14] implemented in our framework.

With 0.1 second per move (top), the performance of RM and Exp3 is practically identical from the first position, but RM loses less from the second position, which is consistent with its generally lower exploitability across the experiments. UCT generally performs the worst. With 1 second per move, the performance of all the algorithms is more similar. RM still wins most often and loses least often against UCT, but the mutual matches with Exp3 are more balanced: each of the two algorithms wins 87 % of the matches it plays from the first position against the other. However, when the algorithms play against themselves, RM loses less from the second position, which makes it the more suitable algorithm for this setting.

5. Conclusions

We have studied the influence of selection functions on the performance of Monte Carlo Tree Search in imperfect information games. We have evaluated the standard UCT selection along with the less common Exp3 and the novel Regret Matching selection in a unified framework on two different games. To the best of our knowledge, this is the first direct comparison of the effect of selection functions on the performance of MCTS in imperfect information games.

We show that none of the evaluated selection functions allows the algorithm to converge very close to the Nash equilibrium; even after 30 minutes of computation (approximately 10^8 iterations), the distance from an equilibrium was still larger than 0.1. Consistently in both evaluation domains, the exploitability of the strategy produced with UCT first decreased and then started to increase again, showing that IS-MCTS with UCT does not have the desirable anytime property that more computation time yields better results. This is also the case with Exp3 at the very beginning, but the novel RM selection converges monotonically towards an equilibrium in our experiments. The superior performance of RM selection was also confirmed in head-to-head matches among the algorithms. In both domains used for the evaluation, UCT performed the worst of the evaluated selection functions. RM was always among the best, but in some cases it was slightly outperformed by Exp3. The good performance of Exp3 in the matches was not supported by low exploitability in the convergence experiments, which indicates that even though the algorithm performed well against the tested opponents, there are other opponents that are likely to perform much better against this algorithm, but not against RM.

Acknowledgements

This work was supported by the Czech Science Foundation (grant P202/12/2054). Access to the computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the "Projects of Large Infrastructure for Research, Development, and Innovations" programme (LM2010005), is highly appreciated.

References

[1] S. Gelly, D. Silver. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence 175(11):1856–1875, 2011. doi:10.1016/j.artint.2011.03.007.

[2] T.
Keller, P. Eyerich. PROST: Probabilistic planning based on UCT. In ICAPS, 2012.

[3] P. I. Cowling, E. J. Powley, D. Whitehouse. Information set Monte Carlo tree search. IEEE Transactions on Computational Intelligence and AI in Games 4(2):120–143, 2012. doi:10.1109/TCIAIG.2012.2200894.

[4] M. Ponsen, S. de Jong, M. Lanctot. Computing approximate Nash equilibria and robust best-responses using sampling. Journal of Artificial Intelligence Research 42:575–605, 2011. doi:10.1613/jair.3402.

[5] D. Auger. Multiple tree for partially observable Monte-Carlo tree search. In Applications of Evolutionary Computation, pp. 53–62. Springer, 2011.

[6] L. Kocsis, C. Szepesvári. Bandit based Monte-Carlo planning. In J. Fürnkranz, T. Scheffer, M. Spiliopoulou (eds.), Machine Learning: ECML 2006, vol. 4212 of Lecture Notes in Computer Science, pp. 282–293. Springer Berlin Heidelberg, 2006. doi:10.1007/11871842_29.

[7] P. Auer, N. Cesa-Bianchi, Y. Freund, R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1):48–77, 2003. doi:10.1137/S0097539701398375.

[8] O. Teytaud, S. Flory. Upper confidence trees with short term partial information. In Applications of Evolutionary Computation, vol. 6624 of Lecture Notes in Computer Science, pp. 153–162. Springer Berlin Heidelberg, 2011. doi:10.1007/978-3-642-20525-5_16.

[9] M. Lanctot, V. Lisý, M. H. M. Winands. Monte Carlo tree search in simultaneous move games with applications to Goofspiel. In Computer Games Workshop at IJCAI 2013, vol. 408 of Communications in Computer and Information Science (CCIS), pp. 28–43. Springer, 2014.

[10] S. Hart, A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150, 2000. doi:10.1111/1468-0262.00153.

[11] M. Osborne, A. Rubinstein. A Course in Game Theory. MIT Press, 1994.

[12] H. W. Kuhn. Extensive games and the problem of information. Contributions to the Theory of Games 2(28):193–216, 1953.

[13] M. Johanson, K. Waugh, M. Bowling, M. Zinkevich. Accelerating best response calculation in large extensive games. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence (IJCAI), Volume One, pp. 258–265. AAAI Press, 2011. doi:10.5591/978-1-57735-516-8/IJCAI11-054.

[14] B. Bosansky, C. Kiekintveld, V. Lisy, et al. Double-oracle algorithm for computing an exact Nash equilibrium in zero-sum extensive-form games. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS), pp. 335–342, 2013.

[15] V. Lisy, R. Pibil, J. Stiborek, et al. Game-theoretic approach to adversarial plan recognition. In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI), pp. 546–551, 2012. doi:10.3233/978-1-61499-098-7-546.

[16] P. Auer, N. Cesa-Bianchi, P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235–256, 2002. doi:10.1023/A:1013689704352.

[17] C. B. Browne, E. Powley, D. Whitehouse, et al. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43, 2012. doi:10.1109/TCIAIG.2012.2186810.

[18] M. Shafiei, N. Sturtevant, J. Schaeffer. Comparing UCT versus CFR in simultaneous games. In IJCAI Workshop on General Game Playing, 2009.

[19] S. Hart, A. Mas-Colell. A reinforcement procedure leading to correlated equilibrium. Springer Berlin Heidelberg, 2001. doi:10.1007/978-3-662-04623-4_12.

[20] Agent Technology Center. Computational game theory, 2014.
http://agents.felk.cvut.cz/topics/Computational_game_theory [2014-09-29].

[21] M. Lanctot. Monte Carlo Sampling and Regret Minimization for Equilibrium Computation and Decision Making in Large Extensive-Form Games. Ph.D. thesis, University of Alberta, 2013.

[22] M. Lanctot, R. Gibson, N. Burch, et al. No-regret learning in extensive-form games with imperfect recall. In ICML, pp. 1–21, 2012. arXiv:1205.0622v1.