- State heuristics are usually easier to derive by analysis because, apart from a few exceptional cases, moves are not intrinsically good or bad; a move is only good or bad in the context of a state, so any analysis must take account of both the move and the state in which it is played.
- For a player whose node structure is a graph rather than a tree (explicitly allowing for transpositions), as is the case with Sancho, different moves may lead (from different parent nodes) to the same node, which makes it harder to incorporate the heuristic result into the node itself.
However, there is one class of move heuristic that is rather generic (it works in a large number of games and can be enabled by default as part of general search). It is based on the observation that a move that is good in one state is highly likely to be good in a similar state, so it is desirable to preferentially search paths containing moves that are known to lead to good outcomes in similar states. The devil is in the detail, of course: defining what we mean by 'good', and especially what we mean by 'similar', is the main problem.
Because of the second issue above, we cannot (easily) use pre-seeded node score values as the mechanism by which the heuristic operates (that approach is not suitable when we transition into an already-expanded node, yet we still want the move heuristic to act as a selection differentiator from the particular parent we have selected through). Consequently, to enact move heuristics we explicitly modify the UCB selection formula: we add a multiplier, applied to the exploration term, to the edge representing the move. Each time we select through an edge we reduce its multiplier, so the effect decays with repeated selection and, asymptotically, convergence is the same as for basic MCTS.
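To make the mechanics concrete, here is a minimal sketch of that idea in Java. The names, constants and decay schedule are my own assumptions for the purpose of the example, not Sancho's actual code; it just shows a per-edge multiplier applied to the UCB exploration term and decayed towards 1 each time the edge is selected through.

```java
// Minimal sketch of a per-edge exploration multiplier (illustrative only; names,
// constants and the decay schedule are assumptions, not Sancho's actual code).
final class Edge {
    double explorationBoost = 1.0;   // >1 for moves the move heuristic favours
    int    numVisits        = 0;     // selections made through this edge
    double sumScores        = 0.0;   // accumulated playout scores seen via this edge
}

final class Selection {
    private static final double C           = Math.sqrt(2.0); // standard UCB exploration constant
    private static final double BOOST_DECAY = 0.9;            // assumed decay factor per selection

    // UCB1 value for an edge, with the heuristic multiplier applied to the exploration term only.
    static double ucbValue(Edge e, int parentVisits) {
        if (e.numVisits == 0) {
            return Double.MAX_VALUE;                           // always try unvisited edges first
        }
        double exploitation = e.sumScores / e.numVisits;
        double exploration  = C * Math.sqrt(Math.log(parentVisits) / e.numVisits);
        return exploitation + e.explorationBoost * exploration;
    }

    // Called when the edge is chosen during selection: decay the boost towards 1 so that,
    // asymptotically, convergence matches plain MCTS.
    static void onSelected(Edge e) {
        e.explorationBoost = 1.0 + (e.explorationBoost - 1.0) * BOOST_DECAY;
        e.numVisits++;
    }
}
```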
Techniques that select moves which have previously been seen to be good are quite common in the literature (various action-history techniques such as AMAF, RAVE and so on fall into this category - see the 'History heuristic' section of this paper). The problem is that they don't really work well in a generic GGP setting. This mainly stems from their implicit definition of 'similar' states as 'not far away in the tree'. That works fine for games where a move just modifies the position of one or two pieces, or marks/un-marks some cells, but fails utterly in games where successor states may vary wildly from their parents (e.g. CitTacEot, Pentago, Reversi, ...).
To address this deficiency we define a distance on states as the proportion of base propositions that differ between them, and 'similar' as being below a certain small distance on that measure. To quickly identify similar states we define a locality-sensitive hash on states, which hashes a state down to a small number of bits (16 in Sancho currently) with the property that states differing in a small number of base propositions will hash to values within a small Hamming distance of one another. We then use this to form a closed hash table, directly indexed by the locality-sensitive hash (hence no more than 16 bits or so is practical), which caches a small number (up to 8 currently) of node references for the states encountered during node expansion that hash to that value. When expanding a node we search all buckets within Hamming distance 1 of the hash of the parent node (so 16 buckets) and, for each cached node reference found, accrue the average score of each of its children into a Move->weight map (weighting each contribution by the log of the number of visits). The result is a mapping from move to heuristic selection value. We simply take the top few (4 currently) and boost their selection multiplier by an amount that reflects the weight. Taking just the top few seems to produce better results, allowing the search to concentrate more narrowly.
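The sketch below illustrates one possible construction of such a hash and cache. All class, method and field names are assumptions made for the example (not Sancho's actual code), and the hash construction shown (each base proposition is assigned one random bit of the 16-bit hash, with the hash being the XOR over all set propositions) is just one simple way to get the locality-sensitive property: states differing in k propositions then hash to values within Hamming distance k of one another. The lookup here probes the exact-match bucket as well as the 16 single-bit-flip neighbours.

```java
// Illustrative sketch of the locality-sensitive hash and similarity cache (assumed names,
// not Sancho's actual code).
import java.util.*;

interface Move {}
interface Edge { Move move(); int numVisits(); double averageScore(); }
interface Node { List<Edge> edges(); }

final class SimilarityCache {
    private static final int HASH_BITS   = 16;   // size of the locality-sensitive hash
    private static final int BUCKET_SIZE = 8;    // cached node references per bucket
    private static final int TOP_MOVES   = 4;    // only the strongest few moves get a boost

    private final int[] propToBit;               // hash bit assigned to each base proposition
    final Node[][] buckets = new Node[1 << HASH_BITS][BUCKET_SIZE];

    SimilarityCache(int numBaseProps, Random rnd) {
        propToBit = new int[numBaseProps];
        for (int i = 0; i < numBaseProps; i++) {
            propToBit[i] = rnd.nextInt(HASH_BITS);
        }
    }

    // Locality-sensitive hash: flipping one base proposition flips exactly one hash bit.
    int hash(BitSet state) {
        int h = 0;
        for (int p = state.nextSetBit(0); p >= 0; p = state.nextSetBit(p + 1)) {
            h ^= 1 << propToBit[p];
        }
        return h;
    }

    // Gather heuristic move weights from all cached nodes whose hash is within
    // Hamming distance 1 of this state's hash.
    Map<Move, Double> similarMoveWeights(BitSet state) {
        Map<Move, Double> weights = new HashMap<>();
        int h = hash(state);
        accrue(weights, h);                                  // exact-match bucket
        for (int bit = 0; bit < HASH_BITS; bit++) {
            accrue(weights, h ^ (1 << bit));                 // single-bit-flip neighbours
        }

        // Keep only the top few moves so the extra exploration stays narrowly focused.
        List<Map.Entry<Move, Double>> ranked = new ArrayList<>(weights.entrySet());
        ranked.sort(Map.Entry.<Move, Double>comparingByValue().reversed());
        Map<Move, Double> top = new LinkedHashMap<>();
        for (Map.Entry<Move, Double> entry : ranked.subList(0, Math.min(TOP_MOVES, ranked.size()))) {
            top.put(entry.getKey(), entry.getValue());
        }
        return top;
    }

    // Accrue each child's average score, weighted by the log of its visit count.
    private void accrue(Map<Move, Double> weights, int bucketIndex) {
        for (Node cached : buckets[bucketIndex]) {
            if (cached == null) {
                continue;
            }
            for (Edge edge : cached.edges()) {
                double contribution = edge.averageScore() * Math.log(1 + edge.numVisits());
                weights.merge(edge.move(), contribution, Double::sum);
            }
        }
    }
}
```

The map returned by a lookup like this would then be used, during expansion, to set the exploration multipliers on the edges corresponding to the top-weighted moves.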
Each time we select through a node it is re-cached, replacing the least visited node already cached in the same bucket (if that bucket is already full). This ensures that the cache always contains the most visited nodes, which are likely (due to the action of MCTS) to be the best.
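A standalone sketch of that replacement policy, under the same assumed (non-Sancho) Node interface as above:

```java
// Illustrative sketch of the re-caching step (assumed names, not Sancho's actual code):
// insert the node just selected through into its bucket, evicting the least visited
// occupant if the bucket is already full.
interface Node { int numVisits(); }

final class BucketRecache {
    static void recache(Node[] bucket, Node node) {
        int leastVisitedSlot = 0;
        for (int i = 0; i < bucket.length; i++) {
            if (bucket[i] == node) {
                return;                          // already cached in this bucket
            }
            if (bucket[i] == null) {
                bucket[i] = node;                // free slot: no eviction needed
                return;
            }
            if (bucket[i].numVisits() < bucket[leastVisitedSlot].numVisits()) {
                leastVisitedSlot = i;            // track the least visited occupant
            }
        }
        bucket[leastVisitedSlot] = node;         // bucket full: replace the least visited node
    }
}
```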
This technique shows a definite improvement in play in the Breakthrough family, and to a lesser extent in games like Pentago, without apparent harm in any of the games I have attempted to measure the impact on. Consequently (at least for now) it is unconditionally enabled in the current version of Sancho.