Sunday, June 1, 2014

Heuristics

Although Sancho is primarily an MCTS player, it is highly desirable to integrate the use of heuristics where they can be identified.  MCTS is guaranteed to converge to the optimal solution for any game (any game that can be validly expressed in GDL anyway), but that convergence might be very slow, or uneven, depending on the game in question.

Good examples are:

  • Bidding TicTacToe_10coins, which converges very unevenly (considering the opening move, it first sees the high bids as most beneficial, but as it continues to search it eventually realizes that any bid over about 4 is actually a losing move!).  MCTS scores the high-value moves in entirely the wrong direction before eventually converging on 0.
  • Breakthrough, which converges very slowly, due to moves that are not near the end of the board requiring a huge number of playouts before statistical differences start to show up between them.
The main example I'll use in this post is Breakthrough, because it has an easily discoverable heuristic (piece capture).  In the section that discusses possible future heuristic analysis I'll talk about one idea that might yield something that applies in Bidding TicTacToe, but currently that remains speculative, so I will spend less time on it.

Adding heuristics to an MCTS player presents two fairly independent problems, which I'll cover separately.

Identifying Heuristics

The first problem is the identification of suitable heuristics, by analysis rather than human specification (since this is GGP).  The overall approach we take in Sancho is to do this in two distinct phases during meta-gaming.

In the first phase we generate potential heuristics by static analysis, with the single very weak requirement that each must be calculable given a state (it needs to be cheap to calculate too, but that's a desirable property rather than a hard requirement).  In particular we impose no requirement, or even expectation, that a candidate heuristic is at all likely to be useful in the current game.  In principle generating more or less random heuristics at this stage would be fine, though we want to keep the total number of candidates down to something reasonably small (maybe a few 10s).  Having such a weak requirement means we can use extremely cursory 'analysis' to generate candidates, as will become apparent shortly!

In the second phase we play a number (typically thousands to 10s of thousands) of random games, and during those simulations we track the heuristic values of states encountered.  After the simulations are complete we then calculate the Pearson correlation coefficient between each candidate heuristic's observed values and the eventual score achieved in the simulated game instance in question.

Those heuristics that show correlation above a fixed threshold are then enabled, and will be used during play.
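
To make the second phase concrete, here is a minimal sketch (with hypothetical names - this is not the actual Sancho code) of the kind of correlation sampler involved: one (heuristic value, final score) sample is accrued per simulated game, and at the end of meta-gaming the Pearson correlation coefficient decides whether the heuristic is enabled:

    // Minimal sketch (hypothetical names): accumulate one (heuristic value, final score)
    // sample per simulated game, then compute the Pearson correlation coefficient to
    // decide whether to enable the heuristic for play.
    class HeuristicCorrelationSampler
    {
      private double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
      private int    numSamples = 0;

      // Called once per completed random game during meta-gaming.
      void accrueSample(double heuristicValue, double finalScore)
      {
        sumX  += heuristicValue;
        sumY  += finalScore;
        sumXX += heuristicValue * heuristicValue;
        sumYY += finalScore * finalScore;
        sumXY += heuristicValue * finalScore;
        numSamples++;
      }

      double getCorrelation()
      {
        double numerator   = numSamples * sumXY - sumX * sumY;
        double denominator = Math.sqrt((numSamples * sumXX - sumX * sumX)
                                     * (numSamples * sumYY - sumY * sumY));
        return (denominator == 0) ? 0 : numerator / denominator;
      }

      // Enable the heuristic only if the observed correlation exceeds a fixed threshold.
      boolean shouldEnable(double threshold)
      {
        return getCorrelation() > threshold;
      }
    }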

Experimentally we have a mobility heuristic (which is not enabled in the current version of Sancho) and a family of piece count heuristics (which together generate a combined piece heuristic if correlations are found).  The piece count candidates are extremely simplistic - approximately speaking, we just look for base propositions whose GDL sentence has an arity of at least 3 (2-dimensional board space + role) and at most 4 (one argument might be a piece type separate from the others), one of which could conceivably encode role (the domain of the variable has a cardinality that is a multiple of the role count).  Once we find such a set of propositions we create a vector of state-masks (one member for each value of the could-be-role-encoding domain) with all the bits that correspond to the non-assumed-role domain's values set.  For Breakthrough this gives you masks for all possible positions of the black pawns, and of the white pawns.  The heuristic value is then just the difference between the cardinalities of the role state masks ANDed with the current state (which is the material advantage/disadvantage).  In principle (typically in games with multiple types of pieces) there can be several candidate sets - we simply OR together those with positive correlation above the threshold to get one master piece-map.  Obviously this is crude, and a weighted set would be better for games with differing piece values, but since there aren't really any games on Tiltyard as yet where that makes a worthwhile difference, that remains a future direction for enhancement.
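
A minimal sketch of the resulting piece heuristic evaluation (hypothetical names; the real code operates on propnet state bitmaps, but plain BitSets illustrate the idea) - the value is just the difference in cardinality between the state ANDed with each role's piece mask:

    import java.util.BitSet;

    // Minimal sketch (hypothetical names): one state-mask per role covering the base
    // propositions that encode that role's piece positions.  The piece heuristic value
    // for a state is the difference in how many of each role's piece propositions are set.
    class PieceCountHeuristic
    {
      private final BitSet[] roleMasks;   // one mask per role, built during static analysis

      PieceCountHeuristic(BitSet[] roleMasks)
      {
        this.roleMasks = roleMasks;
      }

      // Material advantage of role 0 over role 1 in the given state (2-player case).
      int getValue(BitSet state)
      {
        BitSet ourPieces = (BitSet)state.clone();
        ourPieces.and(roleMasks[0]);
        BitSet theirPieces = (BitSet)state.clone();
        theirPieces.and(roleMasks[1]);
        return ourPieces.cardinality() - theirPieces.cardinality();
      }
    }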

There are many more possible candidate heuristics one could construct generically (mobility, focus, goal value for games where you get defined values in non-terminal states, ...), and I'll come back to that later when I talk about future directions.

Integrating heuristics into MCTS

The second problem is how to use heuristic values for states (or potentially moves, though that's a little different and I'll talk about that in a future post) in the context of MCTS search.

There are two obvious possible mechanisms (or strictly two families of mechanisms) to modify MCTS for heuristic signals.  These have largely been explored in the literature (this is an excellent overview of known modifications to MCTS), and work by either:
  1. Seeding the score for a node on creation (e.g. - by virtual playouts of the heuristic value); or
  2. Modifying the UCT formula to include a heuristic term (usually with progressively lowered bias as the visit count increases).  This is somewhat reliant on the convergence rate of the subtrees up-selected by the heuristic being reasonably fast, or else, even though the heuristic may point 'the right way', if the resulting score-signal is too small, noise will obliterate the benefit.
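For concreteness, a common form of the modified selection formula from the literature (progressive bias) looks something like the following - this is illustrative only, and not necessarily the exact term I experimented with:

    UCT(i) = \bar{X}_i + C\sqrt{\ln N / n_i} + H_i / (n_i + 1)

where \bar{X}_i is child i's mean score, n_i its visit count, N the parent's visit count, C the exploration constant, and H_i the heuristic value - so the heuristic term shrinks as the child accumulates real visits.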
I spent a number of weeks around the start of this year experimenting with both approaches, and found that approach (2) had a significant issue with convergence rates in some games, due to the node score and selection rate becoming less well correlated.  This means that the choice of the best move to play cannot be determined just by looking at the node score vectors; otherwise low-visit-count nodes tend to be chosen and play becomes erratic.  As the number of expansion cycles increases this becomes less of an issue, and ultimately convergence will be correct, but it was a practical issue for some of the slower games from Tiltyard.

As a result of this I settled on the first approach (though latterly we do perform some selection term processing too, although for a slightly different purpose, which I'll talk about in a future post).  The benefit is that scores and selection remain strongly correlated, and it functions well in a very slow MCTS convergence environment, because even if all playouts see essentially the same score, the heuristic signal dilutes only very slowly.

The exact mechanism we use is to have each enabled heuristic (that is, each found to correlate positively during meta-gaming) produce a modified score vector, based on the known score vector for a reference state (canonically this should probably be the parent node's state, though we currently use that of the root [i.e. - current position at start of turn] for somewhat historical reasons) and the heuristic difference between the current state and the reference state.  We also ask the heuristic for a 'strength' value (i.e. - how much does it think we should pay attention to it in the current state), and then write the computed score vector as the newly created node's score vector, and give it a virtual visit count proportional to the reported heuristic strength.  We do NOT perform virtual back-propagation, but instead modify the back-propagation update routine, so that when a node is encountered whose visit count is higher than the selection traversal count for the edge leading to it (in that selection path), the value propagated up from it is a blend of its score and the playout score, weighted by the proportion of visits that occurred along the path in question.  This technique serves to back-propagate the seeded heuristic value (progressively less as more selections through it are made), and also serves to add information to back-propagation that arises from the graph-rather-than-tree nature of our MCTS structure (which arises from transpositions).
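
Putting that together, here is a minimal sketch of one plausible reading of the mechanism (hypothetical names and structure, not the actual Sancho implementation):

    // Minimal sketch (hypothetical names; one plausible reading of the mechanism above).
    class HeuristicSeededBackprop
    {
      static final double VIRTUAL_VISIT_SCALE = 10;   // assumed scaling constant

      static class TreeNode { double[] scores; int numVisits; }
      static class TreeEdge { int numSelections; }

      // On node creation: write the heuristic-adjusted score vector and give the node a
      // virtual visit count proportional to the heuristic's reported strength.
      static void seedNewNode(TreeNode node, double[] heuristicScores, double strength)
      {
        node.scores = heuristicScores;
        node.numVisits = (int)(VIRTUAL_VISIT_SCALE * strength);
      }

      // During back-propagation: if the node has more visits than the edge leading to it
      // has selections (seeded virtual visits, or visits arriving via transpositions),
      // propagate a blend of the node's own score and the playout score, weighted by the
      // proportion of the node's visits that occurred along this path.
      static double[] valueToPropagate(TreeNode node, TreeEdge edgeUsed, double[] playoutScores)
      {
        if (node.numVisits <= edgeUsed.numSelections)
        {
          return playoutScores;                        // normal MCTS update
        }

        double pathWeight = (double)edgeUsed.numSelections / node.numVisits;
        double[] blended = new double[playoutScores.length];
        for (int i = 0; i < playoutScores.length; i++)
        {
          blended[i] = pathWeight * playoutScores[i] + (1 - pathWeight) * node.scores[i];
        }
        return blended;
      }
    }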

The strength value we use for the piece heuristic is simply inversely proportional to the size of the current difference in material - i.e. - if we're miles ahead in material we prefer not to distort MCTS search as much to try to capture yet more; and conversely if we're miles behind then material probably isn't going to be our salvation anyway, so again better to emphasize undistorted MCTS.
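
A simple functional form with that property (the precise form and scaling here are assumptions for illustration, not Sancho's actual constants) would be:

    // Sketch only: strength falls off as the absolute material difference grows, so a
    // large lead or deficit distorts the search less.
    double getHeuristicStrength(int materialDifference)
    {
      return 1.0 / (1 + Math.abs(materialDifference));
    }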

As an example of the piece heuristic in action consider this game (it can be played through by using cursor left to go back a move and cursor right to move forward).  With apologies to Richard (the unfortunate victim), I chose this game because it is a recent example of Sancho against a strong MCTS player operating only with MCTS (I believe anyway), and it illustrates the problems MCTS has with slow-convergence games like this.  What you'll note is that there is a strong tendency for MCTS players to want to simply advance their pawns, and without a notion of captures they tend to do so regardless of whether they can be captured, at least until about the 6th rank (at which point the edges of the tree become sufficiently clear, even with fairly large branching factors as here).  The main impact of having a heuristic that understands the notion of capture is not so much that it makes the player play 'well' in any deep sense, but rather that it enables it to avoid 'obvious' blunders, and to capitalize when its opponent makes those blunders.  At the current state of the art for GGP, that is quite enough!

Future Directions

This is potentially a hugely fruitful area for further exploration.  The main areas to be explored are:
  • Heuristics for moves rather than states (we already do some of this as of about a week ago, and this will be the subject of a future post)
  • Use of heuristics (move or state) in playouts to improve the (statistical) quality of playouts
  • Development of more candidate heuristics for use within the current framework
In particular I have a long list of candidate heuristics that might be employed in the current structure to try at some point:
  • Mobility.  Andrew has already implemented and experimented with this a bit.  In a few test games he found it to be useful, but mostly in games that already have the piece heuristic, and the combination did not seem to improve enough to justify use (that is, mobility was better than nothing, but it was beaten by the piece heuristic and the combined heuristics seemed to add little).  For now it remains disabled, but it is likely that it will be re-visited at some point
  • Focus (flip side of mobility, probably doesn't need a separate heuristic, just use of negative correlations above a threshold to trigger use of minus the heuristic value)
  • GDL Goal value.  Since some games define the goal value in all states, there will be some games where this correlates well.  It's trivial to implement too.
  • Distance-to-win.  There is a class of games where a win is triggered by one or more base propositions becoming true (breakthrough family, sheep&wolf as the wolf, ghost maze, ...).  In those games one can use the dependency distance matrix I discussed in a previous post to see how close a state is to a winning state (or how close a move is to generating a winning state).  In breakthrough (for example) summing this measure over the state would yield an aggregate how-advanced-up-the-board-you-are heuristic, which would tend to encourage exchanges in your half of the board (which necessarily involve opponent pieces that have expended several moves getting there, traded for pieces of yours with less invested movement), and discourage them in the opposing half.  In games of breakthrough small between good MCTS players, where these exchanges take place tends to be the game-determining factor.  A speculative sketch of this idea appears after this list.
  • Mutual-support.  Again, using the proposition distance matrix, in a game where base propositions can be tentatively assigned to a role [such as pieces] one can calculate the average distance of set base propositions (for a role) from each other.  In games like breakthrough this would give a measure of how well pieces are supporting one another.
  • Opening book patterns.  States visited in previous runs of the same game that are strongly correlated (or anti-correlated) with the eventual result could be saved in a persistently stored 'cache' and learnt over time (many plays of the same game).
  • Numeric quantity detection.  It shouldn't be terribly hard to detect the presence of numbers in the state by static analysis (e.g. - number of coins held by each role in Bidding TicTacToe).  These numbers directly provide a candidate heuristic to look for correlations of (hopefully this would show that holding a large number of coins in Bidding TicTacToe, and hence committing less in auctions, is a good idea).
  • State pattern matching.  In principle patterns within a state could be learnt over many plays of the same game.  I have in mind a genetic algorithm starting with random state bitmaps (of small cardinality).  Playing the same game repeatedly (in learning mode against itself probably), it should be possible to train up useful patterns (think diagonal relation between pawns for example).  This is more a far-future blue-sky possibility, that I don't expect to realize in the near future!
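
As a purely speculative sketch of the distance-to-win idea from the list above (hypothetical names; nothing like this is implemented yet), the heuristic might sum, over each of our piece propositions set in the state, the dependency distance to the nearest win-triggering proposition:

    // Speculative sketch (hypothetical names).  distanceMatrix[i][j] is the dependency
    // distance between base propositions i and j, ourPieceProps marks our piece
    // propositions, and winProps marks the propositions whose becoming true wins for us.
    // BitSet here is java.util.BitSet.
    static int distanceToWinScore(BitSet state, BitSet ourPieceProps,
                                  int[][] distanceMatrix, BitSet winProps)
    {
      int total = 0;
      for (int i = state.nextSetBit(0); i >= 0; i = state.nextSetBit(i + 1))
      {
        if (!ourPieceProps.get(i))  continue;   // only consider our pieces

        // Distance from this piece proposition to the nearest win-triggering proposition.
        int best = Integer.MAX_VALUE;
        for (int j = winProps.nextSetBit(0); j >= 0; j = winProps.nextSetBit(j + 1))
        {
          best = Math.min(best, distanceMatrix[i][j]);
        }
        total += best;
      }
      return -total;   // smaller total distance => more advanced => higher heuristic value
    }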

7 comments:

  1. How do you correlate the final score with the value of a heuristic at each game state? Do you consider each game state as a separate data point, or maybe average the heuristic values among all states of a given game?

  2. It depends on the heuristic. The framework makes calls into a 'Heuristic' interface (which all concrete heuristics - PieceHeuristic, MobilityHeuristic - implement) after each move and also after each game. Different heuristics can use the per-turn or game-end info in different ways. The mobility heuristic sums the mobility at each turn and at game end accrues that total to the heuristic correlation sampler (which is effectively considering the average mobility over the game). The piece heuristic currently just looks at the end-state piece count and accrues that to the sampler. Either way it's the correlation between the sampled values accrued at game end (whether they depend on intermediary states or not) and the end score that counts.

  3. About the opening book, while recognizing the game and using past very-game-specific knowledge is an interesting problem, wouldn't it be cheating in a GGP context? It would be a bit like using unlimited game start time rather than being limited by the start time set by the game server.

  4. That debate comes up repeatedly. My view is that provided all learning is by the player program, without human direction, then it is acceptable. Indeed, not permitting it would be counter-productive, since the entire area of machine learning is of great interest in AI generally. I can see that a restricted 'no learning' space is still an interesting and valid domain, but personally I find the whole subject of learning from experience in an automated manner to be very compelling as an area of study. Competitions should ideally include novel games as a counter-balance - they need not be totally dominated by them (unless they are special competitions), but should include them. This also makes the subject of automated game generation an interesting potential topic.

  5. Thanks for the write up.
    I do like being the victim! :) This is very telling http://www.ggp.org/view/tiltyard/matches/90d2b6b30b03ff3ab921b501122d4abde2160ad9/. Although I am only doing 1600 tree playouts a second.

    I was going to ask about virtual visit counts. Wouldn't adding visits mean that it is less likely to be run - when you want it to be run? Using the visit count versus the selection traversal count to modify back-propagation sounds like a neat trick.

  6. Richard - it does reduce selection in terms of the exploration term, but it pre-loads the exploitation score with an upward bias (for positive heuristic value), and once the other children get a few visits and start to 'catch up' in exploration terms (at any statistically useful layer of the tree that will happen for the low counts involved) the exploitation effect becomes dominant. To compensate it **might** be worth biasing the selection itself up too (the mechanism we have for that, which I've promised to talk about in a future post, and which is primarily used for a slightly different purpose, is much newer). At some point I may experiment with that.