Tuesday, June 17, 2014

Sancho goes green

In this post, I’ll describe how Sancho gained a significant improvement in performance - by going green.

Reduce, reuse, recycle

As with many other cities around the world, the City of Edinburgh Council encourages its residents to reduce, reuse & recycle - in that order.
  • Reduce: Don’t create the rubbish in the first place. (Buy less, buy items with less packaging, etc.)
  • Reuse: Don’t throw things away after their first use. Use them again.
  • Recycle: If you can’t reuse it, recycle it.
  • Dispose: If all else fails, put it out with the garbage.
So, what does this have to do with Sancho & GGP? One of the first things that we noticed after getting graphing going is that we had a big garbage collection problem. If you look at a typical graph from a game of Checkers a couple of weeks ago, you’ll see what I mean. Every 3 seconds or so, the JVM was performing >400ms of garbage collection. The largest spike was 1620ms of garbage collection in 1 second. (The G1 garbage collector is multi-threaded.) If you look at the slope of the memory usage graph between collections, you’ll see that Sancho is regularly allocating in excess of 130MB/s - all of which is garbage collected almost immediately.

So we set about doing something about it.

Reduce

The first and biggest difference was made by not allocating objects that we didn’t strictly need.
  • When Sancho expanded a node in the MCTS tree, it would create a node for every child - even though it would only do an MCTS simulation from one of them. This was a massive waste and meant that, in a game with a (relatively small) branching factor of 8, only 12% of the fringe nodes had any useful information stored in them. Most of them would never be visited and would be discarded when the tree was pruned at the start of a new turn.
  • There was some old code lying around that allocated and initialized two new double[]s for every node in the path through the MCTS tree for every iteration. However, these arrays were used for precisely nothing!
  • Sancho does the back-propagation phase recursively. However, it’s using tail recursion, which can (and should) be replaced by iteration. This is a slightly subtle one. Stack frames (needed for each recursive call) aren’t allocated on the heap and garbage collected. However, they do affect garbage collection. The garbage collector has to examine every stack frame from every thread (whilst all the program’s threads have been halted) to find “GC roots”. So, the deeper the call stack, the longer the GC pause. (I haven’t implemented the fix for this yet, but it should be straightforward - see the sketch after this list.)
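To make that last point concrete, here is a minimal sketch of the change. The TreeNode class and its fields are hypothetical stand-ins for Sancho's real node structures; the point is simply that the loop keeps the stack at constant depth.

final class TreeNode
{
  TreeNode parent;
  int      numVisits;
  double   scoreSum;
}

final class BackPropagator
{
  // Recursive version: each call adds a stack frame that the GC must scan.
  static void updateRecursive(TreeNode node, double playoutScore)
  {
    node.numVisits++;
    node.scoreSum += playoutScore;
    if (node.parent != null)
    {
      updateRecursive(node.parent, playoutScore);   // tail call
    }
  }

  // Iterative version: same effect, constant stack depth.
  static void updateIterative(TreeNode leaf, double playoutScore)
  {
    for (TreeNode node = leaf; node != null; node = node.parent)
    {
      node.numVisits++;
      node.scoreSum += playoutScore;
    }
  }
}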

Reuse

Several frequently called routines were allocating temporary local arrays for carrying out their processing. This produced lots of garbage. We sorted these out by statically allocating (*) the arrays and other objects used in these routines and then reusing them as required.

(*) Actually, we don’t use static allocation because we support running multiple instances of Sancho within the same process. However, we have some suitable per-Sancho-instance storage that is effectively static and is what we use for this.
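A minimal sketch of the pattern, with entirely hypothetical names (the real routines are Sancho's move-processing internals): the scratch array is allocated once per instance and reused on every call, so the hot path allocates nothing.

final class LegalMoveScorer
{
  private static final int MAX_MOVES = 256;        // assumed upper bound

  // Allocated once per Sancho instance and reused on every call.
  private final double[] moveScoreScratch = new double[MAX_MOVES];

  double bestScore(int numMoves)
  {
    // Before: double[] scores = new double[numMoves];  // garbage on every call
    double[] scores = moveScoreScratch;

    double best = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < numMoves; i++)
    {
      scores[i] = evaluateMove(i);
      if (scores[i] > best)
      {
        best = scores[i];
      }
    }
    return best;
  }

  private double evaluateMove(int moveIndex)
  {
    return moveIndex;  // placeholder evaluation
  }
}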

Recycle

Another massive reduction in garbage was achieved by recycling objects that were no longer needed. Sancho has always limited its MCTS tree to 2,000,000 nodes. In order to do that, it has to throw old nodes away and create new ones. As well as the nodes, Sancho has objects representing edges, game states, scores, etc., etc., all of which were being thrown away and new ones created - another source of garbage.

The key to addressing this problem was recycling the objects via pooling. All our frequently used objects are allocated from a pool of the appropriate type of object (or are directly embedded in a pooled object in a 1-to-1 mapping and allocated only when the parent object is allocated). When Sancho is finished with a pooled object, it returns the object to the pool where it is “recycled” (i.e. has its members reset), ready to be handed out on the next allocation request.
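Sketched very roughly, a pool of this sort might look like the following. This is a generic illustration under assumed names, not Sancho's actual pool implementation (which also handles the embedded 1-to-1 objects and the member resetting).

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

final class ObjectPool<T>
{
  private final Deque<T>    freeList = new ArrayDeque<>();
  private final Supplier<T> factory;

  ObjectPool(Supplier<T> factory, int initialCapacity)
  {
    this.factory = factory;
    for (int i = 0; i < initialCapacity; i++)
    {
      freeList.push(factory.get());
    }
  }

  T allocate()
  {
    // Hand out a recycled object if one is available, otherwise create one.
    return freeList.isEmpty() ? factory.get() : freeList.pop();
  }

  void free(T object)
  {
    // The caller resets the object's members before reuse (or a reset hook
    // could be invoked here) and then returns it to the free list.
    freeList.push(object);
  }
}

A node pool would then be created once, with something like new ObjectPool<>(TreeNode::new, 2_000_000), and used for every node allocation and free thereafter.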

This removes some of the advantage of Java in that it is once again necessary to keep track of allocated objects, being sure to return them to the pool when no longer needed. However, from a little bit of programming pain come big performance gains.

Dispose

We’re still not completely garbage free. If you look at a graph of Sancho playing the same game (Checkers), you’ll see that it now allocates <3MB/s, down from >130MB/s. And instead of doing 2000ms of garbage collection in 5 bursts across ~15 seconds, it now interrupts processing just twice to do 2000ms of garbage collection in a whole 20-minute game (not counting collection done during meta-gaming which we haven’t attempted to optimize).

Sunday, June 8, 2014

Move Heuristics

As I mentioned in the previous post, heuristics can be based on a state, or on a move (typically in the context of a state).  We have mostly concentrated on the former for a couple of reasons:

  • State heuristics are easier to derive by analysis in most cases because, apart from a few exceptional cases, moves are not intrinsically good or bad.  Moves are only good or bad in the context of a state, and hence any analysis must take account of both parameters
  • For a player whose node structure is a graph rather than a tree (allowing explicitly for transpositions), as is the case with Sancho, multiple different moves may lead (from different parent nodes) to the same node, which makes it harder to incorporate the heuristic result

However, there is one class of move heuristic that is rather generic (it works in a large number of games, and can be enabled by default as part of general search).  This is based on the observation that a move that is good in one state is highly likely to be good in a similar state, and hence it is desirable to preferentially search paths for moves that are known to lead to good outcomes in similar states.  The devil is in the detail of course, and defining what we mean by 'good', and especially what we mean by 'similar', is the main problem.

Because of the second issue above, we cannot (easily) use the technique of pre-seeded node score values as the mechanic by which the heuristic operates (that is not suitable when we transition into an already expanded node, but we still want the move heuristic value to act as a selection differentiator from the particular parent we have selected through).  Consequently, to enact move heuristics, we explicitly modify the UCB selection formula.  Specifically, we attach to the edge representing the move a multiplier, which is applied to the exploration term of the UCB formula.  Each time we select through an edge we reduce the multiplier (thus the effect decays with increasing selection, and asymptotically the convergence will be the same as basic MCTS).
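The following is a rough sketch of that selection tweak, assuming illustrative constants and field names (the decay rate and exploration bias here are not Sancho's actual values):

final class Edge
{
  double scoreSum;                  // sum of propagated scores for this edge
  int    numSelections;             // times we have selected through this edge
  double explorationMultiplier = 1; // > 1 for heuristically favoured moves
}

final class Selector
{
  private static final double EXPLORATION_BIAS = Math.sqrt(2);
  private static final double MULTIPLIER_DECAY = 0.9;   // assumed decay rate

  static double ucbValue(Edge edge, int parentVisits)
  {
    double exploitation = edge.numSelections == 0
                          ? 0
                          : edge.scoreSum / edge.numSelections;
    double exploration  = EXPLORATION_BIAS *
                          Math.sqrt(Math.log(parentVisits + 1) / (edge.numSelections + 1));

    // The move heuristic only scales the exploration term, so as visit counts
    // grow (and the multiplier decays) convergence matches plain MCTS.
    return exploitation + edge.explorationMultiplier * exploration;
  }

  static void onSelected(Edge edge)
  {
    edge.numSelections++;
    // Decay towards 1 so the heuristic influence fades with repeated selection.
    edge.explorationMultiplier = 1 + (edge.explorationMultiplier - 1) * MULTIPLIER_DECAY;
  }
}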

Techniques that select moves which have been seen to be good previously are quite common in the literature (various action history techniques like AMAF, RAVE and so on fall into this category - see the 'History heuristic' section of this paper).  The problem is that they don't really work well in a generic GGP setting.  This mainly stems from their implicit definition of 'similar' for states as 'not far away in the tree'.  This works fine for games where a move just modifies the position of one or two pieces, or marks/un-marks some cells, but fails utterly in games where successor states may vary wildly from their parents (e.g. - CitTacEot, Pentago, Reversi, ...).

To address this deficiency we define a distance on states as the proportion of base propositions that differ between them, and 'similar' as below a certain small distance on that measure.  To quickly identify similar states we define a locality-sensitive hash on states, which hashes a state down to a small number of bits (16 in Sancho currently) with the property that states differing by a small number of base propositions will hash to values within a small Hamming distance of one another.  We then use this to form a closed hash, directly indexed by the locality-sensitive hash (hence no more than 16 bits or so is practical), to cache a small number (up to 8 currently) of node references for the states encountered in node expansion that hash to that value.  When expanding a node we search all buckets within Hamming distance 1 of the hash of the parent node (so 16 buckets) and, for each cached node reference found, accrue the average score for each of its children into a Move->weight map (weighting a contribution by the log of the number of visits).  The result is a mapping from move to heuristic selection value.  We simply take the top few (4 currently) and boost their selection multiplier by an amount that reflects the weight.  Taking just the top few seems to produce better results, allowing the search to concentrate more narrowly.
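One simple way to get a hash with that property (not necessarily the exact scheme Sancho uses) is bit sampling: fix 16 base-proposition indices up front and read those bits out of the state.  States differing in only a few propositions will then usually differ in only a few sampled bits, so their hashes lie within a small Hamming distance of one another.  A sketch:

import java.util.BitSet;
import java.util.Random;

final class StateHasher
{
  private final int[] sampledPropositions = new int[16];

  StateHasher(int numBaseProps, long seed)
  {
    Random random = new Random(seed);
    for (int i = 0; i < sampledPropositions.length; i++)
    {
      sampledPropositions[i] = random.nextInt(numBaseProps);
    }
  }

  // 16-bit locality-sensitive hash: one bit per sampled base proposition.
  int hash(BitSet state)
  {
    int result = 0;
    for (int i = 0; i < sampledPropositions.length; i++)
    {
      if (state.get(sampledPropositions[i]))
      {
        result |= 1 << i;
      }
    }
    return result;
  }

  // Buckets to probe when looking for similar states: every hash at Hamming
  // distance 1 from the given hash (16 buckets for a 16-bit hash); the bucket
  // of the hash itself can naturally be probed as well.
  static int[] neighbourBuckets(int hash)
  {
    int[] buckets = new int[16];
    for (int bit = 0; bit < 16; bit++)
    {
      buckets[bit] = hash ^ (1 << bit);
    }
    return buckets;
  }
}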

Each time we select through a node it is re-cached, replacing the least visited node already cached in the same bucket (if we have already filled that bucket).  This ensures that the cache always contains the most visited nodes, which are likely (due to the action of MCTS) to be the best.

This technique shows a definite improvement in play in the Breakthrough family, and also some improvement in games like Pentago, without apparent harm in any game I have attempted to measure the impact on.  Consequently (at least for now) it is unconditionally enabled in the current version of Sancho.

Sunday, June 1, 2014

Heuristics

Although Sancho is primarily an MCTS player, it is highly desirable to integrate the use of heuristics where they can be identified.  MCTS is guaranteed to converge to the optimal solution for any game (any game that can be validly expressed in GDL anyway), but that convergence might be very slow, or uneven, depending on the game in question.

Good examples are:

  • Bidding TicTacToe_10coins, which converges very unevenly (considering the opening move, it first sees the high bids as most beneficial, but as it continues to search it eventually realizes that any bid over about 4 is actually a losing move!).  MCTS initially scores the high-value moves in entirely the wrong direction, before eventually converging on 0.
  • Breakthrough, which converges very slowly, due to moves that are not near the end of the board requiring a huge number of playouts before statistical differences start to show up between them.
The main example I'll use in this post is Breakthrough, because it has an easily discoverable heuristic (piece capture).  In the section that discusses possible future heuristic analysis I'll talk about one idea that might yield something that applies in Bidding TicTacToe, but currently that remains speculative, so I will spend less time on it.

Adding heuristics to a MCTS player presents two fairly independent problems, which I'll cover separately.

Identifying Heuristics

The first problem is the identification of suitable heuristics, by analysis rather than human specification (since this is GGP).  The overall approach we take in Sancho is to do this in two distinct phases during meta-gaming.

In the first phase we generate potential heuristics by static analysis, with the single very weak requirement that it must be calculable given a state (it needs to be cheap to calculate too, but that's not a hard requirement so much as a desirable property).  In particular we impose no requirement, or even expectation that a candidate heuristic is at all likely to be useful in the current game.  In principle generating more or less random heuristics at this stage would be fine, though we want to keep the total number of candidates down to something reasonably small (maybe a few 10s).  Having such a weak requirement means we can use extremely cursory 'analysis' to generate candidates, as will become apparent shortly!

In the second phase we play a number (typically thousands to 10s of thousands) of random games, and during those simulations we track the heuristic values of states encountered.  After the simulations are complete we then calculate the Pearson correlation coefficient between each candidate heuristic's observed values and the eventual score achieved in the simulated game instance in question.

Those heuristics that show correlation above a fixed threshold are then enabled, and will be used during play.
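For concreteness, the correlation step amounts to something like the following (the enable threshold shown is an illustrative value, not Sancho's actual one):

final class HeuristicFilter
{
  static final double ENABLE_THRESHOLD = 0.1;   // assumed value

  // Pearson correlation between a candidate heuristic's observed values and
  // the final scores of the simulated games it was observed in.
  static double pearson(double[] heuristicValues, double[] finalScores)
  {
    int n = heuristicValues.length;
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
    for (int i = 0; i < n; i++)
    {
      double x = heuristicValues[i];
      double y = finalScores[i];
      sumX += x;
      sumY += y;
      sumXY += x * y;
      sumXX += x * x;
      sumYY += y * y;
    }
    double numerator   = n * sumXY - sumX * sumY;
    double denominator = Math.sqrt(n * sumXX - sumX * sumX) *
                         Math.sqrt(n * sumYY - sumY * sumY);
    return denominator == 0 ? 0 : numerator / denominator;
  }

  static boolean shouldEnable(double[] heuristicValues, double[] finalScores)
  {
    return pearson(heuristicValues, finalScores) > ENABLE_THRESHOLD;
  }
}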

Experimentally we have a mobility heuristic (which is not enabled in the current version of Sancho) and a family of piece count heuristics (which will together generate a combined piece heuristic if correlations are found).  The piece count candidates are extremely simplistic - approximately speaking, we just look for base propositions whose GDL sentence has an arity of at least 3 (2-dimensional board space + role) and at most 4 (there might be a piece type separate from the others), one of which could conceivably encode role (the domain of the variable has a cardinality that is a multiple of the role count).  Once we find such a set of propositions we create a vector of state-masks (one member for each value of the could-be-role-encoding domain) with all the bits that correspond to the non-assumed-role-domain's values set.  For Breakthrough this gives you masks for all possible positions of the black pawns, and of the white pawns.  The heuristic value is then just the difference between the cardinalities of the role state masks ANDed with the current state (which is the material advantage/disadvantage).  In principle (typically in games with multiple types of pieces) there can be several candidate sets - we simply OR together those with positive correlation above the threshold to get one master piece-map - obviously this is crude, and a weighted set would be better for games with differing piece values.  However, since there aren't really any games on Tiltyard as yet where that makes a worthwhile difference, that remains a future direction for enhancement.
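As a rough sketch of what the resulting evaluation boils down to (the mask construction from the GDL analysis is omitted, and the names are illustrative rather than Sancho's actual classes):

import java.util.BitSet;

final class PieceCountHeuristic
{
  private final BitSet[] rolePieceMasks;   // one mask per role, built by analysis

  PieceCountHeuristic(BitSet[] rolePieceMasks)
  {
    this.rolePieceMasks = rolePieceMasks;
  }

  // Material advantage of the given role over its opponent in a 2-player game.
  int materialDifference(BitSet state, int role)
  {
    int ourPieces      = countPieces(state, role);
    int opponentPieces = countPieces(state, 1 - role);
    return ourPieces - opponentPieces;
  }

  private int countPieces(BitSet state, int role)
  {
    BitSet pieces = (BitSet)state.clone();
    pieces.and(rolePieceMasks[role]);      // keep only this role's piece props
    return pieces.cardinality();
  }
}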

There are many more possible candidate heuristics one could construct generically (mobility, focus, goal value for games where you get defined values in non-terminal states, ...), and I'll come back to that later when I talk about future directions.

Integrating heuristics into MCTS

The second problem is how to use heuristic values for states (or potentially moves, though that's a little different and I'll talk about that in a future post) in the context of MCTS search.

There are two obvious possible mechanisms (or strictly two families of mechanisms) to modify MCTS for heuristic signals.  These have largely been explored in the literature (this is an excellent overview of known modifications to MCTS), and work by either:
  1. Seeding the score for a node on creation (e.g. - by virtual playouts of the heuristic value); or
  2. Modifying the UCT formula to include a heuristic term (usually with progressively lowered bias as the visit count increases).  This is somewhat reliant on the convergence rate of the subtrees up-selected by the heuristic being reasonably fast; otherwise, even though the heuristic may point 'the right way', if the resulting score signal is too small, noise will obliterate the benefit.
I spent a number of weeks around the start of this year experimenting with both approaches, and found that approach (2) had a significant issue with convergence rates in some games, due to the node score and selection rate becoming less well correlated.  This means that the choice of the best move to play cannot be determined just by looking at the node score vectors; otherwise low-visit-count nodes tend to be chosen and play becomes erratic.  As the number of expansion cycles increases this becomes less of an issue, and ultimately convergence will be correct, but it was a practical issue for some of the slower games from Tiltyard.

As a result of this I settled on the first approach (though latterly we also perform some selection-term processing, for a slightly different purpose, which I'll talk about in a future post).  The benefit is that scores and selection remain strongly correlated, and it functions well in a very slow MCTS convergence environment, because even if all playouts see essentially the same score, the heuristic signal dilutes only very slowly.

The exact mechanism we use is to have each enabled heuristic (that is, each one found to correlate positively during meta-gaming) produce a modified score vector, based on the known score vector for a reference state (canonically this should probably be the parent node's state, though we currently use that of the root [i.e. - current position at start of turn] for somewhat historical reasons) and the heuristic difference between the current state and the reference state.  We also ask the heuristic for a 'strength' value (i.e. - how much it thinks we should pay attention to it in the current state), write the computed score vector as the newly created node's score vector, and give the node a virtual visit count proportional to the reported heuristic strength.  We do NOT perform virtual back-propagation, but instead modify the back-propagation update routine, so that when a node is encountered whose visit count is higher than the selection traversal count for the edge leading to it (in that selection path), the value propagated up from it is a blend of its score and the playout score, weighted by the proportion of visits that occurred along the path in question.  This technique serves to back-propagate the seeded heuristic value (progressively less as more selections through it are made), and also adds information to back-propagation arising from the graph-rather-than-tree nature of our MCTS structure (due to transpositions).
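A rough sketch of those two pieces, with illustrative names and weights (in particular the virtual-visit scaling is an assumption, not Sancho's actual constant):

final class HeuristicSeeding
{
  static final class Node
  {
    double scoreSum;     // accumulated score (average = scoreSum / numVisits)
    int    numVisits;
  }

  // On node creation: seed the score with the heuristic-adjusted value and give
  // the node virtual visits proportional to the heuristic's reported strength.
  static void seedNewNode(Node node, double heuristicScore, double heuristicStrength)
  {
    int virtualVisits = (int)Math.round(heuristicStrength * 10);   // assumed scale
    node.numVisits = virtualVisits;
    node.scoreSum  = heuristicScore * virtualVisits;
  }

  // During back-propagation: if the node has more visits than we have selected
  // through this edge (virtual visits, or visits via transpositions), propagate
  // a blend of the node's own average and the playout score, weighted by the
  // proportion of its visits that came through this edge.
  static double valueToPropagate(Node node, int edgeSelections, double playoutScore)
  {
    if (node.numVisits > edgeSelections && node.numVisits > 0)
    {
      double nodeAverage = node.scoreSum / node.numVisits;
      double edgeWeight  = (double)edgeSelections / node.numVisits;
      return edgeWeight * playoutScore + (1 - edgeWeight) * nodeAverage;
    }
    return playoutScore;
  }
}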

The strength value we use for the piece heuristic is simply inversely proportional to the size of the current difference in material - i.e. - if we're miles ahead in material we prefer not to distort MCTS search as much to try to capture yet more; and conversely if we're miles behind then material probably isn't going to be our salvation anyway, so again it is better to emphasize undistorted MCTS.
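For illustration, a strength function with that shape could be as simple as the following (the exact falloff Sancho uses may differ):

final class PieceHeuristicStrength
{
  // Strength decays as the absolute material difference grows, so MCTS search
  // is distorted least when we are far ahead or far behind on material.
  static double strength(int materialDifference)
  {
    return 1.0 / (1 + Math.abs(materialDifference));
  }
}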

As an example of the piece heuristic in action consider this game (it can be played through by using cursor left to go back a move and cursor right to move forward).  With apologies to Richard (the unfortunate victim), I chose this game because it is a recent example of Sancho against a strong MCTS player operating only with MCTS (I believe anyway), and it illustrates the problems MCTS has with slow-convergence games like this.  What you'll note is that there is a strong tendency for MCTS players to want to simply advance their pawns, and without a notion of captures they tend to do so regardless of whether they can be captured, at least until about the 6th rank (at which point the edges of the tree become sufficiently clear, even with fairly large branching factors as here).  The main impact of having a heuristic that understands the notion of capture is not so much that it makes the player play 'well' in any deep sense, but rather that it enables it to avoid 'obvious' blunders, and to capitalize when its opponent makes those blunders.  At the current state of the art for GGP, that is quite enough!

Future Directions

This is potentially a hugely fruitful area for further exploration.  The main areas to be explored are:
  • Heuristics for moves rather than states (we already do some of this as of about a week ago, and this will be the subject of a future post)
  • Use of heuristics (move or state) in playouts to improve the (statistical) quality of playouts
  • Development of more candidate heuristics for use within the current framework
In particular I have a long list of candidate heuristics that might be employed in the current structure to try at some point:
  • Mobility.  Andrew has already implemented and experimented with this a bit.  In a few test games he found it to be useful, but mostly in games that already have the piece heuristic, and the combination did not seem to improve enough to justify use (that is, mobility was better than nothing, but it was beaten by the piece heuristic and the combined heuristics seemed to add little).  For now it remains disabled, but it is likely that it will be re-visited at some point
  • Focus (flip side of mobility, probably doesn't need a separate heuristic, just use of negative correlations above a threshold to trigger use of minus the heuristic value)
  • GDL Goal value.  Since some games define the goal value in all states, there will be some games where this correlates well.  It's trivial to implement too.
  • Distance-to-win.  There is a class of games where a win is triggered by one or more base propositions becoming true (the breakthrough family, sheep&wolf as the wolf, ghost maze, ...).  In those games one can use the dependency distance matrix I discussed in a previous post to see how close a state is to a winning state (or how close a move is to generating a winning state).  In breakthrough (for example) summing this measure over the state would yield an aggregate how-advanced-up-the-board-you-are heuristic, which would tend to encourage exchanges in your half of the board (which necessarily involve opponent pieces that have expended several moves getting there, in return for pieces of yours with less invested movement), and discourage them in the opposing half.  In games of Breakthrough (Small) between good MCTS players, where these exchanges take place tends to be the game-determining factor.
  • Mutual-support.  Again, using the proposition distance matrix, in a game where base propositions can be tentatively assigned to a role [such as pieces] one can calculate the average distance of set base propositions (for a role) from each other.  In games like breakthrough this would give a measure of how well pieces are supporting one another.
  • Opening book patterns.  States visited in previous runs of the same game that are strongly correlated (or anti-correlated) with the eventual result could be saved in a persistently stored 'cache' and learnt over time (many plays of the same game).
  • Numeric quantity detection.  It shouldn't be terribly hard to detect the presence of numbers in the state by static analysis (e.g. - the number of coins held by each role in Bidding TicTacToe).  These numbers directly provide candidate heuristics to check for correlations (hopefully this would show that holding a large number of coins in Bidding TicTacToe, and hence committing less in auctions, is a good idea).
  • State pattern matching.  In principle, patterns within a state could be learnt over many plays of the same game.  I have in mind a genetic algorithm starting with random state bitmaps (of small cardinality).  Playing the same game repeatedly (in learning mode against itself probably), it should be possible to train up useful patterns (think of the diagonal relation between pawns, for example).  This is more a far-future blue-sky possibility that I don't expect to realize in the near future!