Tuesday, May 27, 2014

A picture is worth a thousand words

I have spent the last little while sorting out Sancho's logging along with the recording and visualization of statistical information.  No, it isn't the most thrilling thing to do, but I'm convinced that it will be worth it.

Up until now, Sancho has had an assorted collection of System.out.println calls with the odd System.err.println for good measure.  We even dumped occasional performance figures (number of MCTS iterations in the last turn, thread utilization, etc.) in amongst all the other output.  That's all very well, but we soon discovered that you can't see the wood for the trees.

So, if you're still developing a player, I'd recommend that now's the time to get to grips with logging.  The sooner you do it, the more you'll benefit from it.  And whilst I'm offering advice, here are the tools that I've used - all of which are great.

Log4j 2

Log4j 2 is an excellent logging framework that's simple to integrate and easy to configure to suit your needs.  This was an easy decision for me because I use Log4j (the original) at work.  Log4j 2 has sorted out all sorts of things that weren't quite right in the first edition and it's a great product.

Sure, it was a bit tedious replacing all the println calls with LOGGER.debug/info/warn/error, but I can now trivially, for example, print info level logs and higher to the console in a concise format and debug level logs and higher in verbose format to a per-match log file (which means we'll always have the logs from every game we've played - avoiding the 'doh' moment when you realise that you failed to save off the console output and have now closed the window).  Statistics information is written to its own file in machine-readable format.
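As a concrete illustration, a Log4j 2 configuration along the following lines gives concise INFO-and-above on the console and verbose DEBUG-and-above in a file.  This is a minimal sketch, not Sancho's actual configuration - the file name and layout patterns are illustrative, and truly per-match log files would need something extra (e.g. a RoutingAppender keyed on a match-id ThreadContext value).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="warn">
  <Appenders>
    <!-- Concise INFO+ to the console -->
    <Console name="Console" target="SYSTEM_OUT">
      <ThresholdFilter level="INFO" onMatch="ACCEPT" onMismatch="DENY"/>
      <PatternLayout pattern="%d{HH:mm:ss} %-5level %msg%n"/>
    </Console>
    <!-- Verbose DEBUG+ to a (here: single, illustrative) match log file -->
    <File name="MatchLog" fileName="logs/match.log">
      <PatternLayout pattern="%d{ISO8601} [%t] %-5level %logger{36} - %msg%n"/>
    </File>
  </Appenders>
  <Loggers>
    <Root level="debug">
      <AppenderRef ref="Console"/>
      <AppenderRef ref="MatchLog"/>
    </Root>
  </Loggers>
</Configuration>
```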

Highcharts

Highcharts is a super JavaScript graphing library.  I've dabbled with this before, although not very much.  However, it's really easy to pick up and has great documentation.

By graphing our statistical output, Steve & I have already spotted a couple of significant issues which we'd never have seen otherwise.  On one of my boxes, something is causing performance to regularly fall off a cliff.  (I suspect page faulting, but haven't investigated yet.)  On Steve's box, he has seen excessive garbage collection which we hope to address with additional object pooling.

Tiltyard log upload

If specified in a player's registration, Tiltyard will hit a URL after each match.  Whatever is returned, Tiltyard stores in the match database.  Then, when you look at the detailed record of a game on Tiltyard, your player will have a little graph icon next to it.  Clicking on that will show the uploaded information.  Furthermore, if you add a log visualization URL to your player registration, you can produce your own custom viewer.

(I'm still having a few teething troubles with log uploads, but expect to have them available for all Sancho's matches soon.)

Sample

So still very much a work in progress, but here's that promised picture.  (And, once I've sorted out the Tiltyard wrinkles, just take a look at any of the recent matches Sancho has played in.)



Sunday, May 18, 2014

Use of Meta-gaming time

When a GGP server starts a new game and provides the GDL to the players, it then gives a grace period, known as meta-gaming time, to those players for them to perform any pre-game analysis they might wish to do, ahead of starting the actual move clock.  Typically the allotted meta-gaming time will be 3-4 times the per-move time.

During the meta-gaming time players must at least set up any machinery they will need to process the moves.  Typically this at least involves building a suitable state machine, and in Sancho's case that involves building and optimizing a propnet, as described in previous posts.

However, meta-gaming time is an opportunity to perform considerably more analysis, and use it to tune the way you play the game in any number of ways.  Sancho goes through a number of distinct stages during meta-gaming, which are described in what follows.

Build and optimize the propnet - this has been discussed previously

Static dependency analysis

Once the propnet has been constructed, we then analyse it to determine 'dependency distances' between propositions.  We define the dependency distance between proposition A and proposition B as the number of transitions that must be passed through, traversing the input graph of A backwards, before B is encountered.  Obviously this can be infinite, and the analysis works (at least loosely speaking) by using dynamic programming to (in essence) prevent any graph traversal having to cross the same edge twice.  For the purposes of this analysis, 'legal' props are considered to be virtual inputs to their corresponding 'does' props.  The result is (logically - we currently do not actually represent it this way, though we may move to do so at some point) a matrix of distances, with the propositions of the network as the row and column indexes.  The entry values in this matrix will be infinity or an integer less than the network size.

These dependencies can be useful in many subsequent analyses (some of which I will talk about below), but as a simple illustration consider the game of 'Breakthrough'.  The game terminates when one player gets one of their pawns to the opposite side of the board from where they started.  Considering the black role, whose pawns start on rows 1 and 2 of the board, this is equivalent to getting a pawn to the 8th row.  If you look at the dependency distance matrix row for dependencies feeding the 'goal black 100' proposition (i.e. - a black win), what you will see is that the 8 base propositions corresponding to a black pawn on row 8 have a distance of 0, those on row 7 a distance of 1, and so on.  Similarly moves of the form 'move x 7 y 8' will be at distance 1, 'move x 6 y 7' at distance 2, and so on.  This leads to a natural candidate closeness-to-goal heuristic (which can be defined on states or moves), which is currently an active area of ongoing experimentation.
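The distance computation can be illustrated with a plain breadth-first search from the target proposition.  This is a simplified sketch, not Sancho's implementation: it treats every backwards dependency edge as crossing exactly one transition, whereas the real analysis distinguishes transition-crossing edges from plain logic edges.  All names are hypothetical.

```java
import java.util.*;

/** Hypothetical sketch: dependency distances to a target proposition,
    found by BFS over the dependency graph traversed backwards. */
final class DependencyDistances {
  /** dependsOn.get(p) lists the propositions that p depends on,
      each assumed to be one transition away (a simplification). */
  static int[] distancesTo(int target, List<List<Integer>> dependsOn) {
    int n = dependsOn.size();
    int[] dist = new int[n];
    Arrays.fill(dist, Integer.MAX_VALUE);     // MAX_VALUE stands in for "infinite"
    Deque<Integer> queue = new ArrayDeque<>();
    dist[target] = 0;
    queue.add(target);
    while (!queue.isEmpty()) {
      int p = queue.poll();
      for (int q : dependsOn.get(p)) {        // q feeds p, one transition back
        if (dist[q] == Integer.MAX_VALUE) {   // each node visited once, so no edge is crossed twice
          dist[q] = dist[p] + 1;
          queue.add(q);
        }
      }
    }
    return dist;
  }
}
```

In the Breakthrough example, the row-8 base props would come out at distance 0 from 'goal black 100', row-7 props at distance 1, and so on.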

Factorization

Some games factorize into 2 or more independent sub-games (or in some cases semi-independent, which is a case I'll return to shortly).  A canonical example is 'connect4 simultaneous', which is just 2 games of connect4 where each role plays alternately on the two boards (so in any given turn each role plays on one board and the boards alternate).  Without factorization this has an approximate branching factor of 16 (neglecting full columns), so to explore the tree to depth N fully would require roughly 16^N states to be searched.  However, if you actually treat it as 2 games, and search the two game trees independently you need only search 2*(8^N) nodes to completely search to depth N.
The ratio between these two is: 2^(N-1), so that's an exponential reduction of search effort in depth searched!

We can use the dependency distances calculated during dependency analysis to try to find factors in games.  Basically this involves looking at the propositions on which disjunctive factors of the goal and terminal propositions depend.  Taking the connect4 simultaneous example, the win goals (lines of 4 on either board) will boil down to an OR (disjunctive factor) of conditions which are each dependent directly (distance 0) on 4 base propositions (each instance of a line of four).  In turn those base propositions turn out to be dependent (in this game) on the base propositions for the same column but with a lower row number (at greater distance), and to have no dependency at all on other columns.  You then construct equivalence sets of propositions, such that any two propositions that depend on a common proposition feeding into the same disjunctive factor (OR) of the goal (or terminal) calculation belong to the same equivalence set.

With some fiddling for annoyingly written GDL (take a look at the way the terminal condition for C4 simultaneous is written and you'll probably see what I mean), and for 'control propositions' which are independent of moves, you wind up with either all propositions being in the same set (game is not factorizable), or with multiple sets, which are the factors.  In C4 simultaneous you wind up with all the base props and moves for one board in one set, and those for the other in another.

At this point there are still two different cases to consider, which highlight the difference between 'truly factorizable' games, like 'connect4 simultaneous', and semi-factorizable games, of which 'Breakthrough w/walls' is the only example on Tiltyard that I am aware of.  The difference is that in the semi-factorizable games you have more than one choice of move on a single turn in multiple factors.  This means that if you search the factors independently you have an issue with deciding how to use the result to choose an overall move for the full game.  This can be resolved by inserting pseudo-noop moves into the factors and weighing the relative score of the best move in one factor against the cost of having to play the pseudo-noop in the other(s).

Sancho plays the following games in a successfully factorized manner (at least these are the ones I am aware of and have specifically tested):

  • Connect 4 simultaneous
  • Dual Connect 4 (Stanford repository)
  • Joint Connect 4 (Stanford repository)
  • Chinook
  • Breakthrough w/walls
  • Dual Hamilton (Stanford repository)
  • Joint Button and Lights (Stanford repository)
Note - the 'multiple xxx' games on the Stanford repository are not really factorized in a meaningful sense.  It's more that they have a lot of ignorable junk in their generated base propositions and moves, which would pertain to factors if it weren't for the fact that you can eliminate them easily by simpler techniques.


Generation of candidate heuristics

After the factorization analysis is complete, we then generate candidate heuristics.  I'm not going to go into detail about this in this post (there will be posts on heuristics later), but the key points are:

  • Static analysis is used to produce possible heuristics for the game
  • No attempt is made to determine (at this point) if they are likely to be any real use!  For example, one might throw in focus and mobility heuristics here, since they are easily derivable from the GDL and/or propnet
  • Crucially (for some games) one such heuristic is a measure that looks like it might reflect piece count in the game (and I stress 'might' - we do no clever analysis to verify that it does, just a basic arity and set-size test that base propositions could be piece locations on a board of at least 2 dimensions)


Game simulation to gather statistics

We then spend half the remaining meta-gaming time playing random games (playouts from the initial state), and gathering statistics, such as:

  • Distribution of game length
  • Average branching factor
  • Presence of simultaneous moves (easily determined statically, but for historical reasons we do it dynamically)
  • Effectiveness of 'greedy' playout processing, wherein the playouts are not totally random, but actively perform a 1-level search at each step to take forced wins and avoid forced losses (this makes playouts much more expensive, and so we get fewer, but in some games it is highly effective at improving the convergence rate of MCTS)
  • Correlation of heuristic signals from each of the candidate heuristics with actual outcomes


Tuning of search parameters

Based on the results of simulation we then tune a number of search parameters, which we will use during MCTS expansion as we play the game.  Examples of things tuned are:

  • Enablement of greedy playouts (typically this costs up to an order of magnitude in simulation rate, so we only enable it where it was found to be very effective relative to its cost, which is itself largely dependent on the game's branching factor)
  • Exploration bias - i.e. - how much weight to give the UCB exploration term in relation to its exploitation term.  Note that this can have a very significant impact indeed on play quality, and if you seek to tune anything (even to find a good fixed average for use in all games) this is the one to start with.
  • How to perform node expansion - specifically a large cost in MCTS node expansion is evaluating the child states for terminality.  It is possible to structure things to defer this until the child in question is itself expanded, but this has a cost in convergence rate (if you propagate the consequences of terminality upwards assuming rational play), and is usually a bad idea.  However, if you know the game is unlikely to terminate at a certain node, based on a simplistic analysis of things like the game length distribution, you can decide whether or not to perform this check dynamically, depending on the node depth in the game tree.
  • How to handle 'complete' nodes (that is to say, those whose value is known with certainty, of which terminal nodes are the canonical example).  In particular these need to be handled differently (at least if your tree edges represent individual role decisions as ours do, rather than joint moves) in simultaneous move games.
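For concreteness, the exploration bias mentioned above is the constant that scales the exploration term in the standard UCB1 formula used during MCTS selection.  A minimal sketch (the names are mine, not Sancho's):

```java
/** UCB1 value of a child node during MCTS selection:
    exploitation (mean score) plus a tunable exploration term. */
final class Ucb {
  static double ucb1(double meanScore, int parentVisits, int childVisits,
                     double explorationBias) {
    if (childVisits == 0) {
      return Double.POSITIVE_INFINITY;   // always try unvisited children first
    }
    return meanScore
         + explorationBias * Math.sqrt(Math.log(parentVisits) / childVisits);
  }
}
```

Raising the bias makes lightly-visited children more attractive (more exploration); lowering it concentrates search on the current best line, which is exactly why tuning it per game matters so much.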


Crystallization of heuristics

Based on the correlation of each candidate heuristic with the simulated outcomes sampled above, we then enable those heuristics which appear to work in the game in question.  That set will then be used during move search.  In particular we are able to identify piece capture in games like checkers and breakthrough in this way, and up-weight their selection during MCTS expansion.  If you've ever watched a pure MCTS player play Breakthrough and wondered why it quite frequently seems to just suicide its pieces when they are attacked, you'll appreciate why this is effective.
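The post doesn't say exactly which correlation measure Sancho uses, but a plain Pearson coefficient between the heuristic's value in sampled states and the eventual playout outcomes would serve; a hypothetical sketch:

```java
/** Sketch: Pearson correlation between heuristic values (x)
    and playout outcomes (y); near +1 suggests a useful heuristic. */
final class HeuristicCorrelation {
  static double correlation(double[] x, double[] y) {
    int n = x.length;
    double mx = 0, my = 0;
    for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    double sxy = 0, sxx = 0, syy = 0;
    for (int i = 0; i < n; i++) {
      double dx = x[i] - mx, dy = y[i] - my;
      sxy += dx * dy; sxx += dx * dx; syy += dy * dy;
    }
    return sxy / Math.sqrt(sxx * syy);
  }
}
```

A piece-count heuristic in Breakthrough would score close to +1 here (more pieces correlates with winning), while a spurious candidate would hover near 0 and be discarded.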

Make an early start on first turn search

If we have any meta-gaming time remaining after all of the above, we start the game search thread so it can get on with considering the first move.  While it is doing that, we yield on the meta-game request thread until a couple of seconds before the timeout, which ensures that the server does not curtail meta-gaming early (after all, with most opponents we are productively searching at this point while they are twiddling their thumbs, so the longer we can prolong this the better!)

Summary

Your meta-gaming time is a valuable resource.  Use it!

Saturday, May 17, 2014

Multi-threading

In Steve's overview of Sancho, he mentioned that we have "an efficient infrastructure for parallelization across threads".  This is an area that I've done a fair bit of work on, so it seemed like a good topic for my first post.

Sancho uses several types of threads.
  1. Those threads started for us by ggp-base.  This includes the thread on which we're asked to submit moves (via the stateMachineSelectMove call).
  2. A single game searcher thread.
  3. Several simulation threads.
The first group of threads is largely uninteresting and everybody who uses ggp-base will get them automatically.  They deal with network communications.

Game searcher thread

The game searcher thread is the primary thread responsible for expanding the game tree(1).  In order to avoid the necessity of locking, this is the only thread that is permitted to write to the MCTS "tree".  This is another example of the single-writer principle, which plays a significant role in avoiding time wasted through synchronization.

The game searcher thread is started during meta-gaming and then runs continuously throughout the game.  This means that Sancho can continue to search the game tree at all times, including (a) during the normal time for considering moves, (b) after submitting its move but whilst opponents are still thinking and (c) once all players have submitted their moves but before the server has asked for the next move (which can be a very considerable interval on the Tiltyard, especially when busy).

The game searcher thread is responsible for the MCTS "select" and "expand" phases as well as the "back-propagation" phase because all of these phases are intimately involved with the MCTS tree.  However, the "simulate" phase of MCTS makes no use of the game tree at all and can be performed independently.


Simulation threads

The simulation threads(2) are responsible for the "simulate" phase of MCTS.  Sancho uses a pool of these threads, each running completely independently from the others.  From a specified leaf node, a simulation thread performs a random "depth-charge" through the game state until a terminal state is reached.  At this point, it gets the goal values and passes them back to the game searcher thread for back-propagation.

The highly efficient propnet implementation, combined with there being several simulation threads, means that the game searcher thread is currently the bottleneck in Sancho.  Rather than sitting idle, the simulation threads perform more than 1 simulation per request and feed back an average of the goal values seen.  The number of samples per request is dynamically tuned, based on how long the threads spend blocked waiting for more work, in an attempt to keep the simulation threads running just a little faster than the game searcher thread (because it's the bottleneck, and it would be a false economy for the simulation threads to be doing additional simulations if it meant that the game searcher thread was blocked trying to add requests to their work queues).
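The tuning loop might look something like the following.  This is purely an illustrative sketch under my own assumptions - the thresholds, names and the blocked-time measure are hypothetical, not Sancho's actual code:

```java
/** Sketch: adapt simulations-per-request from the fraction of time a
    rollout thread spends blocked waiting for work.  Thresholds illustrative. */
final class RolloutTuning {
  static int tuneSamples(int current, double fractionBlocked) {
    if (fractionBlocked > 0.10) {
      return current + 1;       // plenty of slack: spend it on extra samples
    }
    if (fractionBlocked < 0.02 && current > 1) {
      return current - 1;       // no slack: the searcher may be waiting on us
    }
    return current;             // roughly in balance: leave it alone
  }
}
```

The aim is a mild feedback loop: simulation threads that are mostly blocked can afford bigger batches, while threads that never block risk starving the (bottleneck) game searcher and should batch less.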


Pipeline

As a result of the threading model above, each MCTS iteration is first handled by the game searcher thread (for select/expand) then by one of the simulation threads (for simulation) then by the game searcher thread again (for back-propagation).

A couple of years ago, software engineers for a high-frequency financial trading platform discovered that a significant proportion of time was wasted putting messages into queues and then de-queuing them again, because this requires locking for access to the queue structures.  Furthermore, the locks are usually contended because inevitably either the producer is running faster than the consumer or vice-versa and so all the activity is either at the head or the tail of the queue.  To deal with message passing more efficiently, they invented the Disruptor, a lock-free message passing mechanism.

The Disruptor is written in Java and freely available.  Sadly, the Disruptor code isn't a good fit for Sancho's threading model (in particular, passing back and forth between two threads for three phases of processing).  However, the principles still apply and Sancho uses structures inspired by the Disruptor and its Ring Buffer to achieve the same high-performance lock-free architecture for performing MCTS.
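To show the flavour of the idea, here is a minimal single-producer/single-consumer ring buffer in the Disruptor's spirit: no locks, just two monotonically increasing sequence counters.  This is an illustrative sketch, not Sancho's actual structure (which has to shuttle work back and forth between the searcher and a pool of simulation threads).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of a lock-free SPSC ring buffer: one writer advances tail,
    one reader advances head; capacity must be a power of two. */
final class SpscRing<T> {
  private final Object[] slots;
  private final int mask;
  private final AtomicLong head = new AtomicLong();   // next slot to read
  private final AtomicLong tail = new AtomicLong();   // next slot to write

  SpscRing(int capacityPowerOfTwo) {
    slots = new Object[capacityPowerOfTwo];
    mask = capacityPowerOfTwo - 1;
  }

  boolean offer(T item) {
    long t = tail.get();
    if (t - head.get() == slots.length) return false; // full
    slots[(int) (t & mask)] = item;
    tail.lazySet(t + 1);                              // publish to the consumer
    return true;
  }

  @SuppressWarnings("unchecked")
  T poll() {
    long h = head.get();
    if (h == tail.get()) return null;                 // empty
    T item = (T) slots[(int) (h & mask)];
    head.lazySet(h + 1);                              // free the slot for the producer
    return item;
  }
}
```

Because each counter is written by exactly one thread, the structure respects the single-writer principle: neither side ever takes a lock, and contention is limited to cache-line traffic on the two sequences.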


Thread affinity

Sancho uses processor affinity for tying CPU-intensive threads to particular cores (or hyper-threads on hyper-threaded systems).  Whether this is beneficial or harmful appears to depend on the processor involved.  On an i7-4770 using 4 out of 8 hyper-threads for CPU-intensive work, we see a 5% improvement from binding each thread to its own core.  On an i5-3470, using 4 out of 4 cores (no hyper-threading) we see no significant difference.  On an i5-3540M, using 2 out of 4 hyper-threads, we see an unquantified but significant decrease in performance by using processor affinity.


Areas for development

I doubt we've finished working on Sancho's threading just yet.
  • The dynamic adjustment of simulations per request isn't as stable as it ought to be.  That just needs more debugging.  Perhaps the statistics are thrown off by garbage collection, other processes on the system or some other factors.  Who knows?
  • The game searcher thread really is the bottleneck.  It's quite probable that we can't get sufficient improvements in its performance for it to keep up with the simulation threads.  However, perhaps we can push some of the select/expand work onto the simulation threads without violating the single-writer principle (which would require locking).
  • Perhaps it's possible to create a lock-free MCTS structure that can be operated on by several threads simultaneously - but that's for another day.

 

Footnotes

(1) Because Sancho detects multiple sets of moves leading to the same state (i.e. transpositions), the MCTS "tree" isn't a tree at all.  Instead, it's a directed acyclic graph.
(2) We call these threads "rollout" threads, but "simulate" seems to be the normal terminology for the 3rd step in the MCTS process.

Thursday, May 15, 2014

Propnets - moving to an external representation

Once basic algorithms have been optimized there are typically three main sources of slow-downs when executing code on modern CPUs.  These are:
  1. Thread synchronization overheads.  Since we already have a single-writer design, with separate state data for each thread (aka instance) this is not an issue with our propnets, as they require no locking, or any other form of synchronization.
  2. CPU cache misses.  An L1 cache miss (on x86) will cost circa 10 instructions worth of delay, miss on L2 maybe 30, and a complete cache miss that requires a DRAM access the best part of 1000.  Consequently algorithms that exhibit high degrees of cache friendliness (locality of reference, and predictable stride patterns through data) can potentially operate orders of magnitude faster than algorithms which continually access data scattered all over the memory space.  A good discussion of this can be found in this blog post on memory access patterns.  Our original propnet, which retained state in individual components, themselves located randomly about the Java heap, was extremely lacking in locality of reference, so this is an area that could yield significant performance gains through some judicious restructuring.
  3. Branch prediction misses.  Most modern CPUs are fairly deeply pipelined, meaning that they typically have a number of instructions 'in flow', and at various stages of execution, at once.  Whenever a branch is encountered the CPU has to decide which instructions to feed into the pipeline following the branch, and it has to make this decision before it knows how the branch condition will evaluate.  If it guesses wrong (a branch prediction miss) then the pipeline has to be flushed and restarted, adding significant latency.  Minimizing branch conditions, and maximizing their predictability inside tight, very frequently executed loops therefore can result in substantial speedups.  In our case the innermost computation, which we iterate over a large number of times, is the propagation of a single component.
We addressed these issues by moving the state variables out of the components into much more cache-friendly structures, and processing them in a much tighter and more predictable manner.  To review the previous setup, before going into details of the new one - the state variables we hold in each component are the current state, the last propagated state (both boolean) and the true-input-counter (for ANDs and ORs).  All of these are actually vectors, indexed by instance id, to separate the state of different threads.

This last point (that each is actually a vector) hides another problem we will want to address while we're at it.  Because the state variables (array elements in these vectors) are adjacent to one another in memory (each array of a primitive type is a single allocation in Java), but written by different threads, this gives rise to another performance issue known as false sharing, which can also cause CPU pipeline stalls.

In the new structure we will replace all the state variables, for all the components with a single array of ints per thread.  The elements of these arrays are the states of individual components, indexed by a component id (the exact numbering of the components is functionally arbitrary provided it is consecutive from 0, though as we'll see the ordering is not performance neutral).  Each int uses one bit for current state, and the rest for the counter.  We can actually do away with the last propagated state because we only ever propagate when the state changes, so it is implied by the current state.  The advantages of this basic layout are two-fold:
  • The entire state for one thread is held in a contiguous array of size equal to the number of components in the propnet.  This will easily fit into L3 cache, and for many games into L2 cache.
  • The state for different threads is entirely separated from one another, so the false sharing issue goes away.
We also need to be able to navigate the topology of the propnet (work out what components a given component outputs to) in a cache-friendly manner.  To do this we decant a representation of the propnet into two more arrays, which are read only once created.  The first contains one long per component, with packed bit-fields to represent:
  • The component type
  • The number of inputs it has
  • The number of outputs it has
  • The offset, in the connectivity array (see below) of the first output
The second one, known as the connectivity array, contains the ids of the components the outputs of every component attach to as contiguous sub-regions:


Between them, these two arrays entirely encode the structure of the propnet, and propagation requires only read access to them, as reasonably cache-local structures.

With this structure, propagation of a component is a loop over its outputs (a contiguous region of the connectivity table, with predictable stride), processing each output component and, if that output component itself changes state, recursing to propagate it.  It turns out that this loop's major cost is now the branching involved in determining the component type, and processing each type (which itself involves branches depending on the current state, the value of the counter, etc.).

Almost all of this unpredictability can, however, be removed by making some further optimizations.  Firstly, we can reduce component counts where a NOT component occurs whose input component connects to no other components apart from the NOT.  Simplistically, we could just add a bit flag to the component type field stating that the output is inverted, and handle the NOT in its parent component (effectively implementing NAND and NOR).  However, it turns out we can do better than that.

Recall that ANDs and ORs are actually implemented as dead-reckoned counters, incremented when an input transitions from false to true, and decremented on the opposite transition.  If we make the 'current state' bit the top bit of the value, then we can arrange for it to switch ONLY by the counter increment and decrement, through deliberate integer overflow.  In fact, by suitably initializing the counter, we can implement all logic components (AND, OR, NAND, NOR, NOT) with exactly the same operation (increment when an input transitions to true, decrement when it transitions to false).  Specifically:
OR: initialize to 0x7FFFFFFF - first increment wraps to negative and sets the top (state) bit
AND: initialize to 0x80000000 - <num inputs> - takes <num inputs> increments to toggle the top bit
NOR: initialize to 0xFFFFFFFF - first increment wraps to 0, clearing the state bit
NAND: initialize to -<num inputs> - after <num input> increments top bit clears
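These four initializations can be verified with a tiny self-contained sketch (the class and method names here are mine, for illustration only; the real propnet works directly on a per-thread int array rather than through methods like these):

```java
/** Sketch: branch-free logic gates.  One int per component, state in the
    sign (top) bit, every gate type driven by the same increment/decrement. */
final class GateCounters {
  static int initOr()             { return 0x7FFFFFFF; }             // next ++ wraps negative
  static int initAnd(int inputs)  { return 0x80000000 - inputs; }    // needs <inputs> ++ to go negative
  static int initNor()            { return 0xFFFFFFFF; }             // next ++ wraps to 0
  static int initNand(int inputs) { return -inputs; }                // <inputs> ++ to reach 0

  static boolean isActive(int v)  { return v < 0; }                  // top bit = current state

  static int inputBecameTrue(int v)  { return v + 1; }
  static int inputBecameFalse(int v) { return v - 1; }
}
```

Every gate is just an increment or decrement followed by a sign test - no per-type branching in the inner loop, which is precisely what keeps the branch predictor happy.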

Similarly we can collapse transitions and propositions into generalized trigger-generating components, and use a bit-flag in the component info to determine whether it's an InternalMachineState or a LegalMoveSet that acts as a handler for the trigger.  The net result is that we are left with only 2 component types, which requires just one comparison (which is also fairly predictable, since logic components typically dominate).  The logic components all use precisely the same action (inc/dec), so no branches are involved there.

Finally, we can optimize the id-ordering of components, to maximize the likelihood that consecutively processed outputs are themselves consecutively id'd components (which means increased locality and a more regular stride through the component info array).

Overall, this gives us an extremely tight inner loop for the entire propagation process, which is both cache friendly and branch-prediction friendly.  Compared to our previous internal representation, this 'fast propagator' structure yields speedups of 2-4 times (varying by game).

Wednesday, May 14, 2014

Propnet Optimization

In this post I'm going to talk about three broad categories of optimization pertaining to propnets, and to Sancho's propnets in particular.  These categories are component/propagation optimization, usage optimization, and transformations of the logical propnet to easier-to-compute equivalents.

Component Optimization

At the center of all the computation is the propagation of individual components in response to changes in their inputs, so getting this running efficiently is very important.  Aside from the major optimization of ANDs and ORs using a counter, which I discussed in my previous post, two further important aspects stand out:


  • Efficient enabling of multi-threaded usage.  Ultimately multi-threading is going to be low-hanging fruit for a <number of cores>-ish speed multiple, provided we can keep synchronization overheads very low.  The way we achieve this in Sancho is to entirely eschew any locking, and provide a per-thread instance of the state, which is manipulated by that thread during its propagation processing.  This boils down to maintaining, in each component, a vector for each piece of state information (which is the current output state, last propagated state, and counters for ANDs and ORs).  All getValue()/setValue() methods are then given an extra parameter, which boils down to an index allocated to each thread that will use the propnet.  We call this the instance id.  This approach allows us to maintain a single network of components (so the objects are only present once, and the topology of the network is only represented once), while giving each thread its own state variables, which removes all need for locking.  This is an example of the 'single writer' pattern (actually it's more extreme than that because it's single reader too, but all instances read shared non-volatile connectivity data).  For a great discussion of this pattern see http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html, which is a blog we'll be referencing several more times in future posts, I have no doubt.
  • Use the simplest representation you can.  In particular use of Java collections, while very expressive and easy to program with, brings with it much more overhead than using arrays.  To this end we dynamically replace all the input and output collections in our components with arrays via a method called 'crystallize()', after which it becomes illegal to attempt any modifications (adding or removing inputs or outputs).  For flexibility we initially construct the components with collection-based inputs and outputs, and retain those through an analysis and network transformation stage (which I'll be talking more about below).  Once all transformations we want to make to the network are complete, we then 'crystallize' it into a runtime form, which replaces all collections with fixed length arrays, or caches the single element in the case of single input components.  At propagation time, all access is thus done through arrays, which operate far more quickly than the collections did.

Usage Patterns of the Propnet

A number of things can be done in the implementation of the state machine, to optimize the usage pattern of the propnet, so as to minimize the number and size of the propagations that need to be done.  Here are a few of them:
  • Make the interface more natural and not require computationally expensive representation transformations and enumerations continuously.  Several examples of this, some of which are mutually synergistic:
  • Replacement of the MachineState class, which represents states in terms of collections of GDL sentences, with an InternalMachineState which instead represents them as collections of Proposition components (actually it's slightly more complex than that, and they are in fact collections of PropositionInfo objects which contain direct references to the corresponding proposition and transition objects in the propnet, as well as some meta info).  Using this representation of a state removes the need to transform (via HashMap lookups) between GDL sentences and Proposition components all the time.  Further, since we know how many base propositions there are, we can once again represent this collection as a fixed length array, and hold the collection itself as a BitSet of the indexes present, which act as pointers into the backing array.  This BitSet representation has all sorts of advantages.  For example finding the propositions that differ between two states is just an XOR operation on the BitSet.
  • Similarly for moves and 'does' propositions.  In particular, we can use a BitSet-based collection (which we call the legalMoveSet) to maintain the current set of legal moves, with a direct mapping from it to the corresponding legal and does propositions.
  • Once we have the BitSets it becomes very natural to update them using triggers, rather than enumerating through all the relevant components and querying their state in the propnet.  To do this we attach a trigger id (which is actually the index of a move or base prop in the backing array of a state or legal move set) to the legal move propositions and to the base propositions' associated transitions.  During propagation, when a component with a trigger defined undergoes an output state change it signals a trigger interface with its trigger id.  The handler for this just has to set or clear the corresponding bit in the BitSet, which is a very inexpensive operation.  This trigger-based approach removes the need to enumerate through the base propositions when processing getNextState() (to read off the new state), or through the legal propositions when processing getLegalMoves().  Since some games can have thousands of base propositions or legal moves, of which a very small fraction are typically active at any one time, this can be quite a big win.
  • In a similar way, setting a new state on the propnet to perform a differential update from its previous state is also simplified greatly by the use of BitSets.  In the state machine we store (for each instance, to maintain the same threading model) the previous state propagated, and when we are asked to propagate a new one we can just XOR the two states together to find out which propositions we need to change (again enumeration-free, except at the get-next-set-bit level of a BitSet, which is pretty cheap).
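A minimal sketch of the BitSet-backed state idea described above (the class names here are illustrative, not Sancho's actual API):

```java
import java.util.BitSet;
import java.util.function.IntConsumer;

// Sketch of a BitSet-backed machine state: each bit index corresponds to a
// base proposition in a fixed backing array.
class InternalState {
    final BitSet activeBaseProps;

    InternalState(BitSet props) {
        this.activeBaseProps = props;
    }

    // Propositions that differ between two states: a single XOR operation.
    BitSet diff(InternalState other) {
        BitSet changed = (BitSet) activeBaseProps.clone();
        changed.xor(other.activeBaseProps);
        return changed;
    }

    // Visit each changed proposition's index without enumerating every
    // proposition - just a cheap get-next-set-bit scan over the diff.
    void forEachDiff(InternalState other, IntConsumer visit) {
        BitSet changed = diff(other);
        for (int i = changed.nextSetBit(0); i >= 0; i = changed.nextSetBit(i + 1)) {
            visit.accept(i);
        }
    }
}
```

A differential update then only ever touches the propositions whose bits appear in the diff, however many thousands of base propositions the game has.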

  • Use of multiple variations of the propnet (separately instantiated via a cloning operation and then some transformation) for different purposes.  The main examples of this are:
  • What we call a 'split network' representation, which is derived as follows.  First we do a small amount of simulation (a few hundred random games) to identify which base propositions change most frequently (for most games these will be the 'control' propositions).  We then pick the most frequently changing one (which we denote the 'XProposition') and instantiate two copies of the propnet, one with the X-prop hardwired to true, and the other with it set false.  These propnets are then reduced to remove now-redundant components (both typically shrink by maybe 30% or so).  We can then label states as X-states or O-states depending on whether they have the X-prop set in them.  The state machine then uses the X-net to make all calculations on X-states, and the O-net to make calculations on O-states.  In many cases further optimization of the network is possible by identifying an O-proposition which is always true when the X-prop isn't and vice versa (typically control props in 2-player games).  In such cases we can hard-wire the O-prop too and make further reductions.  Interestingly, I developed this technique during the first run of the Coursera course, while battling for state-machine performance supremacy with Andrew, and thought that it worked because it allowed me to propagate through smaller propnets all the time.  More recently it has become apparent that that is not the major reason for its success (though it's part of it).  In fact, most of the gain comes from a greatly reduced number of propagations during node expansion.  This is because, during node expansion, we enumerate the legal moves, and for each one need to query whether it is terminal or not.  Naively (though there are other ways around this) this manifests as a pattern of method calls against states that looks like:  getNextState(A,move1), isTerminal(B), getNextState(A,move2), isTerminal(C), ... wherein we keep alternating between state A and some other state.
With the split network these calls tend to fall on different propnets, which means that the propnet performing the getNextState() processing is always in the same state, and only its does propositions are changing (much cheaper than having a bunch of base propositions changing too each time).
  • Using a goal-less propnet for playouts.  In many games the logic behind the goals is a large part of the total logic of the game (Reversi is an excellent example), but we have no interest in the goals during playouts until we reach the terminal state.  It is therefore considerably more efficient to perform playouts on a goal-less network and just propagate the terminal state through a goals-only network at the end. 
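The X-prop selection above boils down to counting, across a sample of consecutive states, how often each base proposition flips.  A rough sketch (the names here are illustrative, not Sancho's actual code):

```java
import java.util.BitSet;

// Sketch: count how often each base proposition flips between consecutive
// states in a sequence of sampled states (e.g. from a few hundred random
// games), then pick the most frequently flipping one as the 'XProposition'.
class XPropSelector {
    static int selectXProp(BitSet[] sampledStates, int numBaseProps) {
        int[] flipCounts = new int[numBaseProps];
        for (int s = 1; s < sampledStates.length; s++) {
            // Bits set in the XOR are exactly the props that flipped.
            BitSet changed = (BitSet) sampledStates[s - 1].clone();
            changed.xor(sampledStates[s]);
            for (int i = changed.nextSetBit(0); i >= 0; i = changed.nextSetBit(i + 1)) {
                flipCounts[i]++;
            }
        }
        int best = 0;
        for (int i = 1; i < numBaseProps; i++) {
            if (flipCounts[i] > flipCounts[best]) best = i;
        }
        return best;
    }
}
```

In most turn-taking games this reliably picks out a control proposition, which is exactly the kind of prop you want hardwired in each half of the split network.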


Transforming the Propnet

Lots of transformations can be performed on a propnet without changing its semantics, by replacing sub-networks with equivalent sub-networks that calculate the same outputs, but with a different layout of internal components.  Simple examples are:

Associative transforms:  (A AND B) AND C == A AND (B AND C) == (A AND B AND C).  In most cases (at least with the counter-based implementation of AND/OR) a single large AND or OR will perform better than an equivalent hierarchy.

Distributive transforms: (A OR B) AND (A OR C) == A OR (B AND C)

De Morgan transforms: NOT (A AND B) == (NOT A) OR (NOT B)

Complementary transforms: if exactly one of (A1, A2, ..., An) is known to be true, then the NOT of the OR of any subset == the OR of the complementary subset.  Since we know that for any well-formed GDL game exactly one legal move is played by each role in every state transition, this can be quite a handy way to reduce the large ORs that tend to arise from GDL 'distinct' clauses over moves.

Also, if any two components in the network compute the same value (e.g. - suppose A AND B occurs twice for some components A and B) then one can be thrown away, and whatever it was outputting to can instead be wired to the other.
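These identities are easy to sanity-check by brute force over all input assignments; for example (a throwaway checker written for this post, not Sancho code):

```java
// Brute-force verification of the boolean identities above, by enumerating
// every possible input assignment.
class IdentityCheck {
    // (A OR B) AND (A OR C) == A OR (B AND C)
    static boolean distributiveHolds() {
        for (int bits = 0; bits < 8; bits++) {
            boolean a = (bits & 1) != 0, b = (bits & 2) != 0, c = (bits & 4) != 0;
            if (((a || b) && (a || c)) != (a || (b && c))) return false;
        }
        return true;
    }

    // NOT (A AND B) == (NOT A) OR (NOT B)
    static boolean deMorganHolds() {
        for (int bits = 0; bits < 4; bits++) {
            boolean a = (bits & 1) != 0, b = (bits & 2) != 0;
            if (!(a && b) != (!a || !b)) return false;
        }
        return true;
    }

    // Complementary transform with n = 3: when exactly one of a1..a3 is
    // true, NOT(a1 OR a2) == a3.
    static boolean complementaryHolds() {
        for (int only = 0; only < 3; only++) {
            boolean a1 = (only == 0), a2 = (only == 1), a3 = (only == 2);
            if (!(a1 || a2) != a3) return false;
        }
        return true;
    }
}
```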

Sancho currently makes a number of such transformations:

  • Remove redundant components (NOT NOT, ANDs that feed only into other ANDs, similarly ORs, all constants except a single TRUE and FALSE)
  • Remove components that do not feed (directly or indirectly) into any output - transition, legal, goal, or terminal proposition (this usually only arises after other transformations)
  • Factorize common input logic to a component into a single instance on the output.  For example if an AND is fed only by ORs, all of which have some common inputs, then those inputs can be dropped on each input OR (which removes them if that just leaves either no other inputs, or a single input that wasn't a common factor) and the AND's output fed through an OR that also takes the common factor inputs.  Similarly moving ANDs over ORs
  • Identify components that are ultimately computing the same logic as other components and remove duplicates (rewiring whatever they output to to the equivalent copy)
  • Optimize logic fed by large ORs of DOES propositions in terms of the complementary set, in cases where that is significantly smaller
  • Reduce the network with respect to asserted true/false propositions (as part of the split network discussed above)
  • Remove goals (or everything but goals), as part of producing a goal-less network for reasons discussed above
Currently we do this early on during meta-gaming, but in principle there could be benefit in doing it dynamically each turn (generating a re-trimmed network on a background thread, and then switching it in, say), in order to take advantage of identified latch conditions as they become true while playing the game.

Tuesday, May 13, 2014

Propnet basics

I want to spend a few posts talking about our implementation of propnets (propositional networks), which are a representation of a game state machine in terms of:

  • A set of propositions, which may be true or false in any given state, and which together make up that state.  These are known as the 'base propositions'.  An example might be 'cell (2,2) contains an X' in TicTacToe.
  • Some informational (output) propositions, such as 'this state is terminal', or 'this state scores 75 for player 1'
  • (Output) Propositions corresponding to every possible move in the game, which are true when that move is legal in the current state
  • (Input) Propositions corresponding to every possible move in the game, set true when the associated move is being played (these will be in (1-1) correspondence with the legal propositions)
  • Outputs (known as transitions) corresponding to each base proposition, whose value is the value that the corresponding base proposition would take (in the successor state to the current state), if the moves indicated by the input move propositions were played
Each of the above, and all the connective logic, is represented by logical 'components' in what amounts to a virtual logic circuit.  The connective logic is formed of components implementing the standard logical operators (AND, OR, NOT) plus input-less components representing Boolean constants.

When implementing propnets you have two major choices to make, which broadly categorize the different types of implementation:
  1. How to perform propagation of the input values through the network efficiently.  The major choices are forward propagation (set the inputs, then calculate the components they immediately connect to, then the ones those connect to, and so on), and backward propagation (query the output proposition you're interested in and determine its state by recursively querying its inputs, and so on)
  2. How to represent the current state of the network.  Here I will split the discussion between what I am going to call internal representations and external representations.  In an internal representation the components that form the network hold their own state internally.  In an external representation the components themselves are stateless and some external structure holds their current state, so the components simply describe the network topology and semantics, not its current state.
The initial version of Sancho used a differential forward-propagation approach with an internal representation.  This has changed recently, with significant performance gains resulting from a change to an external representation, which I'll be talking about in detail in a future post.  In the remainder of this post I will describe the propagation mechanism; in the next post I will discuss a number of optimizations; and in the post after that I will discuss the move to an external representation and the benefits it brought.

Forward Propagation

At its simplest, forward propagation works (approximately) as follows:

The input values to the network are set, and each processed in turn by signalling to components connected immediately to the inputs that they should recalculate.  On being told to recalculate a component will:
  1. Query its input values and from them calculate its new output value
  2. Signal to any components connected to its output that they should recalculate
This causes recursion through the network, ultimately resulting in the new state of all outputs being calculated.  It can be thought of as a very direct modelling of an actual circuit (and actually implementing the logic as a physical circuit, say with a programmable logic array, is certainly an interesting possibility).
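A deliberately naive sketch of this recursive forward propagation (illustrative classes, not Sancho's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Naive forward propagation: each component recalculates from its inputs
// and unconditionally tells its outputs to recalculate.  As the text notes,
// this recalculates components many times over.
abstract class Gate {
    final List<Gate> inputs = new ArrayList<>();
    final List<Gate> outputs = new ArrayList<>();
    boolean value;

    void connectTo(Gate downstream) {
        outputs.add(downstream);
        downstream.inputs.add(this);
    }

    abstract boolean calculate();

    void recalculate() {
        value = calculate();       // step 1: derive new output from inputs
        for (Gate g : outputs) {
            g.recalculate();       // step 2: push the change downstream
        }
    }
}

class AndGate extends Gate {
    boolean calculate() {
        for (Gate in : inputs) {
            if (!in.value) return false;
        }
        return true;
    }
}

class InputProp extends Gate {
    boolean calculate() { return value; }

    void set(boolean v) {
        value = v;
        for (Gate g : outputs) g.recalculate();
    }
}
```

Setting two inputs of an AND in turn will recalculate it twice (and everything downstream of it twice), which is exactly the inefficiency addressed below.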

As it stands this is not very efficient, because the same components recalculate their state many times over (once each time any one of a component's possibly many inputs is recalculated).  A standard way to resolve that is to pre-calculate a dependency ordering for the network, and recalculate component states in that order (so an output is not calculated before all of its inputs, and thus only needs to be calculated once), which is loosely analogous to imposing a clock on logic circuitry.  If loops occur this needs a little refinement (loops are actually very rare and only occur in certain types of game description), but even applied naively it is generally reasonably effective.

However, we do not calculate any dependency ordering; our solution to this first problem drops out of our solution to the next one.  Even if you calculate a dependency ordering and thus calculate each component just once per state transition, there is still a major inefficiency, which arises from the typical usage patterns.

When playing most games the state evolves from turn to turn only by small changes (e.g. - one piece moves from one position to another, or a cell is marked on a board and so on).  As a result most of the base propositions remain unchanged across the vast majority of state transitions.  This means that most of the components in the propnet will have the same inputs as they did on the previous calculation, and if we can avoid recalculating them, then we should make major gains.

Differential Propagation

We can easily achieve this by holding two values in each component - the current state and the last propagated state for that component's output.  With that tweak we can modify the recalculate behavior of the component to:
  1. Query its input values and from them calculate its new output value
  2. Compare the new output value with the last propagated output value.  Only if they differ, signal to connected components that they need to recalculate
This also (largely) addresses the first problem and makes pre-calculation of the dependency ordering unnecessary.  It still has some residual benefit, because even with differential propagation more than one input to any given component could still change (causing it to be signaled multiple times).  However, this turns out to be almost a non-issue with one further change, which is itself a significant overall optimization.  In many ways the discussion of this optimization belongs in the planned next post, but because of the synergy with the propagation mechanism I'll describe it here.

Optimizing input querying

With the basic implementation above, it turns out that a major cost is each signaled component querying its inputs.  In many generated networks some components will have a large number of inputs (many tens of inputs is very typical).  Simply iterating over those inputs and retrieving their values becomes a major cost.  However, this can be entirely avoided by leveraging the fact that we already hold our previous state in order to perform differential propagation!

First, note that the only components that can have multiple inputs are AND and OR components.  Given that, suppose we not only hold the current output state in each component but also a count of the number of inputs whose value is true.  For a component with N inputs, call the number of true inputs T.  For an AND component the output is (T==N), and for an OR it is (T!=0).  If components only signal their outputs to recalculate when their output state actually changes, it follows that being signaled with true implies T := T+1, and with false that T := T-1.  Thus we can dead-reckon the new output state without having to query any other input at all.

This is not only much more efficient than looping over all the inputs, but also resolves the remaining 'multiply calculated' issue fairly effectively (most input changes won't change the output state, and empirically there isn't much left to be gained by doing things in dependency order).
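Putting the differential signalling and the true-input count together, the core of an AND/OR component reduces to something like the following sketch (names illustrative, not Sancho's actual code):

```java
import java.util.ArrayList;
import java.util.List;

// Counter-based differential propagation: AND/OR components dead-reckon
// their output from a count of true inputs (T), and only signal downstream
// components when their own output actually changes.
class LogicComponent {
    enum Kind { AND, OR }

    final Kind kind;
    final List<LogicComponent> outputs = new ArrayList<>();
    int numInputs;               // N
    int trueInputs;              // T: dead-reckoned count of true inputs
    boolean lastPropagated;      // last output value we signalled downstream

    LogicComponent(Kind kind) { this.kind = kind; }

    void connectTo(LogicComponent downstream) {
        outputs.add(downstream);
        downstream.numInputs++;
    }

    // Called by an input whose output state has just changed - no need to
    // query any other input at all.
    void signal(boolean inputNowTrue) {
        trueInputs += inputNowTrue ? 1 : -1;
        boolean newValue = (kind == Kind.AND) ? (trueInputs == numInputs)
                                              : (trueInputs != 0);
        if (newValue != lastPropagated) {
            lastPropagated = newValue;
            for (LogicComponent c : outputs) {
                c.signal(newValue);   // propagate only genuine changes
            }
        }
    }
}
```

Note how an AND with fifty inputs absorbs forty-nine true signals silently and only propagates on the fiftieth; that is where the bulk of the saving comes from.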

In my next post I'll discuss a wide range of optimizations, both small scale and large.

Sunday, May 11, 2014

What is Sancho

This post is just to provide a little background about the overall structure of the Sancho GGP player, as a basis for later posts.

Sancho is based upon the open-source ggp-base Java codebase for GGP, which may be found on GitHub at https://github.com/ggp-org/ggp-base, and which provides a large set of utility and template classes for the construction of GGP players.  The Sancho fork is also available on GitHub at https://github.com/SteveDraper/ggp-base.

At its core, Sancho is an MCTS player with a number of tweaks, the main ones being:

  • A high-performance propositional-network-based state machine.  I'll talk more about this in a future post, but for background, propositional networks (or propnets for short) are a variation of Petri nets, and represent the state machine effectively as a network of virtual logic gates, whose outputs are propositions about or within the state of the state machine.
  • Replacement of the MCTS tree with a more general graph, which allows for transitions between lines of play without duplication of nodes for any given state
  • Bounded size MCTS node structure, independent of number of expansions performed (necessary for memory scalability as the number of expansions performed increases)
  • Active trimming of lines that have fully-determined outcomes, propagating those outcomes up the MCTS structure to the highest level at which they remain fully determined (with best play)
  • An efficient infrastructure for parallelization across threads
  • Use of heuristics, determined during game setup time, to aid MCTS selection in preferentially expanding some branches
  • Static analysis to identify game factorization opportunities (currently just disjunctive ones), and the ability to play such games with factorized search
  • Static analysis to identify significant latches (ongoing work)


Saturday, May 10, 2014

Introduction

Last year I took a Coursera MOOC introducing the field of General Game Playing (GGP for short - see https://www.coursera.org/course/ggp): the study and development of computer game players capable of playing any game that is suitably described to them at runtime.

This is very different from developing a player for a specific game (say a chess player or a Go player), and having previously spent some time dabbling with a specific game player for Go, I was intrigued by this more general challenge.  Since the course, I have continued to develop my GGP player (named 'Sancho'), and have more recently been joined by Andrew Rose (who was also a participant in that first GGP MOOC) as a co-developer.  We hope to compete in the world championships later this year.

We decided to start this blog as a way of discussing interesting developments that we've undertaken as part of the ongoing enhancement of Sancho, and periodically we'll be posting about aspects of how it works, and pitfalls we've come across along the way.