Wednesday, May 14, 2014

Propnet Optimization

In this post I'm going to talk about three broad categories of optimization pertaining to propnets, and to Sancho's propnets in particular.  These categories are component/propagation optimization, usage optimization, and transformations of the logical propnet to easier-to-compute equivalents.

Component Optimization

At the center of all the computation is the propagation of individual components in response to changes in their inputs, so getting this running efficiently is very important.  Aside from the major optimization of ANDs and ORs using a counter, which I discussed in my previous post, two further important aspects stand out:


  • Efficient enabling of multi-threaded usage.  Ultimately multi-threading is going to be low-hanging fruit for a <number of cores>-ish speed multiple, provided we can keep synchronization overheads very low.  The way we achieve this in Sancho is to eschew locking entirely and provide a per-thread instance of the state, which is manipulated by that thread during its propagation processing.  This boils down to maintaining, in each component, a vector for each piece of state information (the current output state, the last propagated state, and the counters for ANDs and ORs).  All getValue()/setValue() methods then take an extra parameter: an index allocated to each thread that will use the propnet, which we call the instance id.  This approach allows us to maintain a single network of components (so the objects are only present once, and the topology of the network is only represented once), while giving each thread its own state variables, which removes all need for locking.  This is an example of the 'single writer' pattern (actually it's more extreme than that, because each instance is single reader too, although all instances read the shared non-volatile connectivity data).  For a great discussion of this pattern see http://mechanical-sympathy.blogspot.com/2011/09/single-writer-principle.html, a blog I have no doubt we'll be referencing several more times in future posts.
  • Use the simplest representation you can.  In particular, Java collections, while very expressive and easy to program with, bring with them much more overhead than arrays.  To this end we dynamically replace all the input and output collections in our components with arrays via a method called 'crystallize()', after which it becomes illegal to attempt any modifications (adding or removing inputs or outputs).  For flexibility we initially construct the components with collection-based inputs and outputs, and retain those through an analysis and network transformation stage (which I'll be talking more about below).  Once all the transformations we want to make to the network are complete, we 'crystallize' it into a runtime form, which replaces all collections with fixed-length arrays (or caches the single element in the case of single-input components).  At propagation time all access is thus done through arrays, which is far quicker than going through the collections.  A rough sketch combining this with the instance-id mechanism follows.
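
By way of illustration, here is a minimal sketch of how these two ideas can fit together.  The class and method names (Component, AndComponent, addInput, onInputChanged) are illustrative rather than Sancho's actual code; only crystallize(), getValue()/setValue() and the instance id come from the description above.

import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: per-instance state arrays indexed by a thread's instance
// id, plus a crystallize() step that swaps collection-based connectivity for
// plain arrays.
abstract class Component
{
  // Mutable connectivity, used only during construction and transformation.
  private List<Component> inputList = new ArrayList<>();

  // Fixed connectivity after crystallize(): read-only, shared by all threads.
  protected Component[] inputs;

  // One state slot per propnet instance (i.e. per thread), so no locking is needed.
  protected boolean[] cachedValue;

  void addInput(Component c)
  {
    if (inputs != null)
    {
      throw new IllegalStateException("Cannot modify a crystallized component");
    }
    inputList.add(c);
  }

  // Replace the collections with arrays and allocate per-instance state.
  void crystallize(int numInstances)
  {
    inputs = inputList.toArray(new Component[0]);
    inputList = null;
    cachedValue = new boolean[numInstances];
  }

  boolean getValue(int instanceId)
  {
    return cachedValue[instanceId];
  }
}

// An AND gate using the counter-based optimization, with one counter per instance.
class AndComponent extends Component
{
  private int[] trueInputCount;

  @Override
  void crystallize(int numInstances)
  {
    super.crystallize(numInstances);
    trueInputCount = new int[numInstances];
  }

  // Called by an input whose output has just flipped for the given instance.
  void onInputChanged(boolean newInputValue, int instanceId)
  {
    trueInputCount[instanceId] += newInputValue ? 1 : -1;
    boolean newValue = (trueInputCount[instanceId] == inputs.length);
    if (newValue != cachedValue[instanceId])
    {
      cachedValue[instanceId] = newValue;
      // ...notify each of this component's (array-based) outputs here...
    }
  }
}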

Usage Patterns of the Propnet

A number of things can be done in the implementation of the state machine, to optimize the usage pattern of the propnet, so as to minimize the number and size of the propagations that need to be done.  Here are a few of them:
  • Make the interface more natural, so that it does not continuously require computationally expensive representation transformations and enumerations.  Several examples of this follow, some of which are mutually synergistic:
  • Replacement of the MachineState class, which represents states in terms of collections of GDL sentences, with an InternalMachineState which instead represents them as collections of Proposition components (actually it's slightly more complex than that, and they are in fact collections of PropositionInfo objects which contain direct references to the corresponding proposition and transition objects in the propnet, as well as some meta info).  Using this representation of a state removes the need to transform (via HashMap lookups) between GDL sentences and Proposition components all the time.  Further, since we know how many base propositions there are, we can once again represent this collection as a fixed-length array, and hold the collection itself as a BitSet of the indexes present, which act as pointers into the backing array.  This BitSet representation has all sorts of advantages.  For example, finding the propositions that differ between two states is just an XOR operation on the BitSets.
  • Similarly for moves and 'does' propositions.  In particular we can use a BitSet-based collection (which we call the legalMoveSet) to maintain the current set of legal moves, and have a direct mapping from it to the corresponding legal and does propositions.
  • Once we have the BitSets it becomes very natural to update them using triggers rather than enumerating through them all and querying the state of their corresponding components in the propnet.  To do this we attach a trigger id (which is actually the index of the move or base prop in the backing array of a state or legal move set) to the legal move propositions and to the base propositions' associated transitions.  During propagation, when a component with a trigger defined undergoes an output state change it signals a trigger interface with its trigger id.  The handler for this just has to set or clear the corresponding bit in the BitSet, which is a very inexpensive operation.  This trigger-based approach removes the need to enumerate through the base propositions when processing getNextState() (to read off the new state), or through the legal propositions when processing getLegalMoves().  Since some games can have thousands of base propositions or legal moves, of which a very small fraction are typically active at any one time, this can be quite a big win.
  • In a similar way, setting a new state on the propnet to perform a differential update from its previous state is also simplified greatly by the use of BitSets.  In the state machine we store (for each instance, to maintain the same threading model) the previous state propagated, and when we are asked to propagate a new one we can just XOR the two states together to find out which propositions we need to change (again enumeration-free, except at the get-next-set-bit level of a BitSet, which is pretty cheap).  A minimal sketch of these BitSet-based pieces follows.
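
To make the above a little more concrete, here is a rough sketch.  The class and method names (InternalStateSketch, LegalMoveSetSketch, onTrigger, setState) are illustrative, not Sancho's actual classes, and the real objects carry extra metadata; only the BitSet representation, the trigger id, and the XOR-based differential update come from the description above.

import java.util.BitSet;

// Illustrative sketch of a BitSet-backed state, a trigger-driven legal move set,
// and a differential state update.
class InternalStateSketch
{
  final BitSet activeBaseProps = new BitSet();   // indexes into the base proposition array

  // The base propositions whose values differ between this state and 'other'.
  BitSet diff(InternalStateSketch other)
  {
    BitSet changed = (BitSet)activeBaseProps.clone();
    changed.xor(other.activeBaseProps);
    return changed;
  }
}

class LegalMoveSetSketch
{
  private final BitSet legalMoves = new BitSet();

  // Signalled by a legal proposition whose output has flipped; the trigger id is
  // the move's index in the backing array, so the update is O(1).
  void onTrigger(int triggerId, boolean nowLegal)
  {
    legalMoves.set(triggerId, nowLegal);
  }
}

class DifferentialUpdateSketch
{
  // Move the propnet from 'previous' to 'next' by touching only the changed bits.
  void setState(InternalStateSketch previous, InternalStateSketch next)
  {
    BitSet changed = previous.diff(next);
    for (int i = changed.nextSetBit(0); i >= 0; i = changed.nextSetBit(i + 1))
    {
      boolean newValue = next.activeBaseProps.get(i);
      // ...set base proposition i to newValue and propagate the change,
      // firing triggers (like onTrigger above) as outputs flip...
    }
  }
}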

  •  Use of multiple variations of the propnet (separately instantiated via a cloning operation and then some transformation) for different purposes.  The main examples of this are:
  • What we call a 'split network' representation, which is derived as follows.  First we do a small amount of simulation (a few hundred random games) to identify which base propositions change most frequently (for most games this will be the 'control' propositions).  We then pick the most frequently changing one (which we denote the 'XProposition') and instantiate two copies of the propnet, one with the X-prop hardwired to true, and the other with it hardwired to false.  These propnets are then reduced to remove now-redundant components (both typically shrink by maybe 30% or so).  We can then label states as X-states or O-states depending on whether they have the X-prop set in them.  The state machine then uses the X-net to make all calculations on X-states, and the O-net to make calculations on O-states.  In many cases further optimization of the network is possible by identifying an O-proposition which is always true when the X-prop isn't, and vice versa (typically the control props in 2-player games).  In such cases we can hard-wire the O-prop too and make further reductions.  Interestingly, I developed this technique during the first run of the Coursera course, while battling for state-machine performance supremacy with Andrew, and thought that it worked because it allowed me to propagate through smaller propnets all the time.  More recently it has become apparent that that is not the major reason for its success (though it's part of it).  In fact, most of the gain comes from a greatly reduced number of propagations during node expansion.  This is because, during node expansion, we enumerate the legal moves and for each one need to query whether the resulting state is terminal or not.  Naively (though there are other ways around this) this manifests as a pattern of method calls against states that looks like getNextState(A,move1), isTerminal(B), getNextState(A,move2), isTerminal(C), ... (where B and C are the resulting child states), so we keep alternating between state A and some other state.  With the split network these calls tend to fall on different propnets, which means that the propnet performing the getNextState() processing is always in the same state, and only its does propositions are changing (much cheaper than having a bunch of base propositions changing too each time).  A sketch of this routing follows this list.
  • Using a goal-less propnet for playouts.  In many games the logic behind the goals is a large part of the total logic of the game (Reversi is an excellent example), but we have no interest in the goals during playouts until we reach the terminal state.  It is therefore considerably more efficient to perform playouts on a goal-less network and just propagate the terminal state through a goals-only network at the end. 
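
Here is a minimal sketch of how a state machine might route queries between the two halves of a split network.  The names (Propnet, SplitNetworkRouter, netFor, xPropIndex) are placeholders rather than Sancho's actual API; only the idea of two hardwired copies selected by the X-prop bit comes from the description above.

import java.util.BitSet;

// Placeholder for whichever propnet representation is in use.
interface Propnet
{
}

// Illustrative routing between the two halves of a split network: every query on
// a state goes to the copy whose hardwired X-prop value matches that state, so
// repeated getNextState() calls from the same parent hit a net that is already in
// the right base state and only needs its does propositions changed.
class SplitNetworkRouter
{
  private final int     xPropIndex;   // backing-array index of the chosen X-proposition
  private final Propnet xNet;         // copy with the X-prop hardwired true
  private final Propnet oNet;         // copy with the X-prop hardwired false

  SplitNetworkRouter(int xPropIndex, Propnet xNet, Propnet oNet)
  {
    this.xPropIndex = xPropIndex;
    this.xNet = xNet;
    this.oNet = oNet;
  }

  Propnet netFor(BitSet state)
  {
    return state.get(xPropIndex) ? xNet : oNet;
  }
}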


Transforming the Propnet

Lots of transformations can be performed on a propnet without changing its semantics, by replacing sub-networks with equivalent sub-networks that calculate the same outputs, but with a different layout of internal components.  Simple examples are:

Associative transforms:  (A AND B) AND C == A AND (B AND C) == (A AND B AND C).  In most cases (at least with the counter based implementation for AND/OR) a single large AND or OR will perform better than an equivalent hierarchy.

Distributive transforms: (A OR B) AND (A OR C) == A OR (B AND C)

De Morgan transforms: NOT (A AND B) == (NOT A) OR (NOT B)

Complementary transforms: if exactly one of A1, A2, ..., An is true, then for any subset of them, NOT(the OR of that subset) == the OR of the complementary subset.  For example, if exactly one of D1, D2, D3 is true, then NOT(D1 OR D2) == D3.  Since in any well-formed GDL game exactly one legal move is played for each role in each state transition, this can be quite a handy way to reduce the large ORs that tend to arise from GDL 'distinct' clauses over moves.

Also, if any two components in the network compute the same value (e.g. suppose A AND B occurs twice for some components A and B) then one can be thrown away, and whatever it was outputting to can instead be wired to the other, as sketched below.
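
One possible way to find such duplicates (just a sketch, not necessarily how Sancho's pass is written) is to key each component on its type and the set of components that feed it, and merge any two that collide; repeating the pass until nothing changes also catches components that only become identical once their own inputs have been merged.  A real pass would restrict this to pure logic gates (ANDs, ORs, NOTs) and leave named propositions and transitions alone.  The names here are illustrative, and the sketch assumes a Component type exposing its inputs array, as in the earlier sketch.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

class DuplicateMergerSketch
{
  // Two components with equal signatures compute the same value: same gate type,
  // same set of inputs (compared by reference identity).
  static Object signature(Component c)
  {
    return Arrays.asList(c.getClass(), new HashSet<>(Arrays.asList(c.inputs)));
  }

  // One merging pass; the caller repeats while this returns true.
  static boolean mergeDuplicatesOnce(Iterable<Component> components)
  {
    Map<Object, Component> canonical = new HashMap<>();
    boolean merged = false;
    for (Component c : components)
    {
      Component existing = canonical.putIfAbsent(signature(c), c);
      if (existing != null)
      {
        // ...rewire everything that c feeds to read from 'existing' instead,
        // then drop c from the network...
        merged = true;
      }
    }
    return merged;
  }
}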

Sancho currently makes a number of such transformations:

  • Remove redundant components (NOT NOT, ANDs that feed only into other ANDs, similarly ORs, all constants except a single TRUE and FALSE)
  • Remove components that have no useful output (no transition and no legal, goal, or terminal proposition) anywhere in their descendant network (this usually only arises after other transformations); a sketch of this pass appears at the end of this section
  • Factorize logic common to a component's inputs into a single instance on its output.  For example, if an AND is fed only by ORs, all of which share some common inputs, then those common inputs can be dropped from each feeding OR (an OR being removed entirely if that leaves it with no other inputs, or replaced by its single remaining input if only one non-common input is left), and the AND's output fed through a new OR that also takes the common inputs.  ANDs can similarly be moved over ORs.
  • Identify components that are ultimately computing the same logic as other components and remove the duplicates (rewiring whatever they fed into to the surviving copy)
  • Optimize logic fed by large ORs of DOES propositions in terms of the complementary set, in cases where that is significantly smaller
  • Reduce the network with respect to asserted true/false propositions (as part of the split network discussed above)
  • Remove goals (or everything but goals), as part of producing a goal-less network for reasons discussed above
Currently we do this early on during meta-gaming, but in principle there could be benefit in doing it dynamically each turn (generating a re-trimmed network on a background thread and then switching it in, say), in order to take advantage of identified latch conditions as they become true during play.
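
As a concrete illustration of the pruning pass referenced above (removing components with no transition or legal/goal/terminal proposition in their descendant network), the pass can be framed as a reverse reachability sweep from the outputs that matter.  This is a sketch rather than Sancho's actual code, and again assumes a Component type exposing its inputs.

import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class DeadLogicPrunerSketch
{
  // usefulOutputs = the transitions plus the legal, goal and terminal propositions.
  // Everything returned feeds (directly or indirectly) at least one useful output;
  // any component not in the returned set can be removed from the network.
  static Set<Component> componentsWorthKeeping(Collection<Component> usefulOutputs)
  {
    Set<Component> keep = new HashSet<>(usefulOutputs);
    Deque<Component> queue = new ArrayDeque<>(usefulOutputs);
    while (!queue.isEmpty())
    {
      for (Component input : queue.pop().inputs)
      {
        if (keep.add(input))
        {
          queue.add(input);
        }
      }
    }
    return keep;
  }
}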

4 comments:

  1. I wonder why you decided to use BitSet rather than boolean[]. I suppose it's because of built-in operations such as xor, but on the other hand, boolean[] is faster.

    On the other hand, if you store the network state externally, and there are many components, BitSet might be faster because it'll fit into fewer cache lines.

  2. Mostly because it was a convenient abstraction and took less memory (important since we hold a state in every tree node). It also implements things like XOR faster than a boolean[] array would (it uses intrinsics), and probably is also faster for operations like finding the first set bit after a given index for the same reason. I have considered rewriting the class to use long[] and doing the bit-packing myself because that would yield faster enumeration and probably not hurt any other operation much, if at all. However, it doesn't show up as hot on the profiling, so it's not worth spending effort on currently.

  3. We've subsequently moved to using OpenBitSet, which is much faster than the standard BitSet and gives us some additional operations that we wanted in a much more efficient fashion. For example, it can compute the size of an intersection of two sets without constructing an intermediate object to represent the intersection of the two sets. Not only is this good for performance generally, it's also critical for avoiding garbage and the consequent GC problems.
