Thursday, May 15, 2014

Propnets - moving to an external representation

Once basic algorithms have been optimized there are typically three main sources of slow-downs when executing code on modern CPUs.  These are:
  1. Thread synchronization overheads.  Since we already have a single-writer design, with separate state data for each thread (aka instance), this is not an issue for our propnets: they require no locking, nor any other form of synchronization.
  2. CPU cache misses.  An L1 cache miss (on x86) will cost circa 10 instructions' worth of delay, a miss to L2 maybe 30, and a complete cache miss that requires a DRAM access the best part of 1000.  Consequently, algorithms that exhibit a high degree of cache friendliness (locality of reference, and predictable stride patterns through data) can potentially operate orders of magnitude faster than algorithms which continually access data scattered all over the memory space.  A good discussion of this can be found in this blog post on memory access patterns.  Our original propnet, which retained state in individual components, themselves located randomly about the Java heap, was extremely lacking in locality of reference, so this is an area that could yield significant performance gains through some judicious restructuring.
  3. Branch prediction misses.  Most modern CPUs are fairly deeply pipelined, meaning that they typically have a number of instructions 'in flight', at various stages of execution, at once.  Whenever a branch is encountered, the CPU has to decide which instructions to feed into the pipeline following the branch, and it has to make this decision before it knows how the branch condition will evaluate.  If it guesses wrong (a branch prediction miss), the pipeline has to be flushed and restarted, adding significant latency.  Minimizing branches, and maximizing their predictability, inside tight, very frequently executed loops can therefore result in substantial speedups.  In our case the innermost computation, which we iterate over a large number of times, is the propagation of a single component.
We addressed these issues by moving the state variables out of the components into much more cache-friendly structures, and processing them in a much tighter and more predictable manner.  To review the previous setup before going into the details of the new one: the state variables we hold in each component are the current state, the last propagated state (both boolean), and the true-input counter (for ANDs/ORs).  All of these are actually vectors, indexed by instance id, to separate the state of the different threads.  This last point (that each is actually a vector) hides another problem we will want to address while we're at it.  Because the state variables (array elements in these vectors) are adjacent to one another in memory (each array of a primitive type is a single allocation in Java), but are written by different threads, they give rise to another performance issue known as false sharing, which can also cause CPU pipeline stalls.
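To make the problem concrete, the original layout amounted to something like the following sketch (class and field names are illustrative, not the actual code).  Each component object sits wherever the Java heap happened to put it, and adjacent elements of each per-instance vector are written by different threads - the false-sharing hazard just described:

```java
// Illustrative sketch of the ORIGINAL per-component state layout.
class OrComponentOldLayout {
    final boolean[] currentState;        // indexed by instance (thread) id
    final boolean[] lastPropagatedState; // adjacent elements written by
    final int[]     trueInputCounter;    // different threads -> false sharing

    OrComponentOldLayout(int numInstances) {
        currentState        = new boolean[numInstances];
        lastPropagatedState = new boolean[numInstances];
        trueInputCounter    = new int[numInstances];
    }
}
```

Every such array is a single contiguous allocation, so elements written by different instances can share a cache line even though they are logically independent.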

In the new structure we replace all the state variables, for all the components, with a single array of ints per thread.  The elements of these arrays are the states of individual components, indexed by component id (the exact numbering of the components is functionally arbitrary provided it is consecutive from 0, though as we'll see the ordering is not performance-neutral).  Each int uses one bit for the current state, and the rest for the counter.  We can actually do away with the last propagated state, because we only ever propagate when the state changes, so it is implied by the current state.  The advantages of this basic layout are two-fold:
  • The entire state for one thread is held in a contiguous array of size equal to the number of components in the propnet.  This will easily fit into L3 cache, and for many games into L2 cache.
  • The state for different threads is entirely separated from one another, so the false sharing issue goes away.
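As a sketch (with illustrative names, not the actual code), the new per-thread layout amounts to:

```java
// One contiguous int array per thread/instance; element i is the packed
// state of component i: top bit = current output, low 31 bits = counter.
public class PropnetInstanceState {
    private final int[][] stateByInstance;

    public PropnetInstanceState(int numInstances, int numComponents) {
        stateByInstance = new int[numInstances][numComponents];
    }

    // Each instance gets its own allocation, so no false sharing.
    public int[] forInstance(int instanceId) {
        return stateByInstance[instanceId];
    }

    // The top (sign) bit doubles as the component's boolean output.
    public static boolean isActive(int packedState) {
        return packedState < 0;
    }
}
```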
We also need to be able to navigate the topology of the propnet (work out what components a given component outputs to) in a cache-friendly manner.  To do this we decant a representation of the propnet into two more arrays, which are read only once created.  The first contains one long per component, with packed bit-fields to represent:
  • The component type
  • The number of inputs it has
  • The number of outputs it has
  • The offset, in the connectivity array (see below) of the first output
The second, known as the connectivity array, contains the ids of the components that each component's outputs attach to, as contiguous sub-regions.

Between them, these two arrays entirely encode the structure of the propnet, and propagation requires only read access to them, as reasonably cache-local structures.
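A minimal sketch of the packed per-component long follows; the field widths used here are illustrative assumptions (the real layout just needs to accommodate the game's component and connectivity counts):

```java
// Pack/unpack helpers for the per-component metadata long.
// Assumed (illustrative) layout:
//   bits  0-7  : component type
//   bits  8-23 : number of inputs
//   bits 24-39 : number of outputs
//   bits 40-63 : offset, in the connectivity array, of the first output
public class ComponentInfo {
    static long pack(int type, int numInputs, int numOutputs, int outputOffset) {
        return (type & 0xFFL)
             | ((long)(numInputs  & 0xFFFF) << 8)
             | ((long)(numOutputs & 0xFFFF) << 24)
             | ((long)outputOffset << 40);
    }

    static int type(long info)         { return (int)(info & 0xFF); }
    static int numInputs(long info)    { return (int)((info >>> 8)  & 0xFFFF); }
    static int numOutputs(long info)   { return (int)((info >>> 24) & 0xFFFF); }
    static int outputOffset(long info) { return (int)(info >>> 40); }
}
```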

With this structure, propagation of a component is a loop over its outputs (a contiguous region of the connectivity table, traversed with a predictable stride), processing each output component, and recursing to propagate any output whose own state changes.  It turns out that the major cost of this loop is now the branching involved in determining each component's type, and in processing each type (which itself involves branches on the current state, the value of the counter, etc.).  Almost all of this unpredictability can, however, be removed by some further optimizations.  Firstly, we can reduce component counts by eliminating NOT components whose input component connects to nothing apart from the NOT.  Simplistically we could just add a bit flag to the component type field stating that the output is inverted, and handle the NOT in its parent component (effectively implementing NAND and NOR).  However, it turns out we can do better than that.  Recall that ANDs and ORs are actually implemented as dead-reckoned counters, incremented when an input transitions from false to true, and decremented on the opposite transition.  If we make the 'current state' bit the top bit of the value, then we can arrange for it to be switched ONLY by the counter increments and decrements, through deliberate integer overflow.  In fact, by suitably initializing the counter, we can implement all logic components (AND, OR, NAND, NOR, NOT) with exactly the same operation (increment when an input transitions to true, decrement when it transitions to false).  Specifically:
  • OR: initialize to 0x7FFFFFFF - the first increment wraps to a negative value, setting the top (state) bit
  • AND: initialize to 0x80000000 - <num inputs> - it takes <num inputs> increments to set the top bit
  • NOR: initialize to 0xFFFFFFFF - the first increment wraps to 0, clearing the state bit
  • NAND: initialize to -<num inputs> - after <num inputs> increments the top bit clears
(NOT is simply a one-input NOR, initialized to 0xFFFFFFFF.)
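These initializations can be verified directly.  The following is a minimal sketch (names are illustrative) showing that a single increment/decrement operation serves every logic component type, with the top bit of the counter acting as the output state:

```java
// Counter initializations from the text; the top (sign) bit of the counter
// IS the component's output state, toggled only by overflow of the low bits.
public class LogicCounters {
    static final int OR_INIT  = 0x7FFFFFFF;
    static final int NOR_INIT = 0xFFFFFFFF; // also NOT (a one-input NOR)
    static int andInit(int numInputs)  { return 0x80000000 - numInputs; }
    static int nandInit(int numInputs) { return -numInputs; }

    // The single operation shared by EVERY logic component type.
    static int onInputChanged(int counter, boolean inputBecameTrue) {
        return inputBecameTrue ? counter + 1 : counter - 1;
    }

    // Output state is just the sign of the counter.
    static boolean output(int counter) { return counter < 0; }
}
```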

Similarly, we can collapse transitions and propositions into generalized trigger-generating components, and use a bit flag in the component info to determine whether it is an InternalMachineState or a LegalMoveSet that acts as the handler for the trigger.  The net result is that we are left with only 2 component types, which requires just one comparison (and a fairly predictable one, since logic components typically dominate).  The logic components all use precisely the same action (inc/dec), so no branches are involved there.

Finally, we can optimize the id-ordering of components, to maximize the likelihood that consecutively processed outputs are themselves consecutively id'd components (which means increased locality and a more regular stride through the component info array).
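Putting the pieces together, the inner loop might look roughly like this.  It is a simplified sketch with an assumed bit-field layout: the two-type dispatch and trigger handling are omitted, and all field names are illustrative:

```java
// Simplified sketch of the inner propagation loop over the read-only
// componentInfo/connectivity arrays and one instance's packed state.
public class FastPropagator {
    long[] componentInfo; // packed metadata; assumed bits 24-39 = output
                          // count, bits 40-63 = offset of first output id
    int[]  connectivity;  // output component ids, contiguous per component
    int[]  state;         // packed per-component state for one instance

    // Propagate a state change at componentId to its outputs.
    void propagate(int componentId, boolean becameTrue) {
        long info = componentInfo[componentId];
        int firstOutput = (int)(info >>> 40);
        int numOutputs  = (int)((info >>> 24) & 0xFFFF);
        // Contiguous region of the connectivity table - predictable stride.
        for (int i = 0; i < numOutputs; i++) {
            int outputId = connectivity[firstOutput + i];
            boolean wasActive = state[outputId] < 0;
            state[outputId] += becameTrue ? 1 : -1; // same op for ALL logic types
            boolean nowActive = state[outputId] < 0;
            if (nowActive != wasActive) {
                propagate(outputId, nowActive);     // recurse only on state change
            }
        }
    }
}
```

Note that the loop body contains almost no data-dependent branching beyond the state-change test itself, which is exactly the property the preceding optimizations were aiming for.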

Overall, this gives us an extremely tight inner loop for the entire propagation process, which is both cache friendly and branch-prediction friendly.  Compared to our previous internal representation, this 'fast propagator' structure yields speedups of 2-4 times (varying by game).

1 comment:

  1. Just an FYI - managed to compress the representation even further (into a single structure array), which yielded about another 5-10% performance gain (circa 40,000 games per second for Connect4 now, with one thread)