Saturday, May 17, 2014

Multi-threading

In Steve's overview of Sancho, he mentioned that we have "an efficient infrastructure for parallelization across threads".  This is an area that I've done a fair bit of work on, so it seemed like a good topic for my first post.

Sancho uses several types of threads.
  1. Those threads started for us by ggp-base.  This includes the thread on which we're asked to submit moves (via the stateMachineSelectMove call).
  2. A single game searcher thread.
  3. Several simulation threads.
The first group of threads is largely uninteresting and everybody who uses ggp-base will get them automatically.  They deal with network communications.

Game searcher thread

The game searcher thread is the primary thread responsible for expanding the game tree(1).  In order to avoid the necessity of locking, this is the only thread that is permitted to write to the MCTS "tree".  This is another example of the single-writer principle, which plays a significant role in avoiding time wasted on synchronization.
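
To illustrate the idea (this is a minimal sketch, not Sancho's actual node class), the single-writer principle means a node's statistics can be plain fields, with no locks, volatiles or atomic operations:

    // Sketch of a single-writer MCTS node.  Because only the game searcher
    // thread ever calls update(), these plain fields need no synchronization.
    class TreeNode
    {
      private double scoreSum;
      private int    numVisits;

      void update(double score)
      {
        scoreSum += score;   // written only by the game searcher thread
        numVisits++;         // ditto
      }

      double averageScore()
      {
        return (numVisits == 0) ? 0 : scoreSum / numVisits;
      }
    }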

The game searcher thread is started during meta-gaming and then runs continuously throughout the game.  This means that Sancho can continue to search the game tree at all times, including (a) during the normal time for considering moves, (b) after submitting its move but whilst opponents are still thinking and (c) once all players have submitted their moves but before the server has asked for the next move (which can be a very considerable interval on the Tiltyard, especially when busy).

The game searcher thread is responsible for the MCTS "select" and "expand" phases as well as the "back-propagation" phase, because all of these phases are intimately involved with the MCTS tree.  However, the "simulate" phase of MCTS makes no use of the tree at all and can be performed independently.
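
In outline, the game searcher loop looks something like the sketch below.  The queue and method names here are illustrative stand-ins, not Sancho's real classes:

    // Illustrative outline of the game searcher loop.  The queues are
    // hypothetical stand-ins for Sancho's lock-free structures (see
    // "Pipeline" below); back-pressure handling is omitted for brevity.
    while (searching)
    {
      // Select/expand - only this thread ever writes to the MCTS tree.
      TreeNode leaf = selectAndExpand(root);

      // Hand the leaf off to the simulation threads...
      simulationRequestQueue.offer(new SimulationRequest(leaf));

      // ...and back-propagate any results that have arrived meanwhile.
      SimulationResult result;
      while ((result = completedResultQueue.poll()) != null)
      {
        backPropagate(result.node, result.averageGoalValues);
      }
    }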


Simulation threads

The simulation threads(2) are responsible for the "simulate" phase of MCTS.  Sancho uses a pool of these threads, each running completely independently from the others.  From a specified leaf node, a simulation thread performs a random "depth-charge" through the game state until a terminal state is reached.  At this point, it gets the goal values and passes them back to the game searcher thread for back-propagation.
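
In terms of ggp-base's StateMachine API, a single depth-charge looks roughly like the sketch below (Sancho actually runs the equivalent against its own propnet-based state machine):

    // Sketch of one depth-charge using ggp-base's StateMachine API.
    // Types come from org.ggp.base.util.statemachine and its exceptions
    // package.
    int[] depthCharge(StateMachine sm, MachineState leafState, List<Role> roles)
      throws TransitionDefinitionException, MoveDefinitionException,
             GoalDefinitionException
    {
      // Play random moves from the leaf state until a terminal state.
      int[] depth = new int[1];
      MachineState terminal = sm.performDepthCharge(leafState, depth);

      // Collect the goal values for every role; these are what gets passed
      // back to the game searcher thread for back-propagation.
      int[] goals = new int[roles.size()];
      for (int i = 0; i < roles.size(); i++)
      {
        goals[i] = sm.getGoal(terminal, roles.get(i));
      }
      return goals;
    }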

The highly efficient propnet implementation, combined with there being several simulation threads, means that the game searcher thread is currently the bottleneck in Sancho.  Rather than sitting idle, the simulation threads perform more than one simulation per request and feed back an average of the goal values seen.  The number of samples per request is dynamically tuned, based on how long the threads spend blocked waiting for more work, with the aim of keeping the simulation threads running just a little faster than the game searcher thread.  (Since the game searcher thread is the bottleneck, it would be a false economy for the simulation threads to do additional simulations if it meant that the game searcher thread was blocked trying to add requests to their work queues.)
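
The tuning itself can be a simple feedback loop, along the lines of this sketch (the thresholds and limit here are invented for illustration):

    // Each simulation thread periodically measures the fraction of its
    // time spent blocked waiting for work and nudges its sample count.
    private static final int MAX_SAMPLES_PER_REQUEST = 100;  // invented limit
    private int samplesPerRequest = 1;

    void retune(double fractionOfTimeBlocked)
    {
      if (fractionOfTimeBlocked > 0.10 && samplesPerRequest < MAX_SAMPLES_PER_REQUEST)
      {
        samplesPerRequest++;  // often idle - do more work per request
      }
      else if (fractionOfTimeBlocked < 0.02 && samplesPerRequest > 1)
      {
        samplesPerRequest--;  // never idle - we're holding the pipeline up
      }
    }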


Pipeline

As a result of the threading model above, each MCTS iteration is handled first by the game searcher thread (for select/expand), then by one of the simulation threads (for simulation), and finally by the game searcher thread again (for back-propagation).

A couple of years ago, software engineers at LMAX, a high-frequency financial trading platform, discovered that a significant proportion of time was wasted putting messages into queues and then de-queuing them again, because this requires locking for access to the queue structures.  Furthermore, the locks are usually contended, because inevitably either the producer runs faster than the consumer or vice versa, so all the activity is at either the head or the tail of the queue.  To deal with message passing more efficiently, they invented the Disruptor, a lock-free message-passing mechanism.

The Disruptor is written in Java and freely available.  Sadly, the Disruptor code isn't a good fit for Sancho's threading model (in particular, passing back and forth between two threads for three phases of processing).  However, the principles still apply and Sancho uses structures inspired by the Disruptor and its Ring Buffer to achieve the same high-performance lock-free architecture for performing MCTS.
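
The core idea is a pre-allocated ring buffer in which each variable is written by exactly one thread, so ordinary volatile publication replaces locks.  A minimal single-producer/single-consumer version (a simplified sketch, not Sancho's actual implementation) might look like this:

    // Minimal single-producer/single-consumer ring buffer in the spirit of
    // the Disruptor.  Only the producer thread writes 'tail' and only the
    // consumer thread writes 'head'; the volatile writes publish the slots.
    class SpscRingBuffer<T>
    {
      private final Object[] slots;
      private final int mask;
      private volatile long head = 0;  // next slot to read (consumer only)
      private volatile long tail = 0;  // next slot to write (producer only)

      SpscRingBuffer(int capacity)  // capacity must be a power of two
      {
        slots = new Object[capacity];
        mask = capacity - 1;
      }

      boolean offer(T item)  // called only by the producer thread
      {
        if (tail - head == slots.length) return false;  // full
        slots[(int)(tail & mask)] = item;
        tail++;  // volatile write publishes the slot to the consumer
        return true;
      }

      @SuppressWarnings("unchecked")
      T poll()  // called only by the consumer thread
      {
        if (head == tail) return null;  // empty
        T item = (T)slots[(int)(head & mask)];
        head++;  // volatile write frees the slot for reuse
        return item;
      }
    }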


Thread affinity

Sancho uses processor affinity for tying CPU-intensive threads to particular cores (or hyper-threads on hyper-threaded systems).  Whether this is beneficial or harmful appears to depend on the processor involved.  On an i7-4770 using 4 out of 8 hyper-threads for CPU-intensive work, we see a 5% improvement from binding each thread to its own core.  On an i5-3470, using 4 out of 4 cores (no hyper-threading) we see no significant difference.  On an i5-3540M, using 2 out of 4 hyper-threads, we see an unquantified but significant decrease in performance by using processor affinity.
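
Java has no standard API for processor affinity, so this has to go through native code.  As an illustration (not necessarily the mechanism Sancho uses), the open-source Java-Thread-Affinity library lets a CPU-intensive thread pin itself to a core like this:

    // Illustration using net.openhft.affinity.AffinityLock; Sancho's
    // actual affinity mechanism may differ.
    AffinityLock lock = AffinityLock.acquireCore();
    try
    {
      runGameSearchLoop();  // this thread now stays on its own core
    }
    finally
    {
      lock.release();
    }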


Areas for development

I doubt we've finished working on Sancho's threading just yet.
  • The dynamic adjustment of simulations per request isn't as stable as it ought to be.  That just needs more debugging.  Perhaps the statistics are thrown off by garbage collection, other processes on the system or some other factor.  Who knows?
  • The game searcher thread really is the bottleneck.  It's quite probable that we can't improve its performance sufficiently for it to keep pace with the simulation threads.  However, perhaps we can push some of the select/expand work onto the simulation threads without violating the single-writer principle (violating it would require locking).
  • Perhaps it's possible to create a lock-free MCTS structure that can be operated on by several threads simultaneously - but that's for another day.


Footnotes

(1) Because Sancho detects multiple sets of moves leading to the same state (i.e. transpositions), the MCTS "tree" isn't a tree at all.  Instead, it's a directed acyclic graph.
(2) We call these threads "rollout" threads, but "simulate" seems to be the normal terminology for the 3rd step in the MCTS process.

5 comments:

  1. To avoid asymmetry and also as it seemed a quicker solution to implement at the time, in my player all threads do the same, i.e. run the MCTS loop in parallel on the same tree. The loop looks like:
    BackPropagate(node, utility);
    node = Select();
    node = Expand(node);
    utility = Simulate(node);
    I currently have a lock around the first two calls, which are very quick compared to the other two, which can happen in parallel.  This is for ease of implementation; in theory, I think the first two phases could use just interlocked operations.

    1. If you don't lock the expand phase, what happens when the node you are part way through expanding on one thread is selected on another?

  2. Selection is handled by Select() (inside the lock), which also marks nodes that are going to be expanded as unavailable to other threads (until BackPropagate marks them as available again).

    1. Actually, now that I think about it, it should be possible to move the 'unmarking' of the node to just after Expand rather than waiting until the BackPropagate() call, as no synchronization is necessary for that.

    2. There might be a problem with this (though I couldn't say how impactful).  Specifically, if you mark a node as unavailable until back-propagation unmarks it, you force selection to go for other alternatives, which broadens the search more than pure UCT would want when, say, a particularly good move ought to receive the vast bulk of the selections going through it.

      A similar issue occurs with Sancho's architecture, because rollout requests sit on a queue for a while before their results are known.  To address that, we back-propagate a visit count update separately from the score update (and do the visit count update during the selection pass).  This allows UCT to see the queued exploration rather than only the fully-processed results, so it can decide whether to revisit the same node even before the previous result is known.  In particular, it can roll out all the children of a recently expanded node before the rollout of the node itself has a known result.
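
      A sketch of that two-part update (all names here are invented for illustration):

        // During the selection pass, visit counts along the selected path
        // are bumped immediately, so UCT sees in-flight rollouts...
        void recordSelection(TreeNode node)
        {
          node.numVisits++;
        }

        // ...and when a rollout result finally arrives, only the scores
        // are back-propagated.
        void backPropagateScore(List<TreeNode> path, double score)
        {
          for (TreeNode node : path)
          {
            node.scoreSum += score;
          }
        }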
