Tuesday, June 17, 2014

Sancho goes green

In this post, I’ll describe how Sancho gained a significant improvement in performance - by going green.

Reduce, reuse, recycle

As with many other cities around the world, the City of Edinburgh Council encourages its residents to reduce, reuse & recycle - in that order.
  • Reduce: Don’t create the rubbish in the first place. (Buy less, buy items with less packaging, etc.)
  • Reuse: Don’t throw things away after their first use. Use them again.
  • Recycle: If you can’t reuse it, recycle it.
  • Dispose: If all else fails, put it out with the garbage.
So, what does this have to do with Sancho & GGP? One of the first things that we noticed after getting graphing going is that we had a big garbage collection problem. If you look at a typical graph from a game of Checkers a couple of weeks ago, you’ll see what I mean. Every 3 seconds or so, the JVM was performing >400ms of garbage collection. The largest spike was 1620ms of garbage collection in 1 second. (The G1 garbage collector is multi-threaded.) If you look at the slope of the memory usage graph between collections, you’ll see that Sancho is regularly allocating in excess of 130MB/s - all of which is garbage collected almost immediately.

So we set about doing something about it.

Reduce

The first and biggest difference was made by not allocating objects that we didn’t strictly need.
  • When Sancho expanded a node in the MCTS tree, it would create a node for every child - even though it would only do an MCTS simulation from one of them. This was a massive waste and meant that, in a game with a (relatively small) branching factor of 8, only 12% of the fringe nodes had any useful information stored in them. Most of them would never be visited and would be discarded when the tree was pruned at the start of a new turn.
  • There was some old code lying around that allocated and initialized two new double[]s for every node in the path through the MCTS tree for every iteration. However, this array was used for precisely nothing!
  • Sancho does the back-propagation phase recursively. However, it’s using tail recursion which can (and should) be replaced by iteration. This is a slightly subtle one. Stack frames (needed for each recursive call) aren’t allocated on the heap and garbage collected. However, they do affect garbage collection. The garbage collector has to examine every stack frame from every thread (whilst all the programs threads have been halted) to find “GC roots”. So, the deeper the callstack, the longer the GC pause. (I haven’t implemented the fix for this yet, but it should be straightforward.)

Reuse

Several frequently called routines were allocating temporary local arrays for carrying out their processing. This produced lots of garbage. We sorted these out by statically allocating (*) the arrays and other objects used in these routines and then reusing them as required.

(*) Actually, we don’t use static allocation because we support running multiple instances of Sancho within the same process. However, we have some suitable per-Sancho-instance storage that is effectively static and is what we use for this.

Recycle

Another massive reduction in garbage was achieved by recycling objects that were no longer needed. Sancho has always limited its MCTS tree to 2,000,000 nodes. In order to do that, it has to throw old nodes away and create new ones. As well as the nodes, Sancho has objects representing edges, game states, scores, etc., etc., all of which were being thrown away and new ones created - another source of garbage.

The key to addressing this problem was recycling the objects via pooling. All our frequently used objects are allocated from a pool of the appropriate type of object (or are directly embedded in a pooled object in a 1-to-1 mapping and allocated only when the parent object is allocated). When Sancho is finished with a pooled object, it returns the object to the pool where it is “recycled” (i.e. has its members reset), ready to be handed out on the next allocation request.

This removes some of the advantage of Java in that it is once again necessary to keep track of allocated objects, being sure the return them to the pool when no longer needed. However, from a little bit of programming pain comes big performance gains.

Dispose

We’re still not completely garbage free. If you look at a graph of Sancho playing the same game (Checkers), you’ll see that it now allocates <3MB/s, down from >130MB/s. And instead of doing 2000ms of garbage collection in 5 bursts across ~15 seconds, it now interrupts processing just twice to do 2000ms of garbage collection in a whole 20-minute game (not counting collection done during meta-gaming which we haven’t attempted to optimize).

2 comments:

  1. Always good to hear more about sancho dev - thanks guys for sharing. Seems since end of coursera I've spent most my time trying to fix timeouts via eliminating allocations - today I introduced some new code for simultaneous games where I thought surely a few array creations won't make a difference, and was subsequently gifted a nice 198 seconds garbage collection during chinook! :)
    Do you know if gc-roots on the stack are followed for incremental collections, or just full 'stop the world' ones?

    ReplyDelete
  2. I think stack-based GC roots are identified for all collection types.

    G1 doesn't have such a black & white approach to stop-the-world vs concurrent collection. In most collection cycles, there are some parts where everything is stopped and some concurrent parts.

    Identifying GC roots in stack frames always requires a pause and since it's done in every collection, it's important not to have very deep stacks.

    ReplyDelete