(Who ate all the pies?)

In the International General Game Playing Championship 2014, it struck me that many of the games that were played had significant bias for one player (usually, but not always, the first). In an attempt to compensate, many games were played both ways round, often with the predictable outcome. This was unfortunate because it increased the amount of time required to play games, reduced the excitement (by having predictable outcomes) and effectively reduced the best-of-3 rounds to a best-of-one (two matches of a game with a predictable outcome and then the decider - usually a game where the bias is unknown because it isn't played on Tiltyard and therefore statistics aren't available).

Whilst reading up on various abstract strategy games, I was introduced to the concept of the Pie rule. Based on the traditional method for children to divide a cake ("you cut, I'll choose"), it aims to balance two-player games by giving the second player the opportunity to swap places with the first player after the first move. This way, the first player has an incentive to make the most balanced opening move that he possibly can. (If he makes an opening move that is too strong, the second player will swap with him and get the strong position. If he plays an opening move that is too weak, the second player will leave him with it.)
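The incentive can be sketched in a few lines of Python. This is purely illustrative: the move names and win-probability estimates are made up, and `choose_opening` and `should_swap` are hypothetical helpers, not part of any GDL rule sheet.

```python
def choose_opening(first_player_value):
    """First player's best opening under the pie rule: the move whose
    estimated first-player win probability is closest to 0.5."""
    return min(first_player_value, key=lambda m: abs(first_player_value[m] - 0.5))


def should_swap(value_for_first_player):
    """Second player swaps exactly when the opening favours the first player."""
    return value_for_first_player > 0.5


# Hypothetical estimates of the first player's win probability per opening.
estimates = {"centre": 0.80, "edge": 0.55, "corner": 0.30}
opening = choose_opening(estimates)
print(opening, should_swap(estimates[opening]))
```

Whatever the first player picks, the second player ends up with the better side of it, so the best the first player can do is minimise the imbalance.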

In the last few days, I have created Pie-rule variants of 9-board Tic-Tac-Toe and Hex. (I've also created a visualization for regular Hex so that it can be played on Tiltyard.) It's early days yet, but I notice that the pie rule seems to be doing the trick for 9BTTT. In the first 16 matches played on the Tiltyard, 9 went with the first player and 7 with the second. Whilst it's still a small sample size, that's looking substantially more balanced than the regular version.
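To put a rough number on "small sample size", here is a sketch of an exact two-sided binomial test against a fair coin. It assumes the 16 matches are independent and that none were drawn, which is my simplification.

```python
from math import comb


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value under the null hypothesis p = 0.5."""
    observed = comb(trials, successes)
    # Sum the probability of every outcome at least as unlikely as the observed one.
    extreme = sum(comb(trials, k) for k in range(trials + 1)
                  if comb(trials, k) <= observed)
    return extreme / 2 ** trials


print(round(two_sided_binomial_p(9, 16), 3))  # 0.804
```

A 9-7 split is entirely consistent with a balanced game, but 16 matches cannot rule out a moderate bias either.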

So, for the GDL authors out there (by which I suppose I mean Alex!), consider the Pie rule for creating balance in your game.

For everybody else, what games would you like to see a Pie rule variant of? Whilst I'm certainly not making any promises, if you post in the comments, I'll consider doing them.

## Wednesday, August 20, 2014

## Sunday, August 17, 2014

### Goal Latches

Latches are properties of a game's state which, once they become true, are guaranteed to remain true throughout the remainder of the game (positively latched); or which, once they become false, are guaranteed to remain false (negatively latched). A 'property' in this sense can be any logical function of the game's base state - most typically we talk about the simplest case of this, which is the state of an individual base proposition.

Identifying latches can be useful in several ways:

- It may allow reduced computational effort in calculating state transitions (e.g. - if a part of the propnet is effectively 'dark' due to a latch condition, then once that latch is detected that part of the propnet can be left uncalculated in future state transitions).
- Similarly it may allow reduced computational effort in heuristic calculation
- If the latch constrains the achievable goals then it can be used to optimize search

This post is about the use Sancho makes of the last case - i.e. - situations in which it detects that the achievable score range has been reduced by latching of one or more of the goal propositions (at most one for each role can become positively latched, in which case the score for that role is then fully determined in all possible paths, but several may become independently negatively latched, which serves to reduce the size of the set of achievable scores).

Latch detection might be analytic or heuristic in nature. I'll come back to heuristic latch detection in a future post - for this one I'm only considering analytic detection, though the use to which the result is put can be the same in either case.

Analytic latch detection is based on logical reasoning about the structure of the game. In Sancho's case this is done by analysis of the propnet, and currently we only identify fairly simple cases in which a base proposition feeds back to itself via logic of one of the following forms (where X is the base prop, and X' is its next value on state transition):

X' = X OR <other condition> (X is positively latched)

or

X' = X AND <other condition> (X is negatively latched)

Furthermore, if a goal proposition is of the form

G = X AND <other condition>, and X is negatively latched, then G is said to be negatively latched by X.

Similarly for the positive latch case:

G = X OR <other condition>

If we can identify such goal latches, we know that the set of achievable scores in any state of the game reachable from a state with latched goal propositions is either fully determined (positively latched goal), or is strictly reduced from the set of all possible goal values to some subset.
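As a rough illustration of this kind of structural check (not Sancho's propnet code), the sketch below represents logic as nested tuples and recognises only the two simple self-feeding forms. It takes the negative-latch form to be X' = X AND <other condition>, and all proposition names are hypothetical.

```python
def classify_latch(prop, next_expr):
    """Classify a base prop from its next-state expression.

    Expressions are a proposition name or a tuple such as ("OR", a, b).
    Returns "positive", "negative" or None.
    """
    if isinstance(next_expr, tuple):
        op, *args = next_expr
        if prop in args:
            if op == "OR":
                return "positive"   # once true, X' = X OR ... stays true
            if op == "AND":
                return "negative"   # once false, X' = X AND ... stays false
    return None


def goal_latched_by(goal_expr, latches):
    """Base props that latch a goal of the form X AND/OR <other condition>."""
    if not isinstance(goal_expr, tuple):
        return []
    op, *args = goal_expr
    wanted = {"AND": "negative", "OR": "positive"}.get(op)
    return [(a, wanted) for a in args
            if isinstance(a, str) and wanted and latches.get(a) == wanted]


next_state = {
    "failed": ("OR", "failed", "stepped_off"),
    "king_alive": ("AND", "king_alive", "not_captured"),
    "turn": ("NOT", "turn"),
}
latches = {p: classify_latch(p, e) for p, e in next_state.items()}
print(goal_latched_by(("AND", "king_alive", "reached_end"), latches))
```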

In Sancho this surfaces as a method on the state machine which returns the achievable goal range for any given state, as [min,max]. This range can be used in the MCTS search tree processing in several ways:

- If min == max, then all possible paths result in the same score. If this occurs for our role then whatever we play from that point on can have no impact on the eventual outcome. Consequently we can treat any tree node in which this occurs as if it were terminal, and not search below it (the one exception being that if this occurs on the root node we expand one level just so that we have legal moves to choose from when asked to return a move)
- If a terminal state is encountered with a score of the max achievable, we can invoke tree trimming by propagating upwards when the role deciding on the move at this branch is the one with the max-achievable score (i.e. - this amounts to a 'win' relative to the parent state and no further searching is needed since it cannot be improved upon)
- If a move reduces the achievable range (either at the top or at the bottom) then we can trivially define a heuristic to favor search on paths that increase the min achievable score, and dis-favor those that decrease the max achievable.
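The three uses above can be sketched roughly as follows. The `Node` class, function names and the linear bias formula are my own illustrative assumptions, not Sancho's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    goal_range: tuple      # (min, max) achievable score for the deciding role
    is_root: bool = False


def should_expand(node):
    """Skip nodes whose outcome is already fixed, except the root,
    which is expanded one level so there are moves to choose from."""
    lo, hi = node.goal_range
    return node.is_root or lo != hi


def is_effective_win(terminal_score, parent_range):
    """A terminal score equal to the parent's max achievable cannot be bettered."""
    return terminal_score == parent_range[1]


def range_bias(parent_range, child_range, weight=1.0):
    """Favour moves that raise the floor, disfavour those that lower the ceiling."""
    raised_min = child_range[0] - parent_range[0]
    lowered_max = parent_range[1] - child_range[1]
    return weight * (raised_min - lowered_max)


print(should_expand(Node((50, 50))), range_bias((0, 100), (50, 100)))
```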

Because latch detection is typically very cheap (at search time, once the analysis has been done during meta-gaming) we can also use the change in achievable score as a heuristic during playouts to guide otherwise random playouts along paths that tend to maximize the expected goal value, on the assumption (within the playout) that all achievable scores are equally likely. This can most simply be done by simply down-weighting moves that decrease the choosing role's max score, and up-weighting those that increase the min. In zero-sum games, up-weighting those that decrease the max opponent role scores, and down-weighting those that increase their min scores can also be applied. Currently we do this VERY crudely and handle the 'decrease weight' cases by preventing those moves being selected at all in the presence of any other choice!
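A minimal sketch of that crude playout policy, for a single role. The `range_after` callback (returning the achievable [min, max] after a move) is hypothetical, and strictly preferring min-raising moves is a simplification of up-weighting them.

```python
import random


def playout_move(moves, current_range, range_after, rng=random):
    """Pick a playout move, never lowering our max score if any other choice exists."""
    cur_min, cur_max = current_range
    # Moves that keep the ceiling where it is.
    safe = [m for m in moves if range_after(m)[1] >= cur_max]
    # Among those, moves that also raise the floor.
    raising = [m for m in safe if range_after(m)[0] > cur_min]
    return rng.choice(raising or safe or list(moves))


# Hypothetical achievable ranges after each move.
ranges = {"lose_king": (0, 50), "quiet": (0, 100), "take_king": (50, 100)}
print(playout_move(list(ranges), (0, 100), ranges.get))
```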

Some examples help illustrate where these techniques give benefit, so I'll discuss a few in the following paragraphs.

**The Untwisty Corridor Family**

This family of games is basically just a set of test cases for latch detection. They are puzzles involving traversing a maze (logically speaking anyway - none of them actually have a visualization currently, so far as I am aware). Any step onto the wrong path sets a 'failure' proposition in the game's state, and the game terminates after a fixed number of moves. Consequently you must make the 'correct' choice for each of a fixed number of steps (in 'untwisty corridor' you need to go straight every move, hence the name). Because a wrong choice sets a latched proposition, which ultimately determines the goal values, this is a simple latched-goal case, where any wrong move immediately fully determines the final score (as 0). Because all such states are immediately treated as if they were terminal, and not searched, the effect is that the MCTS search only ever visits nodes one step off the correct branch, which reduces the size of the search space from being exponential in the number of steps to being linear.
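The size difference is easy to quantify. The sketch below counts tree nodes with and without the latch cut-off; the branching factor of 3 and depth of 6 are illustrative assumptions, and transpositions are ignored.

```python
def full_tree_nodes(branching, steps):
    """Nodes in the complete game tree down to the fixed step limit."""
    return sum(branching ** depth for depth in range(steps + 1))


def latch_pruned_nodes(branching, steps):
    """Root plus, at each step, one on-path child and (branching - 1)
    off-path children that are treated as terminal and never expanded."""
    return 1 + branching * steps


print(full_tree_nodes(3, 6), latch_pruned_nodes(3, 6))  # 1093 19
```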

In fact, the basic untwisty corridor is somewhat flawed, because all false choices lead to the SAME state (apart from the step counter), so it is trivially solved by transposition, provided the player maintains a transposition table of some sort. Furthermore, at only 6 steps, it is small enough to solve by brute force anyway! The game 'untwistycomplex2' is an attempt to address these issues, and a better test.

**Escort Latch Breakthrough**

The game Escort Latch Breakthrough is a variant of Breakthrough where each role has a king and 8 pawns. The goal is to get the King to the far end of the board. If both kings are captured the game is a draw. A typical MCTS player, without latch analysis, will usually move a pawn from in front of their king and then just advance the king right up to one rank before it can be captured by the opponent's pawns. At that point it's usually fairly trivial for the opponent to force capture it (kings cannot move backwards) and games between such players tend to end up as mutual king exchanges and therefore draws (for example, this game). The reason this happens is that the vulnerability of the king has very little impact on the ensemble of playout scores, because almost all playouts fail to bother capturing the kings, and MCTS convergence is slowed by the need to achieve convergence in subtrees following king capture.

With latch detection a king capture is seen as an achievable score range change (from [0,100] to [50,100] or [0,50] depending on which king is captured). This impacts the search in several ways:

- In states where one king has been captured, capturing the other is terminal, and can be propagated as if it were a win (even though it scores only 50), because it is the best achievable result
- Moves that capture/avoid capture of a king are heuristically selected more frequently, which speeds convergence
- Playouts are much more reasonable, since they do not include games in which a king is NOT captured if it could be (i.e. - where a king just marches through a defending line of pawns without being molested). This dramatically improves the quality of the Monte Carlo samples generated by the playouts. This gain far outweighs the cost in reduced playout count that the evaluation requires (which is actually quite substantial since all moves must be examined at each stage)

The result is that rather than optimistically advancing their king into danger, the player tends to hold it back, and work to capture its opponent's king first (see this game for example).

## Friday, August 1, 2014

### Puzzling it out

During the recent competition I promised to spill the beans on our Sudoku-enabling analysis, and other (mostly) puzzle-specific enhancements. Accordingly, this post will go through the more interesting techniques Sancho uses in puzzle solving (i.e. - 1-player games, or pseudo-1-player games, which I will define later).

**Plans**

The first observation that can be made universally about 1-player (deterministic, complete information) games, is that the player is entirely in charge of its own fate, and doesn't have to worry about what some other entity might do. This means that any solution path seen will always remain valid, if followed from that point onward. We therefore trap any sight of a solution state and commit the path we saw to it as a 'plan', after which, when asked for a move, we simply read the next move off the plan (hence the very fast response that Sancho exhibits in puzzles once it solves it, since no further searching is required).

Opportunities for caching plans arise both during tree node expansion (when a child turns out to be terminal and a solution), and during MonteCarlo playouts (from which we return the full path played in 1-player games).
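In outline, a plan cache needs very little machinery. The class below is a hypothetical sketch, not Sancho's code; it assumes the maximum score is known (100 by default), so a path is cached only when it is certain to be optimal.

```python
from collections import deque


class PlanCache:
    """Holds a known-optimal move sequence for a 1-player game."""

    def __init__(self, max_score=100):
        self.max_score = max_score
        self._plan = deque()

    def offer(self, moves, score):
        """Cache the path if it reaches the maximum score and no plan is held yet."""
        if score == self.max_score and not self._plan:
            self._plan = deque(moves)
            return True
        return False

    def has_plan(self):
        return bool(self._plan)

    def next_move(self):
        """Read the next move off the plan; no search needed."""
        return self._plan.popleft()


cache = PlanCache()
cache.offer(["up", "left"], 100)   # e.g. a playout that happened to hit a solution
print(cache.next_move())
```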

For many of the typical puzzles encountered, a solution will be found during meta-gaming time, so as soon as the game manager starts asking for moves, we will be able to feed them back as fast as it can consume them. In the recent tournament we were able to find and cache plans during meta-gaming for each of the following:

We were unable to find plans for Hunter or HunterBig, though for entirely different reasons. In Hunter the maximum achievable score is actually 87, though this is not (readily) inferrable from the rules, so although we find a maximum-scoring solution fast, we cannot be sure it is maximal, and so do not cache it (and instead continue to search normally each move). In HunterBig, repeat visits to a previously visited square are allowed (not terminal), and this dramatically increases the size of the search space (and prevents greedy rollout cutoffs), so we simply fail to find an optimal solution at all.

**Target States**

In some games it is possible to identify the state you need to reach to achieve a solution by analysis of the propnet. Specifically, it is possible to generate the set of all such states by fixing the winning goal value proposition to true and flowing backward through the propnet feeding that goal proposition, until base propositions are reached on all paths (encountering an OR [or an AND when the output is required to be false] causes the search to branch, generally leading to N solutions for an N-input OR).

For most games this is impractical since either the size of the set of solution states is extremely large (there are many possible solution states, only some of which may be reachable from the initial state of course), or the size of the set of interim propnet states encountered during the backward search blows up.

However, for some games it allows a state, or small set of states to be identified fairly easily. 8-puzzle (and the family of sliding block puzzles generally) is one such, where there is exactly one solution configuration by construction.
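The backward flow can be sketched on a toy expression tree. This is an assumption-laden miniature rather than a real propnet: logic is written as nested tuples, leaves are base proposition names, and each result is a partial assignment of base propositions that forces the goal true.

```python
def merge(a, b):
    """Combine two partial assignments, or return None if they conflict."""
    if any(a.get(k, v) != v for k, v in b.items()):
        return None
    return {**a, **b}


def satisfying_assignments(expr, value=True):
    """Yield partial base-prop assignments that force `expr` to `value`."""
    if isinstance(expr, str):
        yield {expr: value}
        return
    op, *args = expr
    if op == "NOT":
        yield from satisfying_assignments(args[0], not value)
    elif (op == "AND") == value:
        # AND required true, or OR required false: every input is constrained.
        partials = [{}]
        for arg in args:
            merged = []
            for p in partials:
                for q in satisfying_assignments(arg, value):
                    m = merge(p, q)
                    if m is not None:
                        merged.append(m)
            partials = merged
        yield from partials
    else:
        # OR required true, or AND required false: branch once per input.
        for arg in args:
            yield from satisfying_assignments(arg, value)


goal = ("AND", "a", ("OR", "b", ("NOT", "c")))
print(list(satisfying_assignments(goal)))
```

The blow-up mentioned above shows up directly here: every OR on the path multiplies the number of partial assignments being carried.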

Given a target state and a measure of distance between states, simple A* search can be used to identify the optimal path (strictly, the distance measure must be what is known as an 'admissible heuristic', which basically means it never over-estimates the number of moves needed to get between the two states). Using A* with a non-admissible heuristic will still work, but may examine many more intermediate states during the search, and may not find the shortest possible path, if multiple paths exist.

In situations where our meta-gaming MonteCarlo simulation shows no differentiation in scores of playouts (everything scores 0, or at least the same), and we can identify a target state (and a couple of other minor conditions necessary for there to be a decent chance of having a half-decent distance-between-states metric), we attempt to solve using A* during meta-gaming. If we fail we fall back on regular search. If we succeed we cache the plan found.

The state distance heuristic we use is just Hamming distance of the base proposition set (i.e. - how many of the base propositions are different between the states). This is an admissible heuristic for 8-puzzle.
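For readers who want to see the shape of this, here is a self-contained A* for 8-puzzle. It is a sketch, not Sancho's solver: it works on a tuple of tiles rather than on base propositions, and its Hamming-style heuristic counts misplaced tiles excluding the blank, which is my choice to keep it admissible in this representation.

```python
import heapq

GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)  # 0 is the blank


def hamming(state, goal=GOAL):
    """Number of non-blank tiles that are out of place."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)


def neighbours(state):
    """States reachable by sliding one tile into the blank."""
    i = state.index(0)
    row, col = divmod(i, 3)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            j = r * 3 + c
            nxt = list(state)
            nxt[i], nxt[j] = nxt[j], nxt[i]
            yield tuple(nxt)


def a_star(start, goal=GOAL):
    """Return a shortest list of states from start to goal, or None."""
    frontier = [(hamming(start, goal), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if cost > best_cost[state]:
            continue  # stale queue entry
        for nxt in neighbours(state):
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                priority = new_cost + hamming(nxt, goal)
                heapq.heappush(frontier, (priority, new_cost, nxt, path + [nxt]))
    return None


path = a_star((1, 2, 3, 4, 5, 6, 0, 7, 8))
print(len(path) - 1)  # 2
```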

To reduce the search space further in such cases, we also strip out any base propositions which are independent of the moves made (i.e. the control logic, and in particular the step counter), which means that moves which return to a previous state (apart from the step counter - so previous configuration of the tiles in 8-puzzle) are not considered. Strictly speaking, this leaves us solving only an approximation of the 'real' game, but it would require a fairly perverse goal specification for this to be a bad approximation. Note that we already had analysis to identify the control logic, as something similar is done to take those aspects out of the propnet prior to factorization attempts.
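The effect of stripping the control logic can be shown with plain sets. Proposition names here are invented for illustration; the point is only that two states differing solely in the step counter compare as the same configuration.

```python
def strip_control(state, control_props):
    """Drop move-independent propositions (e.g. the step counter)."""
    return frozenset(state) - frozenset(control_props)


def first_visit(state, control_props, seen):
    """True if this configuration has not been seen before, ignoring control props."""
    key = strip_control(state, control_props)
    if key in seen:
        return False
    seen.add(key)
    return True


control = {"step_1", "step_2", "step_3"}
seen = set()
print(first_visit({"tile_a_at_1", "step_1"}, control, seen))  # True
print(first_visit({"tile_a_at_1", "step_3"}, control, seen))  # False
```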

With the above it typically takes us a little under a second to solve 8-puzzle.

Extending this to 15-puzzle is somewhat more of a challenge (and on my list of things to look into). In particular this will require:

- Switching from A* to Iterative Deepening A* (IDA*) almost certainly (to keep memory usage within reasonable bounds)
- Identification of a much better heuristic than Hamming distance. If anyone is interested in sliding block puzzles generally, a good summary of solution methods may be found here. Identifying Manhattan distance from propnet analysis (or statistical analysis of simplified games) should be pretty easy, but although it's considerably better than Hamming distance, Manhattan distance will not be good enough to solve 15-puzzle in reasonable time. I plan to look into attempts to create small pattern databases on the fly from solutions to sub-games, but that's currently on the future-experiments list.
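For reference, Manhattan distance on a tile-tuple representation looks like this. It is written by hand for illustration; deriving it automatically from the propnet is the part that remains to be done.

```python
def manhattan(state, goal, width):
    """Sum over non-blank tiles of the grid distance to each tile's goal cell."""
    goal_cell = {tile: divmod(i, width) for i, tile in enumerate(goal)}
    total = 0
    for i, tile in enumerate(state):
        if tile == 0:          # the blank is not counted
            continue
        row, col = divmod(i, width)
        goal_row, goal_col = goal_cell[tile]
        total += abs(row - goal_row) + abs(col - goal_col)
    return total


goal8 = (1, 2, 3, 4, 5, 6, 7, 8, 0)
# Tiles 1 and 8 swapped: only 2 tiles misplaced, but each is 3 steps from home.
print(manhattan((8, 2, 3, 4, 5, 6, 7, 1, 0), goal8, 3))  # 6
```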

As an aside, I plan to experiment with removal of the control logic (for state comparison purposes) in puzzles generally. I suspect it would make the difference in finding a perfect solution in HunterBig, for example.

**Goal Latches**

A goal proposition, g, is said to be positively latched by a base proposition, b, if g can never become false once b becomes true. Similarly a goal proposition, g', is said to be negatively latched by a base proposition, b', if g' can never become true once b' becomes true.

Analysis of the propnet can often reveal such latches, and in some cases very simplistic analysis is enough. If latched goals are identified, then any state in which a goal becomes positively latched can be considered terminal for search purposes (the goal value is fixed in all successor states, so there is no point in searching further). The same applies if the 'winning' goal becomes negatively latched (and you're only interested in 'wins').

Very simple analysis of this nature yields immediate solutions for simple games like a couple of those that featured (or tried to in one case) in the recent tournament:

- untwisty corridor (this has no visualization, but basically it's a 'maze' where you seal your fate if you step off a single safe path)
- untwistycomplex2 - this is a much more subtle version of the above, which addresses some issues with the simple version (which is trivially solvable by transposition analysis, even if you know nothing about latches). Unfortunately the GDL they tried to use was broken. I have provided the Stanford guys with a corrected version (which Sancho duly solves), but as of this writing the Stanford repository has not been updated with it to my knowledge, so I cannot provide a link

**Move choice partitioning**

Sudoku presents an interesting problem. No target state is identifiable (finding one would amount to knowing a solution in fact!), the branching factor is huge, and no score signals are generated except by the (usually unique, but certainly very sparse) solution itself. For these reasons, any statistical search that doesn't somehow radically trim the search space is doomed from the outset.

Fortunately, the encoding of Sudoku makes the constraints implicit in the definition of the legals - that is it is only legal to play moves that do not immediately break one of the no-duplicates-in-a-row-or-column-or-square constraints. Having these constraints implicit in the encoding makes the problem a lot easier than it would otherwise be.

Several things become apparent from fairly simple analysis of the propnet:

- The 100-score goal is only achievable if no cell is marked with an X in the terminal state (that is, you've managed to apply legal numeric moves to every cell). That is to say some set X' of base props disable the solution, so in any solution state, S, S^X' = 0
- All cell state base propositions apart from those that mark the cells with an X are themselves latches (that is to say, once a cell is numerically marked it can never become unmarked)
- All X-markings are negative latches (once unset they cannot become set)
- The moves upon which X markings of cells are directly dependent (i.e. - those moves that can be reached from an X marking proposition by backward chaining through the propnet, without crossing a transition in the propnet apart from the base prop concerned's own transition) form a partitioning of the set of all moves. The equivalence sets of this partitioning are exactly the moves which change a single cell. Because of the negative latch nature of the X markings and the requirement that no X marked cells remain in the solution state, at least one move must be played from each partition.

In any game which exhibits the above properties, in any given turn it is only necessary to consider choices from one of the partitions of the move space. The ordering is arbitrary, but search is reduced by choosing the most constrained (smallest intersection of partition and current legal moves) at any given tree node. In Sudoku this means we'll consider the possibilities for the cell with the fewest legal numeric options still available to mark it with (often a single choice) in each state encountered during the search. The result is actually a very small search tree!
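The selection step itself is tiny once the partitions are known. In this sketch the partitions are handed in as a dictionary from cell to its set of moves, and cells already marked are passed as `satisfied`; both are assumptions about representation, not how Sancho stores them.

```python
def most_constrained_choices(partitions, legal_moves, satisfied):
    """Return the legal moves of the unsatisfied partition with the fewest options.

    An empty result means some cell can no longer be filled (a dead end).
    """
    best = None
    for cell, moves in partitions.items():
        if cell in satisfied:
            continue
        options = moves & legal_moves
        if best is None or len(options) < len(best):
            best = options
    # With every partition satisfied there is nothing to restrict.
    return best if best is not None else set(legal_moves)


partitions = {
    "c11": {("c11", 1), ("c11", 2), ("c11", 3)},
    "c12": {("c12", 1), ("c12", 2)},
    "c13": {("c13", 5)},
}
legal = {("c11", 1), ("c11", 2), ("c12", 2)}
print(most_constrained_choices(partitions, legal, satisfied={"c13"}))
```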

We implement all of the above with what we call state machine filters, which act as indirection layers through which isTerminal(), getGoal() and getLegalMoves() calls pass (and some latch-related queries which are not directly relevant here). Such a filter can reduce the set of legal moves that the underlying state machine generates and provide a subset to the search algorithm running above it (the most constrained partition in the above analysis in the case of the move-partitioning filter). It can also modify terminality and goal values as seen by the search algorithm, which is necessary for other uses (which will be the subject of a future post) not relevant to the Sudoku solve.

In general, given game G with generated state machine S, the concept of state machine filters allows us to apply the results of analysis (typically during meta-gaming) to the state machine to transform it from S to S' for the purposes of search. Some filters preserve game semantics accurately (the move choice partitioning one does for example), so G' = G even though S' != S. However, filters which transform the game itself (and thus have G' != G) are also of interest. Such a filter would cause you to solve an approximation of the actual game - this may be a good enough heuristic in itself, or may be part of a mechanism to apply 'strategic' guidance to a full search in G. By architecting the filters as a separate layer we can isolate the analysis and its use without requiring either the underlying state machine implementation to change, or the overlying search algorithm to change (once they are enabled to cope with the insertion of filters generically in the first place).
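The layering can be sketched as a plain wrapper. Everything here is hypothetical and in Python rather than Sancho's Java: a toy state machine, a pass-through filter, and a subclass that narrows the legal moves.

```python
class CountdownMachine:
    """Toy game: the state is an integer; moves subtract 1, 2 or 3 until it hits 0."""

    def is_terminal(self, state):
        return state <= 0

    def get_goal(self, state):
        return 100 if state == 0 else 0

    def get_legal_moves(self, state):
        return [m for m in (1, 2, 3) if m <= state]


class PassThroughFilter:
    """Indirection layer: the search talks to this instead of the machine."""

    def __init__(self, machine):
        self.machine = machine

    def is_terminal(self, state):
        return self.machine.is_terminal(state)

    def get_goal(self, state):
        return self.machine.get_goal(state)

    def get_legal_moves(self, state):
        return self.machine.get_legal_moves(state)


class MoveSubsetFilter(PassThroughFilter):
    """Exposes only a chosen subset of the underlying legal moves."""

    def __init__(self, machine, choose_subset):
        super().__init__(machine)
        self.choose_subset = choose_subset

    def get_legal_moves(self, state):
        legal = self.machine.get_legal_moves(state)
        # Assumption of this sketch: never hide every legal move.
        return self.choose_subset(state, legal) or legal


filtered = MoveSubsetFilter(CountdownMachine(), lambda state, legal: [max(legal)])
print(filtered.get_legal_moves(5), filtered.get_goal(0))
```

Neither the toy machine nor a search written against `get_legal_moves()` needs to know the filter is there, which is the point of keeping it as a separate layer.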

**Pseudo Puzzles**

Some (contrived!) N-player games are just N 1-player games bolted together (Dual Hunter for example). In such cases it is not necessary to search the combined game tree which includes the moves of other players, since their board states are irrelevant to our goals.

Sancho detects this by a dependency analysis on the goal propositions of its role, and if it finds that other roles are irrelevant (that is to say, their moves cannot impact our score), it entirely replaces the sub-game in which that other role is involved, with a sub-game that has exactly one move, which has no impact on any state. It also flags the game as a pseudo-puzzle (if only its own role is left after replacing irrelevant ones), which enables plan-caching. This is safe, because the planned sequence cannot be impacted by anything other roles do, and hence can be replayed regardless of their moves.
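The dependency check amounts to a reachability walk backward from our goal propositions. The graph below is a made-up miniature of a Dual-Hunter-style game; `depends_on` and `move_owner` are hypothetical stand-ins for what would really be read from the propnet.

```python
def relevant_roles(goal_props, depends_on, move_owner):
    """Roles whose moves can influence the given goal propositions."""
    seen, stack, roles = set(), list(goal_props), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in move_owner:
            roles.add(move_owner[node])
        stack.extend(depends_on.get(node, ()))
    return roles


def is_pseudo_puzzle(our_role, goal_props, depends_on, move_owner):
    """True if no other role's moves can reach our goals."""
    return relevant_roles(goal_props, depends_on, move_owner) <= {our_role}


depends_on = {
    "goal_red": ["board_red"],
    "board_red": ["move_red", "board_red"],
    "goal_blue": ["board_blue"],
    "board_blue": ["move_blue", "board_blue"],
}
move_owner = {"move_red": "red", "move_blue": "blue"}
print(is_pseudo_puzzle("red", ["goal_red"], depends_on, move_owner))  # True
```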

The net effect is that Sancho searches a greatly reduced search space. Multiknightstour, which is two 10x10 knight's tour puzzles bolted together in this way, was played on the second day of the 2014 championships, but sadly not in Sancho's bracket, as it would have benefited significantly from this analysis.

A related case occurs in Chinook, which is essentially two games of checkers, played on the odd and even squares simultaneously. In the Stanford version of Chinook (which is different to the base version, as found on Tiltyard) one player scores for captures they make on the even board, and the other for captures they make on the odd board. This actually makes the board your role does NOT score on totally irrelevant to your own goals. Since Sancho successfully factors Chinook into two independent games, it actually goes further in this case and doesn't bother searching one tree at all. Instead it just noops in that tree (which is rather unfortunate for its opponent given the way they score in Chinook!). I should note that I think Stanford Chinook is NOT a good game for tournaments in future unless the elective noop is removed. Even then I think that base.Chinook is a better factorized game test for competition purposes.
