Sunday, March 13, 2016

Alphago probably isn't learning from Lee Sedol

There has been quite a bit of discussion about whether Alphago can learn from the games it plays against Lee Sedol. I think not. At least, not directly. 
The heart of the program is the “policy network”, a convolutional neural network (CNN) of the kind designed for image processing. A CNN returns the probability that a given image belongs to each of a predefined set of classifications, like “cat”, “horse”, etc. CNNs work astonishingly well, but have the weakness that they can only be used with a fixed-size image and a fixed set of classifications.
The policy network views go positions as 19×19 images and returns the probabilities that a human player would make each of the 361 possible moves. These probabilities drive the Monte Carlo tree search for good moves that has been used for some time in go programs. The policy network is initially trained on 30 million positions (or moves).
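To make the shape of this concrete, here is a minimal sketch of a policy-style CNN: a fixed 19×19 input, a few convolutional layers, and a probability for each of the 361 points. This is purely illustrative and not DeepMind's architecture; the number of input planes, filters, and layers are made-up placeholders, and PyTorch is just a convenient stand-in.

```python
# Illustrative sketch only, not DeepMind's code: a small CNN that maps a
# 19x19 board to a probability for each of the 361 possible moves.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, planes=4, filters=32):   # plane/filter counts are assumptions
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(planes, filters, kernel_size=3, padding=1),  # fixed 19x19 "image" in
            nn.ReLU(),
            nn.Conv2d(filters, filters, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(filters, 1, kernel_size=1),                  # one score per board point
        )

    def forward(self, board):
        # board: (batch, planes, 19, 19) -> (batch, 361) move probabilities
        scores = self.body(board).view(board.size(0), -1)
        return torch.softmax(scores, dim=1)

net = PolicyNet()
priors = net(torch.zeros(1, 4, 19, 19))
print(priors.shape, float(priors.sum()))  # torch.Size([1, 361]) 1.0
```

Note how the fixed input size and fixed output set fall straight out of the architecture: everything is sized for exactly one board geometry and exactly 361 answers.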
CNN (aka “deep learning”) behavior is pretty well understood. The number of samples required for learning depends on the complexity of the model. A model of this complexity probably requires tens of thousands of example positions before it changes much.
The number of samples required to train any machine learning program depends on the complexity of the strategy, not on the number of possible positions. For example, Gomoku ("five in a row", also called gobang) on a 19×19 board would take many fewer examples to train than go would, even though the number of possible positions is also very large.
Another point: Any machine learning algorithm will eventually hit a training limit, after which it won’t be able to improve itself by more training. After that, a new algorithm based on a new model of game play would be required to improve the play. It is interesting that the Alphago team seems to be actively seeking ideas in this area. Maybe that is because they are starting to hit a limit, but maybe it's just because they are looking into the future.
So Alphago probably can’t improve its play measurably by playing any single player five times, no matter how strong; training heavily on so few games from one opponent would just be “overfitting”. Instead, the team will be learning from the comments of the pro players and modifying the program to improve it.
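Some rough arithmetic shows why five games barely register against the original training set. The moves-per-game figure below is my own assumption, not a number from the match.

```python
# Back-of-the-envelope: positions added by five games vs. the initial training set.
# 250 moves per game is a rough guess.
initial_positions = 30_000_000
new_positions = 5 * 250
print(new_positions)                              # 1250
print(100.0 * new_positions / initial_positions)  # ~0.004 (percent of the training set)
```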
Interesting tidbit: Alphago estimated the chance of a human playing move 37 in game 2 at 1 in 10,000. So the policy network doesn’t decide everything.

Alphago is a learning machine more than a go machine

The key part of Alphago is a convolutional neural network. These are usually used for visual tasks like recognizing cat pictures, and progress in the last five years has been incredible.
Alphago went from the level of a novice pro last October to world champion level for this match. It did so by playing itself over and over again.
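Very schematically, self-play improvement looks like the loop below. The function names and no-op update are stand-ins of my own, not DeepMind's pipeline; the point is only that the program generates its own training data by playing both sides of each game and reinforcing the winner's moves.

```python
import random

def play_game(policy_black, policy_white):
    """Toy stand-in: pretend a game was played; return each side's moves and the winner."""
    return ["move"] * 100, ["move"] * 100, random.choice(["black", "white"])

def update_toward(policy, moves, reward):
    """Toy stand-in for a policy-gradient step that makes `moves` more or less likely."""
    pass

def self_play_training(policy, num_games=1000):
    for _ in range(num_games):
        moves_black, moves_white, winner = play_game(policy, policy)  # plays itself
        winning = moves_black if winner == "black" else moves_white
        losing = moves_white if winner == "black" else moves_black
        update_toward(policy, winning, reward=+1)  # reinforce the winner's moves
        update_toward(policy, losing, reward=-1)   # discourage the loser's moves
    return policy
```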
Chess programs are well understood because they are programmed by humans. Alphago uses an algorithm to pick a winning move in a given go position. But the heart of the program is the learning program that finds that algorithm, not the algorithm itself.
Go programs made steady progress for a decade with improved tree pruning methods, which reduce the total number of positions the program has to evaluate. The cleverest of these is the Monte Carlo method, which rates moves by playing out games at random rather than examining every continuation.
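Here is a minimal sketch of that Monte Carlo idea, using the toy game of Nim instead of go so it stays self-contained: rate each legal move by playing many games out at random from it and counting wins. Real go programs build a search tree on top of this, but the random playout is the core trick.

```python
import random

# Monte Carlo move evaluation, illustrated on Nim (take 1-3 stones; whoever takes
# the last stone wins). Not a go program, just the same sampling idea in miniature.

def legal_moves(stones):
    return [n for n in (1, 2, 3) if n <= stones]

def random_playout(stones, my_turn):
    """Play random moves until the stones run out; return True if 'I' won."""
    while stones > 0:
        stones -= random.choice(legal_moves(stones))
        my_turn = not my_turn
    # The player who just moved took the last stone and won.
    return not my_turn

def best_move(stones, playouts=5000):
    scores = {}
    for move in legal_moves(stones):
        wins = sum(random_playout(stones - move, my_turn=False) for _ in range(playouts))
        scores[move] = wins / playouts
    return max(scores, key=scores.get), scores

print(best_move(10))  # from 10 stones, taking 2 (leaving a multiple of 4) should score best
```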