In this article we discuss the process of machine learning through game playing.

One of the earliest game-playing programs, created by Samuel in 1963, learned to play checkers well enough to beat its creator. Samuel’s program began as a novice, but after only a few days of self-play it was able to compete on equal terms in some very strong human tournaments.

The characteristics of checkers which made it an ideal test bed for the first learning program include:

1. The activity must not be deterministic in the practical sense. Samuel computed the number of choices of moves required to explore every complete path in checkers as about 10^40. At 3 choices per nanosecond this would require on the order of 10^21 centuries of computer time.

2. A definite goal must exist and at least one measure for determining progress towards the goal must exist. The goal is to win and an obvious measure of progress is the number of pieces of each colour on the board.

3. The rules of the activity must be definite and they should be known.

4. There should be a background of knowledge concerning the activity against which the learning progress can be tested.

5. The activity should be one which is familiar to a substantial body of people so that the behaviour of the program can be made understandable.

The first task was to incorporate the rules of the game. Next, Samuel generated possible moves down the game tree to a minimum number of ply and evaluated each of the resulting board positions with a scoring polynomial (static evaluation function).

His original evaluation function was a linear polynomial of four terms (a simple weighted-sum sketch follows the list):

(1) Piece advantage,

(2) Denial of occupancy,

(3) Mobility, and

(4) A hybrid term which combined control of the center and piece advancement.
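Such a linear scoring polynomial can be pictured as a simple weighted sum of feature values. The sketch below is only illustrative: the feature names and coefficient values are placeholders, not Samuel's actual terms or weights.

```python
def static_evaluation(features, weights):
    """Score a board position as a weighted sum of its feature values."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature measurements for one board position.
features = {
    "piece_advantage": 2,          # e.g. two more pieces than the opponent
    "denial_of_occupancy": 1,
    "mobility": 5,
    "center_control_advance": 3,   # the hybrid centre-control / advancement term
}

# Hypothetical coefficients; Samuel's learning procedures adjusted values like these.
weights = {
    "piece_advantage": 10.0,
    "denial_of_occupancy": 2.0,
    "mobility": 1.5,
    "center_control_advance": 1.0,
}

print(static_evaluation(features, weights))   # 32.5
```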

The most effective look-ahead strategy was based on a quiescent heuristic. The minimum look-ahead distance was generally set at three. Unless one of the following three conditions was satisfied, the evaluation function was applied and the optimum move was selected using an alpha-beta mini-max procedure.

The three conditions causing exceptions to this rule were:

1. The next move is a jump,

2. The last move was a jump, or

3. An exchange offer is possible.

If any of these exceptions applied, the look-ahead continued to four ply. At ply 4 the evaluation function was applied if conditions (1) and (3) were not met. At a ply of five or greater, the program stopped the look-ahead whenever the next ply level did not offer a jump. At a ply of eleven or greater, the program terminated the look-ahead, even if the next move was to be a jump, if one side was ahead by more than two kings. The program stopped at 20 plies regardless of all conditions.
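The depth-control rules above can be summarised in a short sketch. The boolean arguments stand in for the real board tests (jump detection, exchange offers, king count), which are assumptions here for illustration only.

```python
MAX_PLY = 20

def should_stop_search(ply, next_is_jump, last_was_jump, exchange_offer, ahead_two_kings):
    """Decide whether to stop the look-ahead and apply the static evaluation."""
    if ply >= MAX_PLY:
        return True                      # absolute limit regardless of conditions
    if ply < 3:
        return False                     # always search at least three ply
    if ply == 3:
        # Stop unless one of the three exception conditions holds.
        return not (next_is_jump or last_was_jump or exchange_offer)
    if ply == 4:
        # Stop unless a jump is imminent or an exchange offer is possible.
        return not (next_is_jump or exchange_offer)
    if ply >= 11 and ahead_two_kings:
        return True                      # cut off even if the next move is a jump
    # From ply five onward, stop whenever the next level offers no jump.
    return not next_is_jump

# Example: at ply 5 with a jump pending, the search continues.
print(should_stop_search(5, next_is_jump=True, last_was_jump=False,
                         exchange_offer=False, ahead_two_kings=False))  # False
```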

This strategy proved effective for exploring promising paths to a greater depth (mitigating the horizon effect) and for accommodating the restriction on branching imposed by a jump situation. The actual choice of ply cut-offs was restricted by the memory limitations of the IBM 704 on which the program was implemented. The system had 10,000 words of main memory, magnetic tape for long-term storage, and a 0.000001 GHz processor. The program still remains one of the great feats of AI.

Rote Learning in Checkers Player:

The first of three learning modes Samuel incorporated in his checkers player is what he called rote learning. Programs which learn by the direct implanting of new knowledge (for example by being reprogrammed or being supplied with new data) are performing no inference at all.

This is usually referred to as rote learning: a reasonable human analog would be the memorisation of multiplication tables. By rote learning he meant that the computationally expensive results of evaluating a particular board position can be indexed and saved for future reference, improving later play.

This is equivalent to memorising the value to the machine of that board position and greatly increases the look-ahead power of the system. If, for instance, board position X has been evaluated to a depth of three ply in a given game and is encountered in a later game at the end of a three-ply search from board position Y, the effect is to provide Y with a six-ply search through the path containing X. Board position Y might be generated in a later search from position Z and provide a 9-ply search along the X-Y path, and so on.

Consider, for example, checkers play using the mini-max search technique. Due to time constraints the program searches up to a fixed ply, depending on the situation. When it finishes searching the tree, the static evaluation function is applied at each board position and the scores are propagated backward to the root of the tree, as in Fig. 9.2(a). After a few moves, the situation in the game is represented by Fig. 9.2(b). Instead of using the static evaluation function to recompute a score for position A, the stored value is used. This creates the effect of having searched through the equivalent additional plies.
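A minimal sketch of this idea treats rote memory as a cache keyed on the board position; the board representation and the evaluator passed in are hypothetical placeholders, not Samuel's actual data structures.

```python
# Rote memory as a cache of previously evaluated positions. Board positions
# are assumed to be hashable (e.g. a tuple); evaluate_to_depth stands in for
# the expensive look-ahead plus static evaluation.

rote_memory = {}   # board position -> stored score

def evaluate_with_rote_memory(position, depth, evaluate_to_depth):
    """Reuse a stored score if this position was seen before, else compute and store it."""
    if position in rote_memory:
        # The stored score already reflects an earlier search below this
        # position, so the effective look-ahead depth is extended.
        return rote_memory[position]
    score = evaluate_to_depth(position, depth)
    rote_memory[position] = score
    return score

# Hypothetical usage with a dummy evaluator.
dummy_eval = lambda pos, depth: len(pos)          # placeholder scoring
print(evaluate_with_rote_memory(("w", "b", "-"), 3, dummy_eval))
print(evaluate_with_rote_memory(("w", "b", "-"), 3, dummy_eval))  # served from memory
```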

Rote learning provided another advantage for improving the performance of the checkers player. Samuel discovered that it was necessary to impart a sense of direction to his program to press it toward a win. For instance, from a given board position a large number of paths on the game tree may all lead to a win, but with a considerable range in the number of plies required to reach the winning state.

Optimum play from this position requires the shortest path to the win. By storing the ply value for each board position it had evaluated, Samuel used rote learning to direct the program to push for a win or, if left behind, to adopt delaying tactics.

A final feature which rote learning provided was the mechanism for forgetting. Each new board position was arbitrarily assigned an age. When a board position was referenced to update its score or to use it in the look-ahead process, the age recorded for this position was divided by two. This process was called refreshing the memory. Approximately every twenty moves, each stored board position was automatically aged by one unit.

When the age of a given position reached an arbitrary maximum value the position was deleted from memory (i.e., forgotten). Forgetting was particularly important for the limited-capacity machine Samuel used for the checkers player. Even with machines of virtually unlimited memory capacity, forgetting is a valuable concept for increasing the efficiency of computation.
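The refresh-and-age bookkeeping might be sketched as follows; the starting age and maximum age used here are illustrative values, not Samuel's exact parameters.

```python
MAX_AGE = 32                # positions older than this are forgotten

memory = {}                 # position -> {"score": ..., "age": ...}

def store(position, score, initial_age=16):
    memory[position] = {"score": score, "age": initial_age}

def refresh(position):
    """Referencing a position halves its age ('refreshing the memory')."""
    memory[position]["age"] //= 2

def age_all():
    """Called roughly every twenty moves: every stored position grows older."""
    for position in list(memory):
        memory[position]["age"] += 1
        if memory[position]["age"] >= MAX_AGE:
            del memory[position]          # forgotten

# Hypothetical usage.
pos = ("w", "b", "-")
store(pos, 5.0)
refresh(pos)      # age 16 -> 8
age_all()         # age 8 -> 9
```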

Samuel used several techniques for training the program in the rote learning mode. These included play against itself, play against many different humans including master players, and use of published play by masters (book games).

In his own evaluation of the success of the rote learning mode, Samuel states:

“The program learned to play a very good opening game and to recognise most winning and losing end positions many moves in advance, although its midgame play was not greatly improved. This program now qualifies as a rather better-than-average novice, but definitely not as an expert.”

This performance was based on a memory tape containing approximately 53,000 board positions. Samuel estimated that significant improvement of the midgame play would require storing at least twenty times that many board positions. Although it was infeasible to devote the expensive computers of the day to this task, Samuel noted that the computer would require less time to become a master through rote learning than a human master player would.

Learning by Generalisation:

(a) Linear Evaluation Function:

An obvious way to improve the efficiency of a learning program is to generalise on the basis of experience and save only the generalisations. This corresponds to a higher level of abstraction in representing the inference process. Samuel chose modification of the coefficients of the evaluation polynomial as the representation for this generalisation.

The evaluation function included the piece-advantage term as before and was revised to include a subset of 16 terms with variable coefficients, drawn from a reserve inventory of 38 parameters. The central idea of the learning process was to determine the highest point in the multidimensional scoring space as a function of the 16 coefficients.

In order to generalise on the basis of self-play, the program was structured to act as two different players called Alpha and Beta. Alpha generalised after each move by adjusting the coefficients of its linear evaluation function and replacing terms which seemed to have little importance by new parameters drawn from the reserve inventory. Beta used the same evaluation function for the duration of a game.

After each game, the relative playing ability of Alpha and Beta was compared by a neutral evaluation function. If Alpha won or was ahead when play terminated, Alpha’s evaluation function was transferred to Beta. If Beta won or was ahead, Alpha was given a black mark. After three black marks, Alpha was assumed to be on the wrong track and its scoring polynomial was drastically revised, usually by reducing the coefficient of the leading term to zero.

The three black marks by Alpha indicated that it had reached a secondary maximum in the hill-climbing process in coefficient space (i.e., a false summit). The drastic revision of the scoring function effectively jumps to a new region in coefficient space in an attempt to continue improving Alpha’s performance.
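The Alpha/Beta tournament bookkeeping can be sketched as below. Alpha's per-move coefficient adjustment is assumed to happen inside the hypothetical play_game routine; drastically_revise stands in for the drastic revision of the scoring polynomial, and the reset of black marks after a win is an assumption for this sketch.

```python
def self_play_training(alpha_weights, beta_weights, play_game, drastically_revise,
                       n_games=100):
    """Run a series of Alpha-vs-Beta games, promoting or penalising Alpha."""
    black_marks = 0
    for _ in range(n_games):
        # play_game is assumed to return a positive value when Alpha wins or
        # is ahead at termination, and a non-positive value otherwise.
        result = play_game(alpha_weights, beta_weights)
        if result > 0:
            beta_weights = dict(alpha_weights)   # Beta adopts Alpha's polynomial
            black_marks = 0
        else:
            black_marks += 1
            if black_marks >= 3:
                # Alpha is presumed stuck on a false summit in coefficient
                # space; jump elsewhere, e.g. by zeroing the leading term.
                drastically_revise(alpha_weights)
                black_marks = 0
    return alpha_weights, beta_weights

# Hypothetical usage with dummy placeholders.
dummy_game = lambda a, b: -1                          # Alpha always ends up behind
dummy_revise = lambda w: w.update(leading_term=0.0)   # zero the leading coefficient
alpha, beta = self_play_training({"leading_term": 5.0}, {"leading_term": 5.0},
                                 dummy_game, dummy_revise, n_games=4)
print(alpha)
```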

To determine how much each coefficient should be changed after each move, an “error function” Δ is computed for each position. The function Δ is the difference between a performance standard and Alpha’s estimate of the value of the present board position. The performance standard is computed by a deeper mini-max search from the present position, using the present Alpha evaluation function.

If Δ is negative, Alpha’s polynomial is overestimating the position value. If Δ is positive, Alpha is underestimating it. By careful tabulation of the features of the present position, a correlation coefficient may be computed which predicts the dependence of Δ on a given feature. Learning is achieved when the existing Alpha polynomial yields the same position value as the performance standard would give, that is, when Δ is minimised.
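A sketch of the error term, assuming hypothetical deep_search and static_value routines for the deeper mini-max backup and Alpha's direct estimate, is:

```python
def delta(position, weights, static_value, deep_search):
    """Error between the performance standard and Alpha's shallow estimate."""
    # Performance standard: the value backed up by a deeper mini-max search.
    performance_standard = deep_search(position, weights)
    # Alpha's direct estimate from its current scoring polynomial.
    shallow_estimate = static_value(position, weights)
    # delta > 0: Alpha is underestimating; delta < 0: Alpha is overestimating.
    return performance_standard - shallow_estimate
```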

A large positive correlation coefficient occurs for features which, if not weighted heavily enough, would lead to a large positive Δ value. If the correlation coefficient turns out to be negative, the term is contributing with the wrong sign and a reversal of sign is indicated.

The correlation coefficient also provides a useful tool for identifying terms of little significance or importance in their predictive value. After each move, the term with the lowest correlation coefficient is identified and a “low-term tally” is incremented for this term.

When the low-term tally reaches some arbitrary number (between 8 and 32) the term is replaced by the top term from the reserve list and added to the bottom of the reserve list. In this way, the least significant terms are eliminated from the evaluation function. The effect of the process is to select the most significant terms from the inventory of parameters.
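A sketch of the low-term tally and replacement step is given below. The term names are hypothetical, the threshold of 8 is the low end of the range mentioned above, and treating "lowest correlation" as lowest magnitude is an assumption of this sketch.

```python
from collections import defaultdict

LOW_TERM_LIMIT = 8                      # low end of the 8-to-32 range mentioned above
low_term_tally = defaultdict(int)

def update_term_pool(correlations, active_terms, reserve_terms):
    """After a move, tally the weakest active term and swap it out if it stays weak."""
    # Treat small-magnitude correlation as low predictive significance.
    weakest = min(active_terms, key=lambda t: abs(correlations[t]))
    low_term_tally[weakest] += 1
    if low_term_tally[weakest] >= LOW_TERM_LIMIT:
        active_terms.remove(weakest)
        active_terms.append(reserve_terms.pop(0))   # promote the top reserve term
        reserve_terms.append(weakest)               # retire the weak term to the bottom
        low_term_tally[weakest] = 0
    return active_terms, reserve_terms

# Hypothetical term names for illustration.
active = ["mobility", "back_row_control", "fork_threat"]
reserve = ["piece_advancement"]
update_term_pool({"mobility": 0.9, "back_row_control": 0.01, "fork_threat": 0.4},
                 active, reserve)
```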

Samuel concluded that the experiment to learn by generalisation of the evaluation function did prove that learning took place.

He says:

“The program appeared to be approaching the quality of play which caused it to be described as a ‘better-than-average player.’ A detailed analysis of these results indicated that the learning procedure did work and that the rate of learning was surprisingly high, but that the learning was quite erratic and none too stable.”

At the heart of the problem is the fact that the evaluation function used was linear in the terms evaluated. That is, the errors in Alpha’s estimation were attributed to errors in the assigned values of the term coefficients independently. Such a linear evaluation function ignores the very real correlations between parameters.

For instance, the coefficient multiplying the number of kings on the board must depend very strongly on their board position, mobility, number of men on the board and so on. The final learning strategy is an attempt to solve this flaw by introducing correlations between parameters.

(b) Correlated Evaluation Function:

The correlations between the n parameters of the evaluation function may be represented by an n-dimensional array which Samuel called a signature table. The axis of each dimension is subdivided into cells, typically three, corresponding to +1 (position good for the program), 0 (position neutral for the program) and -1 (position bad for the program). If, for instance, the evaluation function was based on just three parameters, each of which could take three values, the evaluation space would resemble the structure of a Rubik’s cube with a total of 27 cells.

The concept is that each board position may be characterised by assignment to one of these cells. The contents of the cell represent the static evaluation function value for that particular combination of features. To train the system as to what the contents of each cell should be, Samuel used book games recorded between master checkers players to present 250,000 board situations to the system.
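As a sketch, a three-parameter version of such a table (27 cells, as described above) might be indexed as follows; the quantisation rule used here is an illustrative assumption rather than Samuel's actual feature ranges.

```python
import itertools

# 3 x 3 x 3 = 27 cells, each holding the learned value for one combination
# of quantised features.
signature_table = {cell: 0.0 for cell in itertools.product((-1, 0, 1), repeat=3)}

def quantise(value):
    """Map a raw feature value onto the three cells of its axis (an assumed rule)."""
    return -1 if value < 0 else (1 if value > 0 else 0)

def table_value(raw_features):
    """Characterise a position by its cell and return that cell's stored value."""
    cell = tuple(quantise(v) for v in raw_features)
    return signature_table[cell]

print(table_value((2, 0, -5)))   # looks up cell (1, 0, -1)
```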

Each cell j of the signature table has two registers associated with it: Aj for agree and Dj for disagree. For each book move there is one cell which represents the move selected by the master players, together with a number of other cells representing allowed alternatives which they did not select.

If a particular cell was an allowed cell which was not the cell chosen, the Dj counter was incremented by one. The Aj counter for the cell selected by the master players was incremented by n, the number of non-selected, allowed moves.

The associated correlation coefficient Cj for cell j is then computed from these two counts.
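A form consistent with this description (and with Samuel's published account) is the normalised difference between the agree and disagree counts:

Cj = (Aj - Dj) / (Aj + Dj)

which ranges from -1 (always a rejected alternative) to +1 (always the book move). A sketch of the bookkeeping, under that assumed form, is:

```python
from collections import defaultdict

agree = defaultdict(int)      # A_j: credit when cell j held the move the masters chose
disagree = defaultdict(int)   # D_j: debit when cell j held an allowed but unchosen move

def record_book_position(chosen_cell, alternative_cells):
    """Update the agree/disagree registers for one book move and its alternatives."""
    agree[chosen_cell] += len(alternative_cells)   # incremented by n, as described above
    for cell in alternative_cells:
        disagree[cell] += 1

def correlation(cell):
    """Normalised agree/disagree difference (assumed form of C_j)."""
    a, d = agree[cell], disagree[cell]
    return 0.0 if a + d == 0 else (a - d) / (a + d)
```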

This type of learning can be considered as learning by examples or book learning in which the book values are statistically averaged to provide instruction to the computer on the optimum move for each category of board positions.

Since the number of dimensions in parameter space was 24, Samuel found it necessary to develop a hierarchy of signature tables to provide data compression (otherwise the 24-dimensional table would have required 6 x 10^12 cells). Through the inclusion of correlations between evaluation function parameters, the signature table system provided a considerable improvement over both the rote learning technique and the linear evaluation function generalisation.

Although this final version of Samuel’s checkers player played at a championship level, it lost to the 1965 world champion by a score of 0-4. While the checker-playing program written by Arthur Samuel was no doubt the first significant application of reinforcement learning, there were some significant differences between this program and current methods.

First, he updated the weights using the difference between the evaluation of the current state and the backed-up value generated by full look-ahead in the search tree. A second difference was that the program did not use any observed rewards.

That is, the values of terminal states were ignored. This means that it is quite possible for Samuel’s program not to converge, or to converge on a strategy designed to lose rather than win. He managed to avoid this fate by insisting that the weight for material advantage should always be positive. Remarkably, this was sufficient to direct the program into areas of weight space corresponding to good checker play.
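A rough sketch of this style of update, using a simple gradient-style correction rather than Samuel's actual correlation procedure, and with the material-advantage weight clamped positive as described, might look like this (the term names are hypothetical):

```python
def update_weights(weights, features, backed_up_value, learning_rate=0.01):
    """Move the coefficients so the static estimate tracks the backed-up search value."""
    estimate = sum(weights[t] * v for t, v in features.items())
    error = backed_up_value - estimate          # no terminal reward appears anywhere
    for term, value in features.items():
        weights[term] += learning_rate * error * value
    # Samuel's constraint: the weight for material advantage must stay positive.
    if weights.get("piece_advantage", 0.0) <= 0.0:
        weights["piece_advantage"] = 1e-6
    return weights

# Hypothetical usage.
w = {"piece_advantage": 10.0, "mobility": 1.5}
f = {"piece_advantage": 2, "mobility": 5}
print(update_weights(w, f, backed_up_value=30.0))
```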