The use of reinforcement learning to program the artificial intelligence of games probably has as one of its most impressive achievements the development of a program that could play backgammon at a professional level by Gary Tesauro from IBM during the 90s. The first attempt was to build a model of the many states of the game based on a collection of examples and try to generalize from those using supervised learning. Besides being a tedious job for the researchers to enumerate hundreds of different states, the program was performing poorly.

Using a different approach, this time with reinforcement learning, the developers made two versions of their game play against each other. At the end of each game, the winning one received a positive reward and the losing one a negative. After 200.000 matches, the agent was able to play at the level of some of the best players in the world, without the need of any kind of professional experience or describing the more than 1 trillion possible states of the game.

In an attempt to recreate the experiment with a simpler game, the idea was to develop an intelligent agent for Tic-Tac-Toe, without the need to write an algorithm with the best strategies or enumerating all the possible states of the game.

Analysing a few matches of Tic-Tac-Toe, it's possible to verify a few of its properties:

- A player can either win, lose or tie;
- Matches between two players that know the best plays always end in ties;
- Many of the different states are equivalent - by rotating the board by 90 degrees and mirroring it, it's is possible to find up to 8 equivalent states.

**Image 1 - Example of equivalent states**

Knowing about the equivalence between some states, it's easier to classify them all and a fewer number of games is needed to visit all the states at least once. Without taking the equivalence into consideration, having nine squares that can be each filled in three different ways (X, O or empty) results in 3^{9} = 19,638 possible states (counting the ones that can never be achieved on a normal match, like all 9 squares filled with circles). Counting only the valid ones and considering the equivalence property, only 765 states are needed to cover the entire game.

Now, two versions of the game should play against each other. Every state each player passed through should receive a score according to the results of the match. For instance, if the first player started by checking the center and eventually won the game, the state that describes the board with only the first player on the center should receive a positive score. Similarly, if the second player marked one of the corners and later lost, the state with the first player on the center and the second player in one of the corners (four equivalent states in this example) should receive a negative score. It was still necessary to create a function that could properly score the states.

Firstly, the score should consider the final result of the match. Winning should be better than tying or losing, and a tie should be better than a loss.

Also, the first and last plays of each player should have different weights, since the last ones define who won or lost.

Lastly, the number of turns each player had until the end of the game should be taken in consideration, because winning on the third play should be better than winning on the fifth.

The score should then be a function of the result *(R)*, at how many turns until the end of the match that state happened *(i)* and how many turns the player had during the game *(d)*.

After testing a few combinations and changing the weight of each variable, a function that had satisfying results was:

where *R* equals 7 on a victory, 2 on a tie and -2 on a loss, *i* varies between 0 and 4 and *d* varies between 3 and 5.

After 100,000 training matches, the agent should now have the ability of choosing a play in every situation that would lead him to a victory or, on the worst cases, a tie.

To test its efficiency, volunteers from our office tried to play against it, with every game ending in either a tie or a win by the program.

To confirm its capability, two versions of the game were again put to play against each other, this time with one playing at random and the other with the acquired knowledge. The results were as expected: after 50,000 matches, the intelligent agent won about 91% of them and tied the other 9%, without losing once, with an average of ten wins every eleven games.

##### Starting testing TicTacToe with 50000 games

done with 5000 games @ 14.980499 seconds: wins=>4533, ties=>467, losses=>0

done with 10000 games @ 29.581513 seconds: wins=>9081, ties=>919, losses=>0

done with 15000 games @ 45.025361 seconds: wins=>13639, ties=>1361, losses=>0

done with 20000 games @ 59.323242 seconds: wins=>18215, ties=>1785, losses=>0

done with 25000 games @ 73.78854 seconds: wins=>22771, ties=>2229, losses=>0

done with 30000 games @ 88.776816 seconds: wins=>27318, ties=>2682, losses=>0

done with 35000 games @ 109.011402 seconds: wins=>31877, ties=>3123, losses=>0

done with 40000 games @ 126.042564 seconds: wins=>36422, ties=>3578, losses=>0

done with 45000 games @ 141.150789 seconds: wins=>40951, ties=>4049, losses=>0

done with 50000 games @ 156.276316 seconds: wins=>45494, ties=>4506, losses=>0

##### Starting testing TicTacToe with 50000 games

done with 5000 games @ 17.896717 seconds: wins=>4553, ties=>447, losses=>0

done with 10000 games @ 35.501926 seconds: wins=>9087, ties=>913, losses=>0

done with 15000 games @ 52.823191 seconds: wins=>13648, ties=>1352, losses=>0

done with 20000 games @ 68.256094 seconds: wins=>18178, ties=>1822, losses=>0

done with 25000 games @ 83.470117 seconds: wins=>22733, ties=>2267, losses=>0

done with 30000 games @ 99.914703 seconds: wins=>27258, ties=>2742, losses=>0

done with 35000 games @ 117.766631 seconds: wins=>31788, ties=>3212, losses=>0

done with 40000 games @ 135.225512 seconds: wins=>36350, ties=>3650, losses=>0

done with 45000 games @ 156.867067 seconds: wins=>40899, ties=>4101, losses=>0

done with 50000 games @ 174.201975 seconds: wins=>45441, ties=>4559, losses=>0

**Conclusion**

Machine learning techniques have huge versatility, being found in many different fields of knowledge. Capable of generalizing complex cases or finding methods to deal with situations that are only superficially known, it's expected that it's applications open the way to improvements not only on robotics or computing, but also on fields like health or entertainment.