Read the latest articles about Programming at

Unsupervised learning shows which senators vote alike
Gravatar published this on

Data Science Programming

On this article, I get into details of how to write a program capable of grouping Brazilian senators according to their votes. All the way from getting the necessary info, going through some code and math, and finally getting to the Senate's inner circles.


Brazilian's legislative scenery is complex and varies now and then. Basic comprehension of how our representatives behave is usually challenging even to the most well-informed of us. To the ordinary citizen, this information may not arrive, and if it does, it's usually uninterpretable or too prepared.

Luckily, law enforces our government's transparency and accountability; that valuable information can be found on online portals like the one about the Senate. Don't get too excited, though, as information access is another right we've been struggling to wholly get in Brazil.

I decided to start this activity during internship at Infosimples in order to extract impartial yet meaningful conclusions from politicians' votes. It was worth the effort, since it put under test some concepts seen at Udacity's Intro to Artificial Intelligence (well, it also brought up curious readers like you! Leave some feedback later).


Artificial Intelligence for Tic-Tac-Toe
Gravatar published this on

Data Science Programming

The use of reinforcement learning to program the artificial intelligence of games probably has as one of its most impressive achievements the development of a program that could play backgammon at a professional level by Gary Tesauro from IBM during the 90s. The first attempt was to build a model of the many states of the game based on a collection of examples and try to generalize from those using supervised learning. Besides being a tedious job for the researchers to enumerate hundreds of different states, the program was performing poorly.

Using a different approach, this time with reinforcement learning, the developers made two versions of their game play against each other. At the end of each game, the winning one received a positive reward and the losing one a negative. After 200.000 matches, the agent was able to play at the level of some of the best players in the world, without the need of any kind of professional experience or describing the more than 1 trillion possible states of the game.   

In an attempt to recreate the experiment with a simpler game, the idea was to develop an intelligent agent for Tic-Tac-Toe, without the need to write an algorithm with the best strategies or enumerating all the possible states of the game.


String Segmentation using Machine Learning
Gravatar published this on

Data Science Programming

Artificial Intelligence has brought lots of improvements to problems related with language processing, especially in the field of language translation. Other examples include suggestions to correct misspelled words or even speech recognition.

To test a simple algorithm that works with elements of a language, the developed program had the objective of being able to split a string without spaces into the words that compose that string. For instance, receiving the string "thisishowitshouldwork" should result in "this is how it should work".

The problem consists in being able to identify words, since it's impossible to list all the known english words and there is no model that can correctly generalize the structure of every word.

The approach was to use supervised learning and a database of texts as training data so that the agent could start to identify the different words that exist in the english language. By utilizing texts instead of just a list of thousands of different words, it's possible to identify the frequency in which every word tend to appear. Knowing the probability of each word showing up, the agent should be able to calculate the most probable segmentation of every string.

It's viable to say that the probability of a certain segmentation depends on the probabilities of each word obtained by that segmentation. By considering the words that show up before the one we are analysing, the results could be even more accurate but, with the relatively small quantity of training data available, it was only taken in consideration the individual probability of each word.

Another point that has to be taken into consideration is how to deal with words absent from the training data. It's impossible to say that any training database contain every known word, no matter how big it its. This issue was adressed by using Laplacian Smoothing to avoid unknown probabilities to being equal to zero. At last, the calculations should be normalized so it's possible to compare segmentations between a different number of words.

The probability of a single word can be expressed by:

where n is the number of times the word appeared on the training data, T the total of analysed words, k is a constant and V is the estimated total number of words on the english vocabulary. On this problem it was used k = 1 and V = 2,000,000.

The probability of a segmentation can be expressed by: 

where p1, p2, ... pn are the possible words obtained on a segmentation.

Knowing how the calculations should be done, it's necessary to deal with the problem of hardware limitations. A string with n characters has 2n-1 ways of being split, which would result in over one million calculations just to identify the best segmentation of the string "thisishowitshouldwork".

To deal with this problem, it's possible to adjust the algorithm to always split the string in two pieces, f (first part) and r (rest). If the probability of f is X times bigger than r's, the agent has likely found a good segmentation in f, and should try to segment r as well.


Read more about: