# Unsupervised learning shows which senators vote alike

Published on Data Science Programming

In this article, I go into the details of writing a program capable of grouping Brazilian senators according to their votes: all the way from getting the necessary data, through some code and math, to the Senate's inner circles.

## Introduction

Brazil's legislative scene is complex and ever-changing. A basic understanding of how our representatives behave is challenging even for the most well-informed of us. For the ordinary citizen, this information may never arrive, and when it does, it's often uninterpretable or heavily editorialized.

Luckily, the law enforces our government's transparency and accountability; that valuable information can be found on online portals like the Senate's. Don't get too excited, though, as information access is another right we've been struggling to fully secure in Brazil.

I decided to start this project during my internship at Infosimples, aiming to extract impartial yet meaningful conclusions from politicians' votes. It was worth the effort, since it put to the test some concepts from Udacity's Intro to Artificial Intelligence (well, it also brought in curious readers like you! Leave some feedback later).

In machine learning (a subfield of AI), one possible approach is unsupervised learning. Unlike in supervised learning, the program knows nothing about the input data that would help it make decisions: no rules to follow, no patterns to look for.

Neither does it receive an evaluation of its past decisions, as in reinforcement learning. It's like handing your company's spreadsheets to the new intern with no instructions and refusing to tell them whether they're doing it right. Except here it works.

Clustering is a well-known form of unsupervised learning; it aims to split a bunch of objects into groups (clusters) that bring together the most similar among them. One algorithm capable of doing that is k-means, which was part of my virtual classes on AI and will be explained further below.

## Data on the votings

As mentioned at the start of the article, our Senate provides data on its votings (and other things) through its Open Data portal. The type of dataset we'll work with is "Nominal votings".

For this article in particular, I used the data on the votings held during 2013, but the code should work with any of the annual listings. It comes in XML, an open-data-friendly format (open standard, non-proprietary and structured); thankfully, as I'd need to write a whole saga here if we only had PDFs.

Although it doesn't come with a detailed description of each proposal under voting (usually just a short paragraph), this XML brings all the votings of the year and who took part in each. The big drawback is that some votings are held in secret, leaving us with no record of whether a senator voted for or against a proposal.

In the following analysis, we account only for open votings (i.e., no secret ones). For 2013, that's only 76 out of 180. (Oh, if only we knew about the other 104...)

## Parsing the data

As you can see by opening any of the listed XML files, there's a lot of information we won't need for our purposes (I'm sorry if it slowed down your browser... unless you're using IE, in which case it was unavoidable and well-deserved).

Using a Ruby environment already set up on my computer, I reduced this data to what's vital and saved it in a more convenient format. For that I used a Ruby library called Nokogiri. It's easy to install and use, since it lets you search through CSS3 selectors (for HTML files) or XPath expressions (XML, our case).

The XML file we'll work with has the following structure:

Each <VotoParlamentar> node ("Parliamentarian Vote") carries properties including the name, an identifier code and this senator's vote. The <Votacao> ("Voting") node has properties I used to distinguish each voting (like a code), and also a <Secreta> ("Secret") child node, whose value is a single character indicating whether the voting is secret ("S", for "yes") or not ("N"). If you wish, "S" might also stand for *sigh*, which is my expression when facing secret votings.
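Since the original file isn't reproduced here, the sketch below reconstructs the layout from the node names just described; the wrapper element names, codes and values are made up for illustration:

```xml
<ListaVotacoes>
  <Votacoes>
    <Votacao>
      <Secreta>N</Secreta> <!-- "S" = secret voting, "N" = open -->
      <Votos>
        <VotoParlamentar>
          <CodigoParlamentar>4994</CodigoParlamentar>
          <NomeParlamentar>Senator Example</NomeParlamentar>
          <Voto>Sim</Voto>
        </VotoParlamentar>
        <!-- one <VotoParlamentar> per senator who attended -->
      </Votos>
    </Votacao>
    <!-- one <Votacao> per voting held in the year -->
  </Votacoes>
</ListaVotacoes>
```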

Filtering out the secret votings is then straightforward (the file name below is my placeholder; use the path of the XML you downloaded):

```ruby
require 'nokogiri'

doc = Nokogiri::XML(File.read("votacoes_2013.xml")) # hypothetical file name

open_votings = []
doc.xpath("//Secreta").each do |node|
  # keep the parent <Votacao> only if <Secreta> is not "S" (secret)
  open_votings << node.parent if node.text.match(/S/).nil?
end
```

In order to register each vote, I built two Hashes (Ruby's sets of key-value pairs). The first has all senators' codes as keys; each key points to a new Hash (representing the senator) whose keys are :name plus the codes of each voting the senator attended.
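A sketch of what one entry of that first Hash might look like; the senator code, name and voting codes below are made up for illustration:

```ruby
# Hypothetical entry of the `senators` Hash: the key is the senator's code,
# the value holds the name plus one boolean per voting attended.
senators = {
  "4994" => {
    name: "Senator Example",     # made-up name
    "votacao_2013_001" => true,  # voted in favor
    "votacao_2013_002" => false  # voted against
  }
}
```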

The second Hash is just the opposite: each voting identifier is a key, and each points to a new Hash (representing the voting) whose keys are the codes of the senators who attended that voting. In the end, this second structure wasn't necessary, but it was still nice to have around.

Building both structures in one pass:

```ruby
senators, votings = {}, {}
open_votings.each do |ov|
  voting = {}
  id_ov = identifier(ov) # method that creates a unique code for this voting

  ov.xpath("Votos").children.each do |pv| # each parliamentarian's vote
    vote = in_favor?(pv.xpath("Voto"))      # method gets boolean value of the vote
    pc = pv.xpath("CodigoParlamentar").text # parliamentarian code
    senators[pc] = { name: pv.xpath("NomeParlamentar").text } if senators[pc].nil?
    senators[pc][id_ov] = vote

    voting[pc] = vote
  end
  votings[id_ov] = voting
end
```

Next came the writing of those Hashes into JSON files (a predictable choice for a Ruby programmer, but pick whatever floats your boat), so that all this data is ready for use whenever needed.
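The dump itself is a couple of lines; the file name is my own choice, and note that Ruby's JSON turns the `:name` symbol key into the string `"name"` when read back:

```ruby
require 'json'

# Persist the senators Hash for later runs (file name is hypothetical)
senators = { "4994" => { name: "Senator Example", "v001" => true } }
File.write("senators.json", JSON.pretty_generate(senators))

# When reloading, all keys come back as strings (symbols don't survive JSON)
reloaded = JSON.parse(File.read("senators.json"))
```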

## K-means

This algorithm, despite not being suitable for every clustering problem, is easy to implement and quite popular. You can see an example of its use on a set of 2D points below. The name comes from the fact that it divides objects into k groups, and it does so by taking the means of the objects' properties.

Example of k-means with k = 2. We start at (a) with a set of unclassified objects (green dots). At (b) we pick two random points in the 2D space (a red "x" and a blue "x"). At (c) we check which green dots are closer to red or to blue and assign them their new color. At (d) we calculate the centroid (average coordinates) of each cluster and move the respective "x" onto it. At (e) we update the dots that are now closer to a different centroid than before (some of the red ones became blue and vice versa). At (f) we recalculate the centroids and realise every dot remains in its cluster. Since nothing changed, the centroids won't move and neither will the dots, ever again. There's nothing left to do: the algorithm has converged :D

Unlike the example above, where every dot has two coordinates, our task of grouping senators according to their votes poses two problems:

1. The number of dimensions we're working with is much larger (one per voting, up to 76). Iterations therefore take longer (and no graphical visualization can be made of them).
2. Not every senator (I dare say, not even one) attended and voted in every single voting, which leaves many "undefined coordinates". This messes up the usual concept of distance, as we'll see ahead.

Remember the intern with no training I told you about? Well, I guess I was exaggerating, since k-means needs a few basic tips in order to work well. I can actually imagine a generic program that wouldn't need the following specifications, but that is another story and shall be told another time.

### Distance

A basic requirement for k-means is a function that computes the distance between any two objects we wish to cluster. The usual recommendation is the Euclidean distance (ye olde formula, $$d^2 = (x^a_0 - x^b_0)^2 + (x^a_1 - x^b_1)^2 +{ }...{ }+ (x^a_{n-1} - x^b_{n-1})^2$$, where $$a$$ and $$b$$ are two n-dimensional points), but that wouldn't work here. For senators and their votes, I wrote a formula reminiscent of the Laplace smoothing seen in Udacity's AI course.

$$prob = { sim + 1 \over com + 2 }$$

where prob estimates the probability of two senators belonging to the same cluster, given their similarity (sim) in votes and the amount of votings they both attended (com).

Parameter com is easily obtained by listing the keys common to both senators' Hashes, while sim increases by 1 for each common voting in which both voted the same way. Therefore, the probability is always greater than zero and less than one. Whenever two senators have no common votings, the probability is 0.5 (we can't yet decide whether they belong to the same cluster).

For the purpose of having a definition of distance, we keep it between zero and one simply by using:

$$dist = 1 - prob$$
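Under the Hash structure built earlier, this distance can be sketched in a few lines (the method name and sample data are mine):

```ruby
# Distance between two senator Hashes, as defined above:
# prob = (sim + 1) / (com + 2), dist = 1 - prob
def distance(senator_a, senator_b)
  common = (senator_a.keys & senator_b.keys) - [:name] # votings both attended
  sim = common.count { |v| senator_a[v] == senator_b[v] } # identical votes
  1.0 - (sim + 1.0) / (common.size + 2.0)
end

a = { name: "A", "v1" => true, "v2" => false, "v3" => true }
b = { name: "B", "v1" => true, "v2" => true,  "v3" => true }
distance(a, b) # 2 equal votes out of 3 common: prob = 3/5, dist = 0.4
```

With no common votings, `common.size` is 0 and `sim` is 0, so the distance is exactly 0.5, matching the "can't decide yet" case above.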

Be aware that, even with an unquestionable formula (feel free to improve or change mine), k-means by itself can't guarantee we'll find the best global solution (only a local one). Yet, (spoilers) the final results are satisfactory.

### Centroid

Another requirement for k-means is a way of calculating each cluster's centroid, whose coordinates are the average of every object inside the cluster. With that in mind, I had to swap the true/false values for the decimal numbers 1.0 and 0.0, respectively (a short Ruby method was entrusted with that brave task).

This way, each coordinate of a centroid becomes the fraction of affirmative votes in a voting. E.g.: in a voting where 5 senators of a cluster said YES and 3 others said NO, the value of this coordinate within this centroid will be $$5 \over 5+3$$, or 0.625.
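As a sketch (the helper name is mine), the centroid of a cluster of senator Hashes could be computed like this:

```ruby
# Centroid of a cluster: for each voting attended by someone in the cluster,
# take the average of the votes mapped to 1.0 (YES) / 0.0 (NO).
def centroid(members)
  votes_per_voting = Hash.new { |h, k| h[k] = [] }
  members.each do |senator|
    senator.each do |voting, vote|
      next if voting == :name # skip the non-coordinate key
      votes_per_voting[voting] << (vote ? 1.0 : 0.0)
    end
  end
  votes_per_voting.transform_values { |vs| vs.sum / vs.size }
end
```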

### Algorithm

The last requirement to mention: k-means needs us to pick a value for k, the number of clusters into which the objects must be distributed (a lot of people involved with machine learning despise that downside). We start here with the case k = 2, meaning we'll try to split the Senate into two pieces.

The algorithm (fully and finally) is described now: first, we pick random points, one for each cluster (given k). For each of the votings (coordinates), we make up a random factor x, with $$0 < x < 1$$. These become the starting centroids (in case they land at the extreme right and the extreme left, it's probably just a terminology coincidence :P).

Next comes an iteration: for each senator, we calculate the distance to each cluster's centroid and assign the senator to the closest one. The iteration ends by recalculating each cluster's centroid (since senators have just landed in the clusters).

We repeat that iteration, which may end with a different set of clusters. If nothing changes compared to the previous iteration (the senators remain in the same clusters), the algorithm has converged and there's no point insisting on it.

For bigger values of k, we kick off with more clusters (by creating more arbitrary centroids), and we must also ensure no cluster becomes empty throughout the iterations. If that happens, a new arbitrary centroid is generated to replace the empty cluster's, and a new iteration follows.
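Putting the pieces together, here is a hedged sketch of the whole loop. The helper names are mine, and the text doesn't spell out how the prob formula applies to a fractional centroid, so `dist` below scores each vote by its agreement with the centroid's 0-to-1 coordinate (which reduces to the +1-per-match rule when a coordinate is exactly 0 or 1). Compact versions of the distance and centroid rules are repeated so the sketch runs standalone, and an iteration cap is added as a safeguard:

```ruby
# Agreement-based distance between a senator Hash and a float-valued centroid
def dist(senator, centroid)
  common = senator.keys & centroid.keys # :name never appears in a centroid
  sim = common.sum { |v| senator[v] ? centroid[v] : 1.0 - centroid[v] }
  1.0 - (sim + 1.0) / (common.size + 2.0)
end

# Average of the cluster's votes (1.0 = YES, 0.0 = NO), per voting
def centroid_of(members)
  acc = Hash.new { |h, k| h[k] = [] }
  members.each { |s| s.each { |v, vote| acc[v] << (vote ? 1.0 : 0.0) unless v == :name } }
  acc.transform_values { |vs| vs.sum / vs.size }
end

def random_centroid(voting_ids)
  voting_ids.map { |v| [v, rand] }.to_h # one random factor per voting
end

# senators: { code => senator Hash }; returns an array of k arrays of codes
def kmeans(senators, voting_ids, k, max_iter: 100)
  centroids = Array.new(k) { random_centroid(voting_ids) }
  prev = nil
  max_iter.times do
    clusters = Array.new(k) { [] }
    senators.each do |code, s|
      clusters[(0...k).min_by { |i| dist(s, centroids[i]) }] << code
    end
    return clusters if clusters == prev # converged: nobody moved
    prev = clusters
    centroids = clusters.map do |codes|
      # reseed empty clusters with a fresh arbitrary centroid
      codes.empty? ? random_centroid(voting_ids) : centroid_of(codes.map { |c| senators[c] })
    end
  end
  prev # iteration cap reached (the original simply loops until convergence)
end
```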

Since we can't guarantee the best global solution, the procedure is to register (in a file) the result of every convergence. Results get appended to the file dozens, hundreds of times; each run starts from k new random centroids, so the results (probably) vary. With all convergence results registered, I can later check which result was the most common.

## Results

As the following table indicates, we ran the algorithm a thousand times for each listed value of k. The last two columns refer to the best candidate for the global solution (the most common result).

| k | algorithm runs | occurrences of the most common result | relative frequency |
|---|----------------|---------------------------------------|--------------------|
| 2 | 1000           | 317                                   | 31.7%              |
| 3 | 1000           | 175                                   | 17.5%              |
| 4 | 1000           | 6                                     | 0.6%               |
| 5 | 1000           | 2                                     | 0.2%               |

As k increases, the chance of getting repeated results falls. I decided to ignore senators with fewer than 10 open votings (I actually googled their names to understand such absences), since many convergences left them in an isolated group (especially for k > 3) while the remaining senators got sorted into k - 1 clusters.

Also, during the runs, I noticed that the number of iterations needed to converge grew (on average) with k. The running time for a thousand trials went from a little over a minute (k = 2) to over an hour (k = 5, in which case only one result reappeared).

One might say the way I implemented the algorithm wasn't well suited to bigger values of k. I suspect this was mostly due to the chosen distance formula, which is not Euclidean as most references say it should be for k-means.

For the curious ones, though, check this: through these identities (which I couldn't have come up with by myself), we see that, among 85 senators, the number of possible ways to cluster them is absurdly huge: around $$10^{25}$$ for k = 2, $$10^{39}$$ for k = 3, $$10^{50}$$ for k = 4, and $$10^{57}$$ for k = 5. Therefore, even at very small relative frequencies, getting a single repeated result is far too rare for us to take it as a simple coincidence.
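Assuming the identities in question are the Stirling numbers of the second kind, S(n, k), which count the ways to partition n objects into exactly k non-empty groups, the magnitudes above can be checked in a few lines:

```ruby
# S(n, k) via the standard recurrence S(n, k) = k*S(n-1, k) + S(n-1, k-1),
# memoized so n = 85 stays fast (Ruby Integers are arbitrary-precision)
def stirling2(n, k, memo = {})
  return 1 if k == 1 || k == n
  return 0 if k < 1 || k > n
  memo[[n, k]] ||= k * stirling2(n - 1, k, memo) + stirling2(n - 1, k - 1, memo)
end

stirling2(85, 2) # equals 2**84 - 1, on the order of 10**25
```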

See below how a classroom full of senators would end up if the teacher asked them to split into...

k = 2

Cluster 1: Acir Gurgacz, Alfredo Nascimento, Ana Rita, Angela Portela, Anibal Diniz, Antonio Carlos Rodrigues, Antonio Carlos Valadares, Armando Monteiro, Benedito de Lira, Casildo Maldaner, Ciro Nogueira, Clésio Andrade, Delcídio do Amaral, Eduardo Amorim, Eduardo Braga, Eduardo Lopes, Eduardo Suplicy, Epitácio Cafeteira, Eunício Oliveira, Francisco Dornelles, Gim, Humberto Costa, Inácio Arruda, Ivo Cassol, Jader Barbalho, Jorge Viana, José Pimentel, José Sarney, João Alberto Souza, João Capiberibe, João Durval, João Ribeiro, João Vicente Claudino, Kátia Abreu, Lindbergh Farias, Lobão Filho, Luiz Henrique, Lídice da Mata, Magno Malta, Mozarildo Cavalcanti, Paulo Davim, Paulo Paim, Ricardo Ferraço, Rodrigo Rollemberg, Romero Jucá, Sérgio Petecão, Sérgio Souza, Valdir Raupp, Vanessa Grazziotin, Vicentinho Alves, Vital do Rêgo, Walter Pinheiro, Wellington Dias, Zeze Perrella

Cluster 2: Aloysio Nunes Ferreira, Alvaro Dias, Ana Amélia, Ataídes Oliveira, Aécio Neves, Blairo Maggi, Cristovam Buarque, Cyro Miranda, Cássio Cunha Lima, Cícero Lucena, Fernando Collor, Flexa Ribeiro, Jarbas Vasconcelos, Jayme Campos, José Agripino, Lúcia Vânia, Mário Couto, Osvaldo Sobrinho, Paulo Bauer, Pedro Simon, Pedro Taques, Randolfe Rodrigues, Roberto Requião, Ruben Figueiró, Waldemir Moka, Wilder Morais

k = 3

Cluster 1: Acir Gurgacz, Ana Rita, Angela Portela, Anibal Diniz, Antonio Carlos Valadares, Benedito de Lira, Casildo Maldaner, Delcídio do Amaral, Eduardo Amorim, Eduardo Suplicy, Epitácio Cafeteira, Fernando Collor, Gim, Humberto Costa, Inácio Arruda, Jorge Viana, José Pimentel, João Capiberibe, João Durval, João Ribeiro, Kátia Abreu, Lindbergh Farias, Lídice da Mata, Magno Malta, Osvaldo Sobrinho, Paulo Davim, Paulo Paim, Randolfe Rodrigues, Ricardo Ferraço, Roberto Requião, Rodrigo Rollemberg, Sérgio Souza, Vicentinho Alves, Vital do Rêgo, Waldemir Moka, Walter Pinheiro, Wellington Dias

Cluster 2: Alfredo Nascimento, Antonio Carlos Rodrigues, Armando Monteiro, Ciro Nogueira, Clésio Andrade, Eduardo Braga, Eduardo Lopes, Eunício Oliveira, Francisco Dornelles, Ivo Cassol, Jader Barbalho, José Sarney, João Alberto Souza, João Vicente Claudino, Lobão Filho, Luiz Henrique, Mozarildo Cavalcanti, Romero Jucá, Sérgio Petecão, Valdir Raupp, Vanessa Grazziotin, Zeze Perrella

Cluster 3: Aloysio Nunes Ferreira, Alvaro Dias, Ana Amélia, Ataídes Oliveira, Aécio Neves, Blairo Maggi, Cristovam Buarque, Cyro Miranda, Cássio Cunha Lima, Cícero Lucena, Flexa Ribeiro, Jarbas Vasconcelos, Jayme Campos, José Agripino, Lúcia Vânia, Mário Couto, Paulo Bauer, Pedro Simon, Pedro Taques, Ruben Figueiró, Wilder Morais

k = 4

Cluster 1: Acir Gurgacz, Ana Rita, Angela Portela, Anibal Diniz, Antonio Carlos Valadares, Benedito de Lira, Casildo Maldaner, Delcídio do Amaral, Eduardo Amorim, Eduardo Suplicy, Epitácio Cafeteira, Fernando Collor, Gim, Humberto Costa, Inácio Arruda, Jorge Viana, José Pimentel, João Capiberibe, João Durval, João Ribeiro, Kátia Abreu, Lindbergh Farias, Lídice da Mata, Magno Malta, Paulo Davim, Paulo Paim, Ricardo Ferraço, Rodrigo Rollemberg, Sérgio Souza, Vicentinho Alves, Vital do Rêgo, Walter Pinheiro, Wellington Dias

Cluster 2: Aloysio Nunes Ferreira, Ataídes Oliveira, Ivo Cassol, Jader Barbalho, Jarbas Vasconcelos, Jayme Campos, Luiz Henrique, Paulo Bauer, Wilder Morais

Cluster 3: Alvaro Dias, Ana Amélia, Aécio Neves, Blairo Maggi, Cristovam Buarque, Cyro Miranda, Cássio Cunha Lima, Cícero Lucena, Flexa Ribeiro, José Agripino, Lúcia Vânia, Mário Couto, Osvaldo Sobrinho, Pedro Simon, Pedro Taques, Randolfe Rodrigues, Roberto Requião, Ruben Figueiró, Waldemir Moka

Cluster 4: Alfredo Nascimento, Antonio Carlos Rodrigues, Armando Monteiro, Ciro Nogueira, Clésio Andrade, Eduardo Braga, Eduardo Lopes, Eunício Oliveira, Francisco Dornelles, José Sarney, João Alberto Souza, João Vicente Claudino, Lobão Filho, Mozarildo Cavalcanti, Romero Jucá, Sérgio Petecão, Valdir Raupp, Vanessa Grazziotin, Zeze Perrella

k = 5

Cluster 1: Acir Gurgacz, Ana Rita, Angela Portela, Anibal Diniz, Antonio Carlos Valadares, Benedito de Lira, Casildo Maldaner, Delcídio do Amaral, Eduardo Amorim, Eduardo Suplicy, Fernando Collor, Gim, Humberto Costa, Inácio Arruda, Jorge Viana, José Pimentel, João Capiberibe, João Durval, João Ribeiro, Kátia Abreu, Lindbergh Farias, Lídice da Mata, Magno Malta, Paulo Davim, Paulo Paim, Ricardo Ferraço, Rodrigo Rollemberg, Sérgio Souza, Vicentinho Alves, Walter Pinheiro, Wellington Dias

Cluster 2: Aécio Neves, Cássio Cunha Lima, Cícero Lucena, Epitácio Cafeteira, Flexa Ribeiro, Jarbas Vasconcelos, José Agripino, Vital do Rêgo

Cluster 3: Aloysio Nunes Ferreira, Ataídes Oliveira, Ivo Cassol, Jader Barbalho, Luiz Henrique, Paulo Bauer, Wilder Morais

Cluster 4: Alfredo Nascimento, Antonio Carlos Rodrigues, Armando Monteiro, Ciro Nogueira, Clésio Andrade, Eduardo Braga, Eduardo Lopes, Eunício Oliveira, Francisco Dornelles, José Sarney, João Alberto Souza, João Vicente Claudino, Lobão Filho, Mozarildo Cavalcanti, Romero Jucá, Sérgio Petecão, Valdir Raupp, Vanessa Grazziotin, Zeze Perrella

Cluster 5: Alvaro Dias, Ana Amélia, Blairo Maggi, Cristovam Buarque, Cyro Miranda, Jayme Campos, Lúcia Vânia, Mário Couto, Osvaldo Sobrinho, Pedro Simon, Pedro Taques, Randolfe Rodrigues, Roberto Requião, Ruben Figueiró, Waldemir Moka

Our analysis for the smaller values of k turned out pretty solid, letting us perceive the two or three main political currents in our Senate. And that's without resorting to any rules, ideological or party concepts, or the votings' themes - simply through the similarity of the records.

What you do based on that information is up to you. You can find out much more just from the XML we worked with; imagine what's possible with all the other sources. You might also want to check the tools developed by Radar Parlamentar, which are far more elaborate and also served as inspiration for this activity. If you have any doubt or something to add, please comment or leave me a message! Thanks for being curious and patient.