Unsupervised learning shows which senators vote alike


In this article, I get into the details of how to write a program capable of grouping Brazilian senators according to their votes: all the way from getting the necessary data, through some code and math, to finally reaching the Senate's inner circles.

Introduction

Brazil's legislative scene is complex and ever-changing. A basic understanding of how our representatives behave is usually challenging even for the most well-informed of us. To the ordinary citizen, this information may never arrive, and when it does, it's usually uninterpretable or heavily editorialized.

Luckily, the law enforces our government's transparency and accountability, so that valuable information can be found on online portals like the Senate's. Don't get too excited, though, as access to information is another right we've been struggling to fully secure in Brazil.

I started this project during my internship at Infosimples, aiming to extract impartial yet meaningful conclusions from politicians' votes. It was worth the effort, since it put to the test some concepts from Udacity's Intro to Artificial Intelligence course (well, it also brought in curious readers like you! Leave some feedback later).

In machine learning (a subfield of AI), one possible approach is unsupervised learning. Unlike in supervised learning, the program knows nothing about the input data that would help it make decisions, such as rules to follow or patterns to look for.

Neither does it receive an evaluation of its past decisions, as is the case in reinforcement learning. It's like handing your company's spreadsheets to the new intern with no instructions and refusing to tell them whether they're doing it right. Except here it works.

Clustering is a well-known form of unsupervised learning that aims to split a set of objects into groups (clusters) that bring together the most similar among them. One algorithm capable of doing that is k-means, which was part of my virtual classes on AI and will be explained further below.

Data on the votings

As mentioned at the start of the article, our Senate provides data on its votings (and other things) through its Open Data portal. The type of dataset we'll work with is "Nominal votings".

For this article in particular, I got data on the votings held during 2013 (click to download), but it should work with any of the annual listings. The data comes in XML, which is open-data friendly (open standard, non-proprietary, and structured); thankfully so, as I'd need to write a whole saga here if we only had PDFs.

Although it does not come with a detailed description of the proposals under voting (usually just a short paragraph), this XML lists all the year's votings and who took part in each. The big drawback is that some votings are held in secret, leaving us with no record of whether a senator voted for or against a proposal.

In the following analysis, we account only for open votings (i.e., no secret ones). For the year of 2013, that's only 76 out of 180. (Oh, if only we knew about the other 104...)

Parsing the data

As one can see by opening any of the listed XML files, there's a lot of information we won't need for our purposes (I'm sorry if it slowed down your browser... unless you're using IE, in which case it was unavoidable and well-deserved).

Using a Ruby environment already set up on my computer, I reduced this data to the essentials and saved it in a more convenient format. For that I used a Ruby library called Nokogiri. It's easy to install and use, and it can search through CSS3 selectors (typically for HTML) or XPath (our case, for XML).

The XML file we'll work with has the following structure:
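Here's a simplified sketch of the relevant parts. The root element's name and the sample values are made up for illustration; the child elements are the ones the parsing code below relies on:

<ListaVotacoes>
  <Votacao>
    <Secreta>N</Secreta>
    ...
    <Votos>
      <VotoParlamentar>
        <CodigoParlamentar>123</CodigoParlamentar>
        <NomeParlamentar>Fulano de Tal</NomeParlamentar>
        <Voto>Sim</Voto>
      </VotoParlamentar>
      ...
    </Votos>
  </Votacao>
  ...
</ListaVotacoes>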

Each <VotoParlamentar> node ("Parliamentarian Vote") gives us, among other things, the senator's name, an identifier code, and how that senator voted. The <Votacao> node ("Voting") has fields I used to distinguish each voting (like a code), but also a <Secreta> ("Secret") child, whose value is a single character indicating whether the voting is secret ("S", for "yes") or not ("N"). If you wish, "S" might also stand for *sigh*, which is my expression when facing secret votings.

require 'nokogiri'
doc = Nokogiri::XML(File.read("ListaVotacoes2013.xml")) # hypothetical filename

# Keep only the open votings (<Secreta> equals "N")
open_votings = []
doc.xpath("//Secreta").each do |node|
  open_votings << node.parent if node.text.match(/S/).nil?
end

In order to register each vote, I built two Hashes (Ruby's sets of key-value pairs). The first has all senators' codes as keys; each key points to a new Hash (representing the senator), whose keys are "name" plus the codes of each voting the senator attended. See this example:
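Something like this (the codes, names, and voting identifiers are made up for illustration):

senators = {
  "123" => { name: "Fulano de Tal", "voting-01" => true, "voting-42" => false },
  "456" => { name: "Beltrano da Silva", "voting-01" => false }
}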

The second Hash is just the opposite: each voting identifier is a key, and each points to a new Hash (representing the voting), whose keys are the codes of the senators who attended that voting. In the end, this second structure wasn't necessary, but it was still nice to have around.

senators, votings = {}, {}
open_votings.each do |ov|
  voting = {}
  id_ov = identifier(ov) # method that creates a unique code for this voting

  ov.xpath("Votos/VotoParlamentar").each do |pv| # each parliamentarian's vote
    vote = in_favor?(pv.xpath("Voto")) # method turns the vote into a boolean
    pc = pv.xpath("CodigoParlamentar").text # parliamentarian code
    senators[pc] ||= {name: pv.xpath("NomeParlamentar").text}
    senators[pc][id_ov] = vote

    voting[pc] = vote
  end
  votings[id_ov] = voting
end
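The two helper methods are referenced but not shown; here's a minimal sketch of what they might look like. The XML field used in identifier and the "Sim" comparison are assumptions of mine, not something taken from the original code:

# Hypothetical: builds a unique code for a voting from a field assumed to exist
def identifier(voting)
  voting.xpath("CodigoSessaoVotacao").text
end

# Hypothetical: treats "Sim" ("yes") as a vote in favor, anything else as not
def in_favor?(vote_node)
  vote_node.text.strip == "Sim"
end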

Next came the writing of those Hashes into JSON files (a predictable choice for a Ruby programmer, but pick whatever floats your boat), so that all this data is ready for use whenever needed.
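That step can be as short as this (the filenames are my choice):

require 'json'
File.write("senators.json", JSON.pretty_generate(senators))
File.write("votings.json", JSON.pretty_generate(votings))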

K-means

This algorithm, despite not being suitable for some clustering cases, is easy to implement and quite popular. You can see an example of its use on a set of 2D points below. The name comes from the fact that it divides objects into k groups, and it does that by taking the means of the objects' properties.


Example of k-means with k = 2. We start at (a) with a set of unclassified objects (green dots). At (b) we pick two random points in the 2D space (a red "x" and a blue "x"). At (c) we check which green dots are closer to red or blue and assign them that color. At (d) we calculate the centroid (average coordinates) of each cluster and move the respective "x" onto it. At (e) we update the dots that are now closer to a different centroid than before (some of the red ones became blue and vice versa). At (f) we recalculate the centroids and realize that every dot stays in its cluster. Since nothing changed, neither the centroids nor the dots will ever move again. There's nothing else to do: the algorithm has converged :D

Unlike the example above, where every dot has two coordinates, our task of grouping senators according to their votes comes with two complications:

  1. The number of dimensions is much larger (one for each voting, up to 76), so iterations take longer (and the result can't be visualized graphically).
  2. Not every senator (I dare say, not even one) attended and voted in every single voting, which leaves many "undefined coordinates". This messes up the usual concept of distance, as we'll see ahead.

Remember the intern with no training I told you about? Well, I guess I was exaggerating, since k-means needs a few basic tips in order to work well. I can actually imagine a generic program that wouldn't need the following specifications, but that is another story and shall be told another time.

Distance

A basic requirement for k-means is the existence of a function that computes the distance between any two objects we wish to put into clusters. The usual recommendation is Euclidean distance (the old formula \(d^2 = (x^a_0 - x^b_0)^2 + (x^a_1 - x^b_1)^2 +{ }...{ }+ (x^a_{n-1} - x^b_{n-1})^2\), where a and b are two n-dimensional points), but that wouldn't work here. For senators and their votes, I wrote a formula that recalls the Laplace smoothing seen in Udacity's AI course.

\(prob = { sim + 1 \over com + 2 }\)

Here prob estimates the probability of two senators belonging to the same cluster, given their similarity (sim) in votes and the number of votings they both attended (com).

The parameter com is easily obtained by listing the keys common to both senators' Hashes, while sim is incremented (+1) for each common voting in which both senators voted the same way. The probability is therefore always greater than zero and less than one. Whenever two senators have no votings in common, the probability is 0.5 (we can't yet tell whether they belong to the same cluster).

For the purpose of having a definition of distance, we keep it between zero and one simply by using:

\(dist = 1 - prob\)
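In Ruby, with each senator stored as a Hash like the ones built earlier (votes converted to 1.0/0.0, a step explained in the Centroid section below), the distance might look like this. Scoring each common voting as \(1 - |a - b|\) is my generalization so the same function also works against fractional centroid coordinates; for pure 0/1 votes it reduces to counting exact matches:

def distance(a, b)
  common = (a.keys & b.keys) - [:name]             # votings both attended (com)
  sim = common.sum { |v| 1.0 - (a[v] - b[v]).abs } # agreement per common voting
  1.0 - (sim + 1.0) / (common.size + 2)            # dist = 1 - prob
end

If the two share no votings, common is empty and the result is 1 - 1/2 = 0.5, as intended.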

Be aware that, even with an unquestionable formula (feel free to improve or change mine), k-means by itself can't guarantee we'll find the best global solution (only a local one). Yet (spoilers), the final results are satisfactory.

Centroid

Another requirement for k-means is a way of calculating each cluster's centroid, whose coordinates are the average over every object inside the cluster. With that in mind, I had to swap the true/false vote values for the decimal numbers 1.0 and 0.0, respectively (a short Ruby method was entrusted with that brave task).

This way, each coordinate of a centroid becomes the fraction of affirmative votes in a voting. E.g.: in a voting where 5 senators of a cluster said YES and 3 others said NO, the value of this coordinate within this centroid will be \(5 \over 5+3\), or 0.625.
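A sketch of both steps, the true/false swap and the centroid itself (here cluster is an Array of senator Hashes):

# The "brave task": swap each vote's true/false for 1.0/0.0
def to_numbers(senator)
  senator.map { |k, v| [k, k == :name ? v : (v ? 1.0 : 0.0)] }.to_h
end

# Centroid: for each voting, average the votes of the cluster's senators
def centroid(cluster)
  votes = Hash.new { |h, k| h[k] = [] }
  cluster.each do |sen|
    sen.each { |voting, vote| votes[voting] << vote unless voting == :name }
  end
  votes.transform_values { |vs| vs.sum / vs.size }
end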

Algorithm

The last requirement to mention is that k-means needs us to pick a value for k, the number of clusters into which the objects must be distributed (a downside that many people involved with machine learning despise). We start here with the case k = 2, meaning we'll try to split the Senate in two.

The algorithm (fully and finally) is described now: first, we create k random points, one per cluster. For each voting (coordinate) of each point, we make up a random factor x, with \(0 < x < 1\). These become the initial centroids (in case they land at the extreme right and the extreme left, it's probably just a terminology coincidence :P).

Next comes an iteration: for each senator, we calculate the distance to each cluster's centroid and assign the senator to the nearest one. The iteration ends by recalculating each cluster's centroid (since new senators have just landed in the clusters).

We repeat that iteration, which may end with a different set of clusters. If nothing changes compared to the previous iteration (the senators remain together in the same clusters), the algorithm has converged and there's no point insisting on it.

For bigger values of k, we kick off with more clusters (by creating more arbitrary centroids), and we must then ensure no cluster becomes empty throughout the iterations. If that happens, a new arbitrary centroid is calculated to replace the empty cluster's, and a new iteration follows.
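Putting those pieces together, here's a compact sketch of the whole procedure, reusing the hypothetical distance and centroid helpers from above (voting_ids is the list of voting codes):

def k_means(senators, voting_ids, k)
  new_centroid = -> { voting_ids.map { |v| [v, rand] }.to_h } # random coordinates
  centroids = Array.new(k) { new_centroid.call }
  clusters = nil
  loop do
    groups = Array.new(k) { [] }
    senators.each do |sen| # assign each senator to the nearest centroid
      groups[(0...k).min_by { |i| distance(sen, centroids[i]) }] << sen
    end
    if groups.any?(&:empty?) # empty cluster: fresh random centroid, iterate again
      centroids = groups.map { |g| g.empty? ? new_centroid.call : centroid(g) }
      next
    end
    break if groups == clusters # same memberships as last time: converged
    clusters = groups
    centroids = groups.map { |g| centroid(g) }
  end
  clusters
end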

Since we can't guarantee the best global solution, the procedure is to register (in a file) every result the algorithm converges to. The results are appended to the file dozens, hundreds of times; each run starts from k new random centroids, so the results (probably) vary. With all convergence results registered, I can later check which one was the most common.
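The tallying can be as simple as counting a canonical form of each convergence result (again using the hypothetical sketches above):

tally = Hash.new(0)
1000.times do
  result = k_means(senators.values, votings.keys, 2)
  key = result.map { |c| c.map { |s| s[:name] }.sort }.sort # order-independent
  tally[key] += 1
end
most_common, occurrences = tally.max_by { |_, count| count }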

Results

As indicated by the following table, we ran the algorithm a thousand times for each listed value of k. The last two columns refer to the best candidate for the global solution (the most common result).

k | algorithm runs | occurrences of the most common result | relative frequency
2 | 1000 | 317 | 31.7%
3 | 1000 | 175 | 17.5%
4 | 1000 | 6 | 0.6%
5 | 1000 | 2 | 0.2%

As k increases, the chance of getting repeated results falls. I decided to ignore senators with fewer than 10 open votings (I actually googled their names to understand such absences), since many convergences left them in an isolated group (especially for k > 3) while the remaining senators got sorted into k - 1 clusters.

Also, during the runs I noticed that the number of iterations needed to converge grew (on average) with k. The running time for a thousand trials went from a little over a minute (k = 2) to over an hour (k = 5, in which case only one result ever reappeared).

One might say the way I implemented the algorithm isn't suited to bigger values of k. I suspect this is mostly due to the chosen distance formula, which is not Euclidean as most references say it should be for k-means.

For the curious ones, though, check this: through these identities (which I couldn't even have thought of by myself), we see that the number of possible ways to cluster 85 senators is absurdly huge. For k = 2, it's on the order of \(10^{25}\); for k = 3, \(10^{39}\); for k = 4, \(10^{50}\); and for k = 5, \(10^{57}\). Therefore, even at very small relative frequencies, getting a single repeated result is far too rare to be taken as simple coincidence.
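(Judging by those magnitudes, the linked identities are the Stirling numbers of the second kind, which count the ways to partition n objects into k non-empty groups:

\(S(n, k) = {1 \over k!} \sum_{j=0}^{k} (-1)^j {k \choose j} (k - j)^n\)

For n = 85 and k = 2, this gives \(S(85, 2) = 2^{84} - 1 \approx 1.9 \times 10^{25}\), matching the order of magnitude above.)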

See below how a classroom full of senators would end up if the teacher asked them to split into...

k = 2

Cluster 1: Acir Gurgacz, Alfredo Nascimento, Ana Rita, Angela Portela, Anibal Diniz, Antonio Carlos Rodrigues, Antonio Carlos Valadares, Armando Monteiro, Benedito de Lira, Casildo Maldaner, Ciro Nogueira, Clésio Andrade, Delcídio do Amaral, Eduardo Amorim, Eduardo Braga, Eduardo Lopes, Eduardo Suplicy, Epitácio Cafeteira, Eunício Oliveira, Francisco Dornelles, Gim, Humberto Costa, Inácio Arruda, Ivo Cassol, Jader Barbalho, Jorge Viana, José Pimentel, José Sarney, João Alberto Souza, João Capiberibe, João Durval, João Ribeiro, João Vicente Claudino, Kátia Abreu, Lindbergh Farias, Lobão Filho, Luiz Henrique, Lídice da Mata, Magno Malta, Mozarildo Cavalcanti, Paulo Davim, Paulo Paim, Ricardo Ferraço, Rodrigo Rollemberg, Romero Jucá, Sérgio Petecão, Sérgio Souza, Valdir Raupp, Vanessa Grazziotin, Vicentinho Alves, Vital do Rêgo, Walter Pinheiro, Wellington Dias, Zeze Perrella

Cluster 2: Aloysio Nunes Ferreira, Alvaro Dias, Ana Amélia, Ataídes Oliveira, Aécio Neves, Blairo Maggi, Cristovam Buarque, Cyro Miranda, Cássio Cunha Lima, Cícero Lucena, Fernando Collor, Flexa Ribeiro, Jarbas Vasconcelos, Jayme Campos, José Agripino, Lúcia Vânia, Mário Couto, Osvaldo Sobrinho, Paulo Bauer, Pedro Simon, Pedro Taques, Randolfe Rodrigues, Roberto Requião, Ruben Figueiró, Waldemir Moka, Wilder Morais

k = 3

Cluster 1: Acir Gurgacz, Ana Rita, Angela Portela, Anibal Diniz, Antonio Carlos Valadares, Benedito de Lira, Casildo Maldaner, Delcídio do Amaral, Eduardo Amorim, Eduardo Suplicy, Epitácio Cafeteira, Fernando Collor, Gim, Humberto Costa, Inácio Arruda, Jorge Viana, José Pimentel, João Capiberibe, João Durval, João Ribeiro, Kátia Abreu, Lindbergh Farias, Lídice da Mata, Magno Malta, Osvaldo Sobrinho, Paulo Davim, Paulo Paim, Randolfe Rodrigues, Ricardo Ferraço, Roberto Requião, Rodrigo Rollemberg, Sérgio Souza, Vicentinho Alves, Vital do Rêgo, Waldemir Moka, Walter Pinheiro, Wellington Dias

Cluster 2: Alfredo Nascimento, Antonio Carlos Rodrigues, Armando Monteiro, Ciro Nogueira, Clésio Andrade, Eduardo Braga, Eduardo Lopes, Eunício Oliveira, Francisco Dornelles, Ivo Cassol, Jader Barbalho, José Sarney, João Alberto Souza, João Vicente Claudino, Lobão Filho, Luiz Henrique, Mozarildo Cavalcanti, Romero Jucá, Sérgio Petecão, Valdir Raupp, Vanessa Grazziotin, Zeze Perrella

Cluster 3: Aloysio Nunes Ferreira, Alvaro Dias, Ana Amélia, Ataídes Oliveira, Aécio Neves, Blairo Maggi, Cristovam Buarque, Cyro Miranda, Cássio Cunha Lima, Cícero Lucena, Flexa Ribeiro, Jarbas Vasconcelos, Jayme Campos, José Agripino, Lúcia Vânia, Mário Couto, Paulo Bauer, Pedro Simon, Pedro Taques, Ruben Figueiró, Wilder Morais

k = 4

Cluster 1: Acir Gurgacz, Ana Rita, Angela Portela, Anibal Diniz, Antonio Carlos Valadares, Benedito de Lira, Casildo Maldaner, Delcídio do Amaral, Eduardo Amorim, Eduardo Suplicy, Epitácio Cafeteira, Fernando Collor, Gim, Humberto Costa, Inácio Arruda, Jorge Viana, José Pimentel, João Capiberibe, João Durval, João Ribeiro, Kátia Abreu, Lindbergh Farias, Lídice da Mata, Magno Malta, Paulo Davim, Paulo Paim, Ricardo Ferraço, Rodrigo Rollemberg, Sérgio Souza, Vicentinho Alves, Vital do Rêgo, Walter Pinheiro, Wellington Dias

Cluster 2: Aloysio Nunes Ferreira, Ataídes Oliveira, Ivo Cassol, Jader Barbalho, Jarbas Vasconcelos, Jayme Campos, Luiz Henrique, Paulo Bauer, Wilder Morais

Cluster 3: Alvaro Dias, Ana Amélia, Aécio Neves, Blairo Maggi, Cristovam Buarque, Cyro Miranda, Cássio Cunha Lima, Cícero Lucena, Flexa Ribeiro, José Agripino, Lúcia Vânia, Mário Couto, Osvaldo Sobrinho, Pedro Simon, Pedro Taques, Randolfe Rodrigues, Roberto Requião, Ruben Figueiró, Waldemir Moka

Cluster 4: Alfredo Nascimento, Antonio Carlos Rodrigues, Armando Monteiro, Ciro Nogueira, Clésio Andrade, Eduardo Braga, Eduardo Lopes, Eunício Oliveira, Francisco Dornelles, José Sarney, João Alberto Souza, João Vicente Claudino, Lobão Filho, Mozarildo Cavalcanti, Romero Jucá, Sérgio Petecão, Valdir Raupp, Vanessa Grazziotin, Zeze Perrella

k = 5

Cluster 1: Acir Gurgacz, Ana Rita, Angela Portela, Anibal Diniz, Antonio Carlos Valadares, Benedito de Lira, Casildo Maldaner, Delcídio do Amaral, Eduardo Amorim, Eduardo Suplicy, Fernando Collor, Gim, Humberto Costa, Inácio Arruda, Jorge Viana, José Pimentel, João Capiberibe, João Durval, João Ribeiro, Kátia Abreu, Lindbergh Farias, Lídice da Mata, Magno Malta, Paulo Davim, Paulo Paim, Ricardo Ferraço, Rodrigo Rollemberg, Sérgio Souza, Vicentinho Alves, Walter Pinheiro, Wellington Dias

Cluster 2: Aécio Neves, Cássio Cunha Lima, Cícero Lucena, Epitácio Cafeteira, Flexa Ribeiro, Jarbas Vasconcelos, José Agripino, Vital do Rêgo

Cluster 3: Aloysio Nunes Ferreira, Ataídes Oliveira, Ivo Cassol, Jader Barbalho, Luiz Henrique, Paulo Bauer, Wilder Morais

Cluster 4: Alfredo Nascimento, Antonio Carlos Rodrigues, Armando Monteiro, Ciro Nogueira, Clésio Andrade, Eduardo Braga, Eduardo Lopes, Eunício Oliveira, Francisco Dornelles, José Sarney, João Alberto Souza, João Vicente Claudino, Lobão Filho, Mozarildo Cavalcanti, Romero Jucá, Sérgio Petecão, Valdir Raupp, Vanessa Grazziotin, Zeze Perrella

Cluster 5: Alvaro Dias, Ana Amélia, Blairo Maggi, Cristovam Buarque, Cyro Miranda, Jayme Campos, Lúcia Vânia, Mário Couto, Osvaldo Sobrinho, Pedro Simon, Pedro Taques, Randolfe Rodrigues, Roberto Requião, Ruben Figueiró, Waldemir Moka

Our analysis for the smaller values of k turned out to be quite solid, letting us perceive the two or three main political currents in our Senate. And that's without resorting to any rules, ideological or party concepts, or even the votings' subjects: simply through the similarity of voting records.

What you do with that information is up to you. You can find out much more just from the XML we worked with; imagine what's possible with all the other sources. You might also want to check out the tools developed by Radar Parlamentar, which are way more elaborate and also served as inspiration for this project. If you have any questions or anything to add, please comment or leave me a message! Thanks for being curious and patient!
