# Articles from Infosimples

Read the latest articles published at Infosimples.com

Client satisfaction survey results of 2015

Running a Company

Here at Infosimples, every year we conduct a satisfaction survey to collect clients' opinions about our services and to understand our strengths and weaknesses in the eyes of those who use them.

The survey results enable us to evolve and deliver a better service year after year.

In December 2015, we sent our most recent survey to 20 companies that use a data automation service we provide. In this article we share a compilation of the results.

## Infosimples solutions evaluation

The first part of the survey was about how much the solutions meet the clients' needs and how easy it is to integrate existing systems with the Infosimples APIs.

## Infosimples client support evaluation

The client support evaluation worried us: the perceived agility was rated lower than we would like, even though the professionals were considered to have excellent/good qualifications.

We are already taking action to improve support agility, and we expect to see a significant improvement in the next satisfaction survey.

## Net Promoter Score (NPS)

This was our first Net Promoter Score (NPS) evaluation.

"How likely is it that you would recommend Infosimples to a friend or colleague?"

The answer to this question reflects the client's loyalty and willingness to recommend the solution to others.

This metric ranges from -100 to +100.

The result was a +60 NPS. We consider this a good result because:

• The software industry NPS in 2015 was +19*
• There were no detractors (scores 0 through 6) - 0% detractors
• Promoters (scores 9 and 10) represented 60% of respondents
• Passives (scores 7 and 8) represented 40% of respondents
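For illustration, the NPS arithmetic (percentage of promoters minus percentage of detractors) can be checked with a short Python sketch; the individual scores below are hypothetical, chosen to match the reported percentages:

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

# 20 hypothetical responses: 60% promoters, 40% passives, no detractors
scores = [9] * 8 + [10] * 4 + [7] * 5 + [8] * 3
print(nps(scores))  # 60
```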

We believe this metric will be more helpful from the next survey on, when it will be possible to analyse our NPS evolution.

We would like to thank all respondents for dedicating a few minutes to answer the questions. The results are playing an important role in our plans for the year ahead.

We wish a great 2016 for everyone!

*Source: Satmetrix 2015 Consumer Net Promoter Benchmarks

TensorFlow is the Deep Learning library with the most growth potential

Data Science

Deep Learning libraries have gained popularity in Machine Learning applications. Here are the leading ones:

| Library | Support | Top users |
| --- | --- | --- |
| Caffe | C++, Python | NVIDIA, general industry |
| SINGA | C++, Python | Apache |
| Theano | Python | Yoshua Bengio (University of Montreal) |
| Torch | C, Lua | Yann LeCun (Facebook) |

I won't go into details on each one. It's worth mentioning that Theano and Torch are mostly used in scientific research, Caffe is mostly used in commercial Machine Learning applications, and SINGA, launched recently by the Apache Foundation, seems to compete mostly with Caffe.

At Infosimples we mostly use Caffe because it has great support for processing images. Any of these libraries is a great choice, and all have great communities with active development.

A new option was introduced yesterday, and that's why I've decided to write this mini article. I'm talking about TensorFlow, developed by a group of renowned researchers at Google. The researchers themselves are a top reason for it to catch our attention: they have published great articles in Machine Learning conferences, and some of them contributed to the development of the libraries mentioned in the table above.

In addition to being flexible enough for the scientific community, TensorFlow is also a great choice for deploying commercial Machine Learning applications, with great support for running on servers, desktops and mobile devices.

I think this tool has everything in place to become a leading resource in Machine Learning development. So, if you want to get into this field, now is a great time. I recommend beginning with the TensorFlow tutorials.

I hope to bring more news soon. I'll leave the TensorFlow launch video below.

A little about how Shazam works

Data Science Tutorials

## Have you ever heard of Shazam?

Shazam is a very popular mobile app that can recognize almost any song playing in your surroundings. You may be in a book store listening to the beat of a soothing jazz tune, or at your favorite American-style coffee shop, where the jukebox rocks a contagious but unknown country song. Just open the app, record a small piece of the music, and Shazam will tell you its name, its artist and where to buy it. Booya!

##### Just give it some seconds to guess (almost) any music.

In this article, inspired by and based on these readings, I'll try to explain the basics of how it works. How does it know so much? And how does it do it so fast?

## How does a computer see music?

Physically, a piece of sound is a mechanical wave that travels through the air with varying amplitudes and frequencies, causing our eardrums to vibrate. Visually, it can be represented by a messy scrawl, with peaks and valleys. If you have ever played with Windows' Sound Recorder, you know what I'm talking about.

#### When the computer records a sound, it takes samples of the height of that continuous wave at identical intervals, generating a plot full of dots:

Example of how a computer samples a sound. T is the interval between each sample.

If those dots are close enough to each other, the computer will have an easy time recomposing it. If that happens, you can imagine that they are so squeezed together that they look almost identical to the original wave. Therefore, the higher the sample rate (i.e.: the lower the time interval), the closer the digital sound is to the original sound. In other words, a good sampling rate provides a clear sound.

To give you an idea, a typical music CD stores its audio at 44100 samples per second (44.1 kHz), and so does the majority of the music you listen to on your smartphone or MP3 player. A DVD stores its audio at more than double this rate (96 kHz). Audiophiles claim that crystal-clear music is only achievable at 192 kHz.

Besides sound quality, an acceptable sample rate is important because the computer needs to reconstruct the continuous signal correctly:

The image above is an example of an incorrectly reconstructed wave. Notice how the recomposed signal (dashed in red) differs from the original sound (in black). This happens because two samples are too far from one another, and the computer connects the dots in the wrong way. According to the Nyquist rule (and it's not too hard to see), the sample rate needs to be at least twice the original wave frequency so the wave can be redrawn with no flaws.
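As a small illustration in plain Python (the 10 Hz tone and the two sample rates are made up), sampling below the Nyquist rate folds the tone down to a wrong, lower frequency:

```python
import math

f_signal = 10.0   # a 10 Hz sine

def dominant_freq(fs, duration=2.0):
    """Sample the sine at rate fs, then find the strongest bin with a direct DFT."""
    n = int(fs * duration)
    x = [math.sin(2 * math.pi * f_signal * k / fs) for k in range(n)]
    best_bin, best_mag = 0, 0.0
    for b in range(n // 2 + 1):           # bins up to the Nyquist frequency
        re = sum(x[k] * math.cos(2 * math.pi * b * k / n) for k in range(n))
        im = sum(x[k] * math.sin(2 * math.pi * b * k / n) for k in range(n))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_bin, best_mag = b, mag
    return best_bin * fs / n              # convert bin index back to Hz

print(dominant_freq(100.0))  # 10.0 -> sampled fast enough, reconstructed correctly
print(dominant_freq(12.0))   # 2.0  -> below the 20 Hz Nyquist rate, aliased down
```

Note how the 12 Hz sampling rate, below twice the signal frequency, makes the 10 Hz tone indistinguishable from a 2 Hz one.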

## A brief explanation on sines, cosines and complex numbers

Sinusoids and complex numbers are very important when dealing with any kind of signal. I'll try to simplify some things, so high-school knowledge should be sufficient to get an idea of what's going on. Here's a summary of the most important concepts:

Sines and cosines:
A sinusoidal wave can be written as follows:

$$f(t) = D + A\sin(k + \omega t)$$

A sinusoid with all its parameters.

Keep in mind that the most important parameters in this function are the amplitude $$A$$ and the angular frequency $$\omega$$, since they change the shape of the wave and tell us how it behaves over time, whereas $$D$$ and $$k$$ just shift it across the graph. $$A$$ tells how high the wave can go and $$\omega$$ tells how fast the wave travels along the time axis. If you have both of them, you can completely describe the behavior of a waveform.

Complex numbers

Complex numbers are an auxiliary tool in mathematics. In high school, most of you probably had a hard time finding a use for them. Although they don't map to anything tangible in real life, their properties can make mathematical operations a lot easier. They are very useful, for example, in Electrical Engineering, where we constantly deal with voltages, currents and many other periodic quantities.

The most important sine and cosine parameters can be synthesized in a formula involving the imaginary unit $$i$$, as follows: $$f(t) = Ae^{i \omega t}$$. This may be unfamiliar to those who never studied complex numbers a little deeper. What does it mean to raise a real number to the power of the imaginary unit?

As a result of a Maclaurin series expansion, from Calculus, it is possible to show that sines and cosines can be written using combinations of Euler's number raised to imaginary powers. If you want to go deeper, you can take a look at this weblog (units 1 to 3), but for this article it is sufficient to know that the main properties of a sinusoid can be condensed in the 'complex way' as $$f(t) = Ae^{i \omega t}$$, since it contains both $$A$$ and $$\omega$$. $$f(t)$$ is also equal to the much more high-school-familiar $$A(\cos(\omega t) + i\sin(\omega t))$$, where we just made the angle $$(\text{usually } \theta)$$ time-variable $$(\omega t)$$.

With this in mind, we can show a different way to describe those functions, with a graph just for $$A$$ and $$\omega$$. We'll soon see the reason for this.

##### Above, $$f(t) = \sin(20 \pi t)$$. Notice $$A = 1$$ and the frequency is $$\frac{20 \pi}{2 \pi} = 10$$. Below, $$F(\omega)$$, where we just make $$A$$ and $$\omega$$ explicit.

By doing this, we say that we are taking the function from the time domain to the frequency domain. This is very intuitive for simple sinewaves, but it gets a little more complex for other waveforms. Fortunately, there is a mathematical operation that takes us from one domain to the other. It's called the Fourier Transform:

$$F(\omega) = \int_{-\infty}^\infty f(t){e}^{-i \omega t}\,\mathrm{d}t$$

$$f(t)$$ denotes the time domain function and $$F(\omega)$$ the frequency domain function. There is no need to understand integrals here; just keep in mind that there is an algebraic way to obtain the frequency function from the time function (and vice versa, using the Inverse Fourier Transform).

## Decomposing scrawls into sinewaves

To ease our understanding, let's strategically add some sinusoids with different frequencies and amplitudes with the help of computer graphing software. Don't worry about the craziness of the function we chose, just take a look at its plot:

$$f(t) = \frac{4}{\pi}\left[\sin(\pi t) + \frac{\sin(3 \pi t)}{3} + \frac{\sin(5 \pi t)}{5} + \frac{\sin(7 \pi t)}{7} + \frac{\sin(9 \pi t)}{9} + \frac{\sin(11 \pi t)}{11}\right]$$

Note how this resembles the square wave:

Actually, if we keep infinitely adding sinusoids with this pattern we'll end up exactly with the square wave:

Cool, huh?

The point here is that we can achieve very interesting waveforms just by adding sinewaves. In fact, if we do it correctly, we can achieve any periodic waveform just by adding sinewaves.

Any waveform (even crazy ones) can be written as a sum of sines and cosines!

This decomposition is useful simply because trigonometric functions are very easy to work with.
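As a quick numerical check of the partial sum above (a plain Python sketch; the evaluation point is arbitrary):

```python
import math

def fourier_square(t, n_terms):
    """Partial Fourier sum of a square wave: (4/pi) * sum sin((2k+1) pi t)/(2k+1)."""
    return 4 / math.pi * sum(math.sin((2 * k + 1) * math.pi * t) / (2 * k + 1)
                             for k in range(n_terms))

# On (0, 1) the ideal square wave equals 1. Away from the jump points the
# partial sum converges (right at the jumps it overshoots: the Gibbs phenomenon)
print(round(fourier_square(0.5, 6), 3))  # 0.947 with the 6 terms of the formula above
print(fourier_square(0.5, 500))          # many more terms: much closer to 1
```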

However, it may be a little messy to describe the wave in the time domain as a summation of so many different amplitudes and frequencies. That's where the Fourier Transform comes in handy. As an example, look how a waveform composed of 6 sinewaves (in red) can be easily represented in the frequency domain (in blue). Both domains contain the same information, but the frequency domain presents it in a much cleaner way.

## Back to Shazam

Shazam compares the piece of sound you recorded with the different songs it has in its database and returns the song with the greatest matching ratio. It may sound simple, but keep in mind that Shazam has data on over 11 million songs on its servers. The search would take a very long time to succeed if this comparison weren't optimized.

That said, we need to generate some elements from our sound record that can be rapidly compared with the elements on Shazam's server side, which were generated in the same way. We'll call these elements fingerprints.

A little schema of how Shazam does its job

A few paragraphs above, we discussed how a waveform can be decomposed into sinewaves, which can be written in terms of their frequencies just by applying the Fourier Transform. Suppose you want to do that to your music record just to see what you get.

Along the way, you'll see there's a problem. Remember how the waveform seen by a computer is a plot full of dots instead of a continuous curve? For that reason you cannot apply the Fourier Transform as described above, since the integral operation isn't defined for discrete functions.

Because of that, instead of using the 'conventional' Fourier Transform, we'll use its discrete counterpart, called the Discrete Fourier Transform, or simply DFT, whose formula is:

$$X(n) = \sum_{k=0}^{N-1} x[k]\, e^{-2 \pi i k n / N}$$

Where:

- $$X(n)$$ is the frequency-domain value of the n-th bin

- $$N$$ is the window size

- $$x[k]$$ is the time domain wave function you want to transform

Notice some new parameters appeared here. That's because we are going to apply the DFT several times, on consecutive parts of the soundwave, and not on the entire wave at once as we did in the first example. This gives us the frequency spectrum at each small period of time, which we will call a chunk. In other words, we'll obtain lots of Amplitude x Frequency graphs ($$X(n)$$), one for each chunk.

Here we highlighted a chunk with 0.5s of sound and its frequency spectrum. There will be a pair of these for each 0.5s of music. Notice we have multiple different frequencies here, unlike our earlier example.

The parameter that defines the length of this chunk is the upper-case N, called the Window Size. The greater the window size (i.e. the greater the length of waveform captured), the larger our chunks will be and the fewer Amplitude x Frequency graphs (frequency spectrums) will be generated.

A window, whose size must be a power of 2, should be neither too large nor too small to correctly generate the spectrums. A large window size yields fewer spectrum samples, but is more accurate in determining their frequencies. The opposite holds for small window sizes. A good window size is between 1024 and 4096.

These operations can easily be done on a computer with the help of Ruby or Python math modules. Further on, we'll take some time optimizing some algorithms to do the calculations faster.
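As a sketch in Python, here is the naive O(N²) form of the DFT applied to a hypothetical 8-sample chunk:

```python
import math
import cmath

def dft(x):
    """Direct Discrete Fourier Transform: X(n) = sum_k x[k] * e^(-2*pi*i*k*n/N)."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * n / N) for k in range(N))
            for n in range(N)]

# A pure 1-cycle cosine over an 8-sample window puts all its energy in bins 1 and 7
chunk = [math.cos(2 * math.pi * k / 8) for k in range(8)]
spectrum = [abs(v) for v in dft(chunk)]
print([round(s, 3) for s in spectrum])  # [0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]
```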

## What do you mean by 'accurate'?

In the last paragraph I mentioned accuracy when windowing a sound, saying wider windows provide greater exactitude in frequencies. To understand that, we'll go a little deeper into Fourier transforms.

For understanding purposes, let's assume the piece of sound you recorded is a continuous wave. As we saw, we need to take the Fourier Transform of a part of that piece of sound. Since we are assuming the sound is continuous, we can multiply the function (i.e. the sound) by a Heaviside-style step function to make it equal 0 everywhere except on the desired part. We can then apply the original Fourier Transform to it.

Note: if it still sounds confusing, I recommend spending some minutes reading the Heaviside function article. Its basics are very easy to understand.

Denoting the piece of sound by $$S(t)$$, the Heaviside function by $$H(t)$$, the part of the piece of sound by $$PartOfS(t)$$ and the Fourier Transform by $$\mathcal{F}$$, we have:

$$PartOfS(t) = S(t) \times H(t)$$

Applying the Fourier Transform on both sides:

$$\mathcal{F}[PartOfS(t)] = \mathcal{F}[S(t) \times H(t)]$$

By the convolution theorem:

$$\mathcal{F}[PartOfS(t)] = \mathcal{F}[S(t)] \bigotimes \mathcal{F}[H(t)]$$

The convolution is a mathematical operation just like multiplication or division, but there's no need to understand it deeply. The point here is to show that the result of the Fourier Transform depends on the function that multiplies $$S(t)$$. If we happen to choose a function other than Heaviside's, the result of the Fourier Transform will be different.

At first sight, it makes no sense to choose a function other than Heaviside's. After all, we just want to make $$S(t) = 0$$ at every point other than the desired ones, right?

Actually not. If $$PartOfS(t)$$ is too small, an undesired phenomenon called spectral leakage can occur, and some undesired frequencies may appear on the plot. In this case, it helps to strategically choose the windowing function. Three examples other than the step function are the Hamming window, the Kaiser window and the Hann window.

Each window has its peculiarities, but in general we can say a Fourier Transform using the Hamming window provides consistent results for the majority of signals.
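As an illustration, here is a numpy sketch with a hypothetical tone that falls between two bins (the worst case for leakage), comparing how much energy escapes far from the true peak with the step window versus the Hamming window:

```python
import numpy as np

fs, N = 11025, 1024
k_tone = 100.5                        # a tone between bins 100 and 101: worst-case leakage
t = np.arange(N) / fs
x = np.sin(2 * np.pi * (k_tone * fs / N) * t)

rect = np.abs(np.fft.rfft(x))                  # step (rectangular) window
hamm = np.abs(np.fft.rfft(x * np.hamming(N)))  # Hamming window

# Relative energy that leaked far away from the true peak (bins 200..400)
leak_rect = rect[200:400].max() / rect.max()
leak_hamm = hamm[200:400].max() / hamm.max()
print(leak_rect, leak_hamm)  # the Hamming figure is far smaller
```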

Back to accuracy, we saw that if you partition a piece of sound too much (a narrow window), spectral leakage is more likely to occur. That's one reason why you can't reduce a window size too much.

Also, it is possible to show (although it is not as immediate as you may think) that, when applying the DFT, it is impossible to distinguish two frequencies closer than $$AudioSampleRate/WindowSize$$. Therefore this parameter, called the frequency resolution, gets better as the window size widens. For example, if we sample a 44.1 kHz song with a window size of 1024, it will be impossible to distinguish two of its frequencies closer than 43.07 Hz.

Since this occurs, we must think not in a continuous range of frequencies, but in a discrete world of frequencies, with various bins containing small ranges of them. From the example above, the first bin will contain frequencies that range from 0 to 43.07Hz, the second one will contain frequencies that range from 43.07Hz to 86.14Hz and so on.

By the way, 43.07 Hz is a very bad frequency resolution. We must either increase the window size or lower the audio sample rate to get a better frequency resolution (i.e. a lower frequency range per bin).

## What does the DFT give us?

When applying the DFT to a chunk (a small portion of the piece of sound you recorded), you'll end up with a bunch of complex numbers, which give you information on both the amplitude and the frequency content of that portion. These numbers give a general profile of that tiny sound, which will be used to build our fingerprints. Repeat that for the entire piece of sound and you'll obtain many more complex numbers, containing lots of information about your record.

The fact is, those complex numbers contain enough information to create matching criteria against the records in Shazam's database. In the next step we'll study an efficient way to manipulate them and generate our fingerprints.

## Fingerprinting and target zones

For each chunk, we'll have $$\frac {N}{2}$$ bins in the form of complex numbers (remember the Nyquist rule?). Therefore, each chunk will contain a bunch of elements with amplitude and frequency information. We can create some rules to condense the information most significant to our ears. Keep in mind that, to the human body, frequency represents the tone and amplitude represents the loudness of a sound.

Using a window size of 1024 with a sound sampled at 11025 Hz, for example, we'll have 512 complex numbers for each chunk. We can then filter out the less significant frequencies and the lower amplitudes. This will be done as follows:

• Create an array containing 6 empty arrays inside it. This array will be called 'bands', and will contain the following:

• The first one containing chunk$$[A(0)..A(9)]$$, representing the amplitudes of the 10 first bins of frequencies (5.38 Hz to 102.18 Hz, i.e. $$\frac {FrequencyResolution}{2}$$ to $$9 \cdot FrequencyResolution + \frac {FrequencyResolution}{2}$$);

• The second one containing the elements chunk$$[A(10)..A(19)]$$, representing the next 10 bins of frequencies (107.56Hz to 204.49 Hz);

• The third one containing chunk$$[A(20)..A(39)]$$;

• The fourth one containing chunk$$[A(40)..A(79)]$$;

• The fifth one containing chunk$$[A(80)..A(159)]$$;

• The sixth one containing chunk$$[A(160)..A(511)]$$;

• Keep the greatest amplitude value of each inner array and cut off the rest. Also, make sure to store the indexes of those amplitudes in a separate array;

• Take the average of those amplitudes;

• In the index array, keep only the indexes whose amplitudes are above that average.
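Under the 1024-window, 11025 Hz assumptions above, the steps can be sketched like this (function and variable names are mine):

```python
def strongest_bins(amplitudes):
    """Keep, per logarithmic band, the strongest bin; then drop bands below average."""
    edges = [0, 10, 20, 40, 80, 160, 512]          # the six bands described above
    bands = [amplitudes[a:b] for a, b in zip(edges, edges[1:])]
    maxima, indexes = [], []
    for start, band in zip(edges, bands):
        best = max(range(len(band)), key=lambda i: band[i])
        maxima.append(band[best])                  # greatest amplitude of the band
        indexes.append(start + best)               # its index in the whole chunk
    average = sum(maxima) / len(maxima)
    return [idx for idx, amp in zip(indexes, maxima) if amp > average]

# Hypothetical chunk: a loud low tone (bin 5) and a strong mid tone (bin 70)
chunk = [0.1] * 512
chunk[5], chunk[70] = 9.0, 7.5
print(strongest_bins(chunk))  # [5, 70]
```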

Notice we are 'aliasing' the chunk by the average value of the frequencies contained in it. As we don't know exactly which ones are inside it, that's the most sensible thing to do.

Also, the weird scale in the first item exists because the human ear has logarithmic perception, so we condense higher frequencies together in wider arrays. The following 3 steps ensure we keep only the most significant sound.

After that we'll end up with the index array containing up to 6 values. Those indexes carry the frequencies whose amplitudes were the greatest of their bands. To obtain their numerical value, just compute $$(Index + \frac{1}{2}) \times \frac {SampleRate}{N}$$ and you are good to go.

Notice this process will be done once for each chunk, so you'll have an array with up to 6 values for each discrete time period. Our goal now is to have a frequency x time function. Here is an example of how it will look:

5 relevant frequencies on the first chunk, 2 on the second one, 1 on the third one and so on...

Let's recapitulate what we've done so far. We applied the DFT to the original sound, which took us from the time domain to the frequency domain. Graphically, we had an Amplitude x Time plot and went to an Amplitude x Frequency plot. The final step described here took us from the Amplitude x Frequency plot to a Frequency x Time plot. Notice that at this point we are getting very comparison-friendly metrics.

The next step is to make them time efficient.

## Time efficiency

Imagine the scatterplot on the left side is actually a fingerprint of some song in Shazam's database and the one on the right is your recorded sound's fingerprint.

Two identical 4-dot sets

Eventually, we could make the right plot 'travel' through the left plot and, when the dots overlay, say there's a fingerprint match. This would be very inefficient though, since we would have to make lots of comparisons and repeat that for 11+ million different songs.

Let's then, create some rules to make this more efficient.

The base article suggests the concept of target zones. They will help us make those comparisons faster.

The rules that define a target zone are the following:

• A point on the graph is related one-to-one to a target zone. The point that owns a target zone is called its anchor;

• The points will be enumerated from top to bottom, from left to right;

• A point's target zone consists of the five points starting at the (point index + 3)th point. If it's impossible to take five points, the set will be composed of as many points as are available.

• If the rule above is impossible to apply to a given point, that point has no target zone.

Points 2 and 5 and their respective target zones.

Then we'll make associations between the anchor and each one of its target zone points, so each anchor will have at most 5 associations. This will be done both on the server side, for the points of all those 11+ million songs, and on the client side, for the piece of sound you recorded. These associations will have the form 'key' => 'value' (a hashtable!).

On both the server and the client side, each key of an association will correspond to an array containing the following information:

$$[f(point), f(anchor), \Delta t(point,anchor)]$$, where $$f$$ denotes frequency and $$\Delta t$$ denotes time interval

Notice that we used $$\Delta t$$ and not an absolute time as one of the identifiers. This assures our search will be time independent.

The values, though, are going to be slightly different. On the server side, each value will be an array containing the following:

$$[t(anchor), SongId]$$, where $$t$$ denotes the absolute time and $$SongId$$ denotes the ID of the song in the database.

On the client side, the value will just be $$t(anchor)$$.

So our associations will be:

On the server side: $$[f(point), f(anchor), \Delta t(point,anchor)] => [t(anchor), SongId]$$

On the client side: $$[f(point), f(anchor), \Delta t(point,anchor)] => t(anchor)$$

Let's take an example:

Supposing the scatterplot above is from Michael Jackson's 'Beat It' and its ID in Shazam's database is 2243 (just a guess), the associations for point 2 are:

• $$[2, 2, 1] => [1, 2243]$$
• $$[6, 2, 1] => [1, 2243]$$
• $$[1, 2, 2] => [1, 2243]$$

• $$[2, 2, 3] => [1, 2243]$$

• $$[9, 2, 4] => [1, 2243]$$

Just keep in mind that all the frequencies and times in these associations are fictional. 1~9 Hz are really low frequencies that the human ear can't even notice. In practice, those values range from 20 Hz to 20 kHz.

1s time steps are also too big. We generally work with steps of 0.064~0.256 seconds.

The set of all associations from every point makes up a song's fingerprint. We finally got to them!

Notice how, in the keys, we identify pairs of frequencies (the point's relative to the anchor's) with their respective distance in time, instead of taking individual points with their absolute time. This makes our associations much more unique, which speeds up our search while generating fewer false positives.

If we consider a step of 0.128s and an average of 3 significant frequencies per time unit, a 3-minute song will have a fingerprint with about 27000 associations. This may sound big, but we'll see it is very efficient and much faster than the 'sweeping method'.
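Under the rules above, building the association table can be sketched as follows (the point list and song ID are hypothetical; points are (time, frequency) pairs already ordered by the enumeration rule):

```python
from collections import defaultdict

def build_fingerprint(points, song_id):
    """points: list of (time, freq) pairs, already ordered by the enumeration rule.
    Each anchor is paired with up to 5 points starting 3 positions ahead of it."""
    table = defaultdict(list)
    for i, (t_anchor, f_anchor) in enumerate(points):
        target_zone = points[i + 3 : i + 8]        # up to 5 points per anchor
        for t_point, f_point in target_zone:
            key = (f_point, f_anchor, t_point - t_anchor)   # time-independent key
            table[key].append((t_anchor, song_id))          # server-side value
    return table

# Hypothetical points for a made-up song with id 2243
points = [(1, 2), (1, 6), (2, 1), (2, 5), (3, 2), (3, 7), (4, 9), (4, 1)]
fp = build_fingerprint(points, 2243)
print(len(fp))          # number of distinct keys
print(fp[(5, 2, 1)])    # [(1, 2243)]: the first point anchored to point (2, 5)
```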

## Matching criteria

As we record a piece of (unknown) music, most of the keys from the record are going to be the very same as those in our database fingerprint. If the keys on the database are indexed, we just need to fetch all the IDs containing each key. Do that for every key in the record and you'll end up with a list of candidate songs for that record.

Obviously, the song with most key matches is the most likely to be the recorded one. Although this will work for lots of records, it's not guaranteed to always happen. For that reason, it is sensible to think of some scoring methods to optimize our match.

## Scoring methods

So far, we have 95% of our job done. But notice there's still a piece of information in our fingerprints we didn't use: on both the client-side and the server-side fingerprints, the absolute time was kept untouched. Let's see how we can take advantage of this number.

Remember our first statement for target zones:

• A point on the graph is related one-to-one to a target zone.

That means an association key is very unique inside a song's fingerprint. Although it's possible for a song to contain the exact same key in its fingerprint with a different value associated to it, this won't happen very often. Therefore, a pair of different associations is very likely to have an equal time distance (i.e. the absolute time of the first association subtracted from the absolute time of the second association) both in the cellphone-generated fingerprint and in Shazam's fingerprint.

Notice we do two time-related comparisons: one between point and anchor (the third value in the key array), which will appear several times in different associations but becomes very unique inside the trio, and another between two different associations, using their absolute times, which is even more unique. This way we are creating a very powerful filter against false positives.

Let’s then create a simple scoring criteria:

• For each association match between the record-generated fingerprint and Shazam's database fingerprint, we score $$(+1)$$ for the referred song;

• For each pair of matched associations, take the time distance between them (one for the association pair on the cellphone record and one for the association pair in Shazam's DB). If they are equal, we score $$(+1)$$. If they are different, we score $$(-1)$$.

• For the scoring process above, remember to 'walk' through pairs of associations. For example, if we got a match for $$([1, 2, 3], [2, 2, 1], [3, 4, 3], [1, 1, 1], [2, 1, 2])$$, you could take [1, 2, 3] and [2, 2, 1], then [2, 2, 1] and [3, 4, 3], then [3, 4, 3] and [1, 1, 1], and so on. Avoid keeping the same first key while changing the second one, because a static key could eventually be a false positive; this would end up scoring multiple $$(-1)$$s and ruining our method.

• For each streak like the one above, score $$(+1)$$;

Remember that time is quantized when windowing the wave, so the second comparison is very safe. Also, we can always change those methods to fit our needs and get better results.
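A minimal sketch of this scoring, assuming the matched associations have already been paired up and sorted by record time (the pairing layout is my assumption, not Shazam's):

```python
def score_song(record_times, db_times):
    """record_times / db_times: anchor times of matched associations, aligned
    pairwise (record_times[i] matched db_times[i]), sorted by record time."""
    score = len(record_times)                 # +1 per association match
    for i in range(len(record_times) - 1):    # walk through consecutive pairs
        dt_record = record_times[i + 1] - record_times[i]
        dt_db = db_times[i + 1] - db_times[i]
        score += 1 if dt_record == dt_db else -1
    return score

# Hypothetical: 4 matches whose time spacing agrees everywhere except one pair
print(score_song([0, 1, 3, 4], [10, 11, 13, 15]))  # 4 + 1 + 1 - 1 = 5
```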

## Optimizing our fingerprinting

Before we start coding, let's take a look at some optimizations we can do to increase performance when searching Shazam's database.

Downsampling

Before generating our fingerprints (on both the server and client sides), it's a good idea to artificially reduce the sampling rate of the audio records. This will allow us to use a steady frequency resolution while reducing the number of associations per fingerprint. As a bonus, it will take less time to upload your record to Shazam's servers and it will consume less memory on your smartphone.

As an example, we can downsample a 44.1 kHz song to about 11 kHz; keeping the same window size, this makes the frequency resolution 4 times finer than its original value. Although this alteration dramatically affects the sound quality, the dominating frequencies stay the same and the fingerprints can be generated normally. To do this, we just take the average of each set of 4 consecutive amplitudes instead of taking them all.
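A sketch of that 4-to-1 averaging, in plain Python with a made-up sample list:

```python
def downsample(samples, factor=4):
    """Average each group of `factor` consecutive samples (44.1 kHz -> ~11 kHz)."""
    trimmed = samples[: len(samples) - len(samples) % factor]  # drop the remainder
    return [sum(trimmed[i : i + factor]) / factor
            for i in range(0, len(trimmed), factor)]

print(downsample([1, 1, 1, 1, 2, 2, 2, 2, 8, 8]))  # [1.0, 2.0] (trailing 8s dropped)
```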

An alternative to the DFT: The Cooley-Tukey algorithm

Let's get back to the DFT algorithm:

Suppose we chose a 1024 window size to apply a DFT on a 11025Hz music with duration of 3 minutes. We'll have then a (1024/11025) = 0.093s time interval between each chunk, which gives us (180/0.093) = 1936 chunks.

For each bin, we'll do (1024−1) additions and 1024 multiplications, totaling 2047 operations per bin. Do that 1024 times (once for each bin in the window), and multiply that by 1936 chunks: you'll end up with more than 4 billion operations per song.

Even with a fast home PC, doing this for your entire 5000-song iPod library would take days. This doesn't seem very efficient…

Fortunately, there's a faster way to do it, called the Cooley-Tukey method, or simply the Fast Fourier Transform (FFT). It takes advantage of the periodicity of sines and cosines to recursively calculate complex values so they can be reused in further calculations.

There's no need to go too deep here. This will take some math to understand, and if you want to skip this part, go ahead. Just keep in mind there is an optimal way to calculate the DFT, and we will be using it when coding later.

First we will split the original DFT into 2 parts: one containing just the odd components and one containing the even ones. Let's start with the even part.

To take the even values of the summation, just let $$k = 2m$$. As we also want 0 to be the lower bound of summation (just to keep the notation), we keep the index of summation $$m$$ and modify the upper bound of summation to $$(\frac {N}{2} - 1)$$. Thus, we get:

$$X(n)_{even} = \sum_{m=0}^{N/2-1} x[2m]\, e^{-2 \pi i n (2m)/N}$$

For the odd values, let $$k = 2m + 1$$. We get:

$$X(n)_{odd} = \sum_{m=0}^{N/2-1} x[2m+1]\, e^{-2 \pi i n (2m+1)/N}$$

This way,

$$X(n) = X(n)_{even} + X(n)_{odd}$$

If we strategically factor out $$e^{-2 \pi i n/N}$$ (from now on we'll call it the 'twiddle factor') from $$X(n)_{odd}$$, we get:

$$X(n) = \sum_{m=0}^{N/2-1} x[2m]\, e^{-2 \pi i n m/(N/2)} + e^{-2 \pi i n/N} \sum_{m=0}^{N/2-1} x[2m+1]\, e^{-2 \pi i n m/(N/2)}$$

Giving the first summation the alias $$\bar X(n)_{even}$$ and the second one $$\bar X(n)_{odd}$$, we end up with:

$$X(n) = \bar X(n)_{even} + e^{-2 \pi i n/N}\, \bar X(n)_{odd}$$

Because of the periodicity of the DFT:

$$\bar X(n + \tfrac{N}{2})_{even} = \bar X(n)_{even} \qquad \bar X(n + \tfrac{N}{2})_{odd} = \bar X(n)_{odd}$$

Back to the twiddle factor, just by applying the distributive property we can see that:

$$e^{-2 \pi i (n + N/2)/N} = e^{-2 \pi i n/N}\, e^{-\pi i} = -e^{-2 \pi i n/N}$$

Thus:

$$X(n + \tfrac{N}{2}) = \bar X(n)_{even} - e^{-2 \pi i n/N}\, \bar X(n)_{odd}$$

The power of this method lies on these two expressions.

Also, for simplicity purposes, let's define another function $$F$$:

$$F(n) = \begin{cases} \bar X(n/2) & \quad \text{if } n \text{ is even}\\ \bar X((n+1)/2) & \quad \text{if } n \text{ is odd}\\ \end{cases}$$

And supposing the window size is N = 8:

Now $$X(0)$$ can be written in a different way:

Notice how F depends on $$\bar X(n)_{even}$$ and on $$\bar X(n)_{odd}$$. Remember that both of them contain half of the operations of the original DFT.

Also notice how we must calculate $$X(0), X(1), ..., X(7)$$ to obtain the full spectrum.

Now take a look at the highlighted expression above. If we take, let's say, $$X(4)$$:

Both $$X(0)$$ and $$X(4)$$ use $$F(0)$$, $$F(1)$$ and the same twiddle factor. That means we can get two DFT values at once with the same calculations. Those also contain half of the operations of the original DFT! Take a look at the diagram below:

Because of its operation schema, this process is also known as 'butterfly multiplication'.

This method has a time complexity of O(N·log(N)), which for large N is much faster than the original DFT.
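The whole idea can be condensed into a short recursive sketch. This is a generic radix-2 Cooley-Tukey implementation in Python, assuming the window size is a power of two (the function name and list-based representation are illustrative choices, not code from the article):

```python
import cmath

def fft(signal):
    """Radix-2 Cooley-Tukey FFT: split into even/odd-indexed halves,
    recurse, then combine with the two butterfly expressions.
    Assumes len(signal) is a power of 2."""
    n = len(signal)
    if n == 1:
        return list(signal)
    even = fft(signal[0::2])   # DFT of the even-indexed samples
    odd = fft(signal[1::2])    # DFT of the odd-indexed samples
    spectrum = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n)  # the 'twiddle factor'
        spectrum[k] = even[k] + twiddle * odd[k]           # X(k)
        spectrum[k + n // 2] = even[k] - twiddle * odd[k]  # X(k + N/2)
    return spectrum
```

Each level of recursion does O(N) butterfly work and there are log₂(N) levels, which is where the O(N·log(N)) bound comes from.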

Phew, that was a lot of information! Some things may have remained unclear, but everything will make sense when we see a practical application. For the moment, congrats on your patience in reaching the end of this article :)

Amazon Elastic File System overview
published this on

Events

Amazon Web Services (AWS) announced in April of this year, during AWS Summit San Francisco, a new storage service: the Amazon Elastic File System (EFS). This product, still in preview, will be launched in June and will initially be available in the US West (Oregon), US East (N. Virginia) and EU (Ireland) regions.

## AWS storage portfolio

AWS already has three other storage services: Amazon Simple Storage Service (S3), Amazon Elastic Block Store (EBS) and Amazon Glacier. Let's briefly compare these services with the brand new EFS.

AWS storage services comparison:

| Service | Storage type | Access / latency | Cost | Access freq. |
| --- | --- | --- | --- | --- |
| Amazon S3 | Objects | REST APIs / low latency | $0.030 per GB-month | High |
| Amazon EBS | Blocks | Single EC2 instance / lowest latency | $0.100 per GB-month | High |
| Amazon Glacier | Archive | REST APIs / high latency | $0.010 per GB-month | Low |
| Amazon EFS | File system | Multiple EC2 instances / low latency | $0.300 per GB-month | High |

## Amazon EFS features

EFS' main features are that the file system is elastic, scalable and can be shared among thousands of EC2 instances. The sharing is very simple and fully managed. It's simple to use because it provides standard file system semantics and acts like any other file system you can mount on an EC2 instance. You can mount it on many instances in an Amazon region, and it will automatically be shared with every other instance on which you mounted it. This way you get high durability and availability.

EFS is completely scalable and elastic and can grow to petabyte scale. The elasticity is automatic, which means you don't need to request more or less disk space as your usage changes. There is no minimum storage requirement to start using EFS.

Another important feature is its high performance. Being SSD-based, EFS is designed for high throughput, high IOPS and low latency. These characteristics scale automatically as your file system grows.

## EFS security

EFS provides tight control over access to stored data. EFS security has 3 layers:

1. Network: Only EC2 instances inside the same VPC can mount the file system.
2. Security groups: It's possible to restrict which security groups can access the EFS.
3. User: Common file system permissions can be applied to folders and files inside EFS.

## How does it work?

To access EFS files from an EC2 instance, you must mount the file system. You create mount targets in each availability zone. Mount targets are EFS endpoints, each with an IP address and a DNS name. When you create an EFS file system on AWS, you receive the DNS names for each AZ; these names are used when mounting the file system on your EC2 instances. The figure shows the structure of the EFS workflow.
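As a rough sketch of what that mounting step might look like once EFS launches (the file system ID, region and availability zone below are made-up placeholders, and the exact DNS format may differ; EFS exposes a standard NFSv4 interface):

```shell
# Install an NFS client on the instance (Amazon Linux / yum example)
sudo yum install -y nfs-utils

# Create a mount point and mount the EFS endpoint for this instance's
# availability zone. fs-12345678, us-west-2 and us-west-2a are
# hypothetical placeholders for your own file system ID, region and AZ.
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 us-west-2a.fs-12345678.efs.us-west-2.amazonaws.com:/ /mnt/efs
```

From that point on, files written under /mnt/efs by one instance would be visible to every other instance that mounted the same file system.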

EFS is still in preview; as soon as we have access to it, we will demonstrate the tool.

Unsupervised learning shows which senators vote alike
published this on

Data Science Programming

In this article, I go into the details of how to write a program capable of grouping Brazilian senators according to their votes: all the way from getting the necessary data, through some code and math, to the Senate's inner circles.

## Introduction

Brazil's legislative scene is complex and ever-changing. A basic comprehension of how our representatives behave is usually challenging even for the most well-informed of us. To the ordinary citizen, this information may never arrive, and when it does, it's usually uninterpretable or heavily editorialized.

Luckily, the law mandates our government's transparency and accountability; that valuable information can be found on online portals like the Senate's. Don't get too excited, though, as access to information is another right we've been struggling to fully secure in Brazil.

I decided to start this project during my internship at Infosimples in order to extract impartial yet meaningful conclusions from politicians' votes. It was worth the effort, since it put to the test some concepts from Udacity's Intro to Artificial Intelligence course (it also brought curious readers like you! Leave some feedback later).

The greatest invention of life
published this on

Running a Company

I've always felt that death is the greatest invention of life...
without death, life didn't work very well
because it didn't make room for the young.

It (the young) didn't know how the world was fifty years ago.
It didn't know how the world was twenty years ago.
It saw it as it is today, without any preconceptions,
and dreamed how it could be based on that.

We're not satisfied based on the accomplishment of the last thirty years.
We're dissatisfied because the current state didn't live up to their ideals.
Without death there would be very little progress.

One of the things that happens in organizations as well as with people
is that they settle into ways of looking at the world and become satisfied with things
and the world changes and keeps evolving and new potential arises
but these people who are settled in don't see it.

That's what gives start-up companies their greatest advantage.
The sedentary point of view is that of most large companies.

Amazon Web Services launches Frankfurt (Germany) region
published this on

Events

Amazon Web Services (AWS) launched today their second region in Europe, this time in Frankfurt, Germany. The new region already supports many services offered by AWS worldwide, including the traditional Elastic Compute Cloud (EC2), Elastic Block Store (EBS), Auto Scaling, Elastic Load Balancing and Relational Database Service (RDS). Moreover, three new edge locations were added in Frankfurt for Route 53 and CloudFront.

The Ireland and Frankfurt regions can now be used together to build multi-region applications with the assurance that all data is hosted in data centers within the European Union. The launch of a region on German soil is probably related to the restrictive data protection laws that companies in that country have to deal with, which limit what can be stored in data centers outside the country.

What is Data Matching?
published this on

Data Science

Data Matching is the task of finding records that refer to the same entity. Normally, these records come from multiple data sets and have no common entity identifiers, but data matching techniques can also be used to detect duplicate records within a single database.

Data Matching is also known as Record Linkage, Object Identification or Entity Resolution.

Identifying and matching records across multiple data sets is a very challenging task for many reasons. First of all, the records usually have no attribute that makes it straightforward to identify the ones that refer to the same entity, so it is necessary to analyze attributes that provide partial identification, such as names and dates of birth (for people) or titles and brands (for products). Because of that, data matching algorithms are very sensitive to data quality, which makes it necessary to pre-process the data being linked in order to ensure a minimal quality standard, at least for the key identifier attributes.
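As a toy illustration of the idea (a sketch of the general technique, not any specific product's algorithm), here is a small Python example that links person records from two sources by fuzzy-comparing names and requiring exact agreement on the date of birth, using the standard library's difflib. The record layout and threshold are assumptions made for the example:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_records(source_a, source_b, threshold=0.85):
    """Pair records whose names are similar enough and whose birth
    dates agree exactly (the 'key identifier attributes')."""
    matches = []
    for rec_a in source_a:
        for rec_b in source_b:
            if (rec_a["birth"] == rec_b["birth"]
                    and similarity(rec_a["name"], rec_b["name"]) >= threshold):
                matches.append((rec_a, rec_b))
    return matches
```

Real data matching systems replace the quadratic all-pairs loop with blocking/indexing strategies and use more robust comparators, but the core pattern of scoring partially identifying attributes is the same.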

Semantic Analytics with Google Cloud Dataflow (webinar)
published this on

Data Science Events

Thursday, September 18, 9 AM PST

Google is working on a new service called Google Cloud Dataflow. It will enable large-scale processing of real-time streaming data, like tweets, activities or events. It may be offered integrated with other Google Cloud services, like Cloud Storage, BigQuery and Pub/Sub.

Eric Schmidt, Solutions Architect at Google, will present Google Cloud Dataflow and share how he analyzed the sentiment of millions of tweets streaming real-time from the World Cup in order to track fan response to events happening on the field. He'll show how he calculated average sentiment per team and per match, correlated sentiment with keywords of turning points in the games, and tied it all together with a timeline visualization that lets you track how global fans feel throughout the match.

You can register for the event through the link: