Artificial Intelligence has brought lots of improvements to problems related with language processing, especially in the field of language translation. Other examples include suggestions to correct misspelled words or even speech recognition.

To test a simple algorithm that works with elements of a language, the developed program had the objective of being able to split a string without spaces into the words that compose that string. For instance, receiving the string "thisishowitshouldwork" should result in "this is how it should work".

The problem consists in being able to identify words, since it's impossible to list all the known english words and there is no model that can correctly generalize the structure of every word.

The approach was to use supervised learning and a database of texts as training data so that the agent could start to identify the different words that exist in the english language. By utilizing texts instead of just a list of thousands of different words, it's possible to identify the frequency in which every word tend to appear. Knowing the probability of each word showing up, the agent should be able to calculate the most probable segmentation of every string.

It's viable to say that the probability of a certain segmentation depends on the probabilities of each word obtained by that segmentation. By considering the words that show up before the one we are analysing, the results could be even more accurate but, with the relatively small quantity of training data available, it was only taken in consideration the individual probability of each word.

Another point that has to be taken into consideration is how to deal with words absent from the training data. It's impossible to say that any training database contain every known word, no matter how big it its. This issue was adressed by using Laplacian Smoothing to avoid unknown probabilities to being equal to zero. At last, the calculations should be normalized so it's possible to compare segmentations between a different number of words.

The probability of a single word can be expressed by:

where *n* is the number of times the word appeared on the training data, *T* the total of analysed words, *k* is a constant and *V* is the estimated total number of words on the english vocabulary. On this problem it was used* k *= 1 and *V *= 2,000,000.

The probability of a segmentation can be expressed by:

where *p1, p2, ... pn* are the possible words obtained on a segmentation.

Knowing how the calculations should be done, it's necessary to deal with the problem of hardware limitations. A string with n characters has 2^{n-1} ways of being split, which would result in over one million calculations just to identify the best segmentation of the string "thisishowitshouldwork".

To deal with this problem, it's possible to adjust the algorithm to always split the string in two pieces, *f* (first part) and *r* (rest). If the probability of *f* is *X* times bigger than *r*'s, the agent has likely found a good segmentation in *f*, and should try to segment *r* as well.

Read more...