String Segmentation using Machine Learning


Artificial intelligence has brought many improvements to problems related to language processing, especially in the field of language translation. Other examples include suggestions to correct misspelled words and even speech recognition.

To test a simple algorithm that works with elements of a language, the program developed here has one objective: to split a string without spaces into the words that compose it. For instance, receiving the string "thisishowitshouldwork" should result in "this is how it should work".

The difficulty lies in identifying words, since it's impossible to list every known English word and there is no model that correctly generalizes the structure of every word.

The approach was to use supervised learning, with a database of texts as training data, so that the agent could start to identify the different words that exist in the English language. By using texts instead of just a list of thousands of different words, it's possible to measure the frequency with which each word tends to appear. Knowing the probability of each word showing up, the agent should be able to calculate the most probable segmentation of any string.
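As a minimal sketch of that training step (assuming the corpus is a single plain-text file; "corpus.txt" is just a placeholder name), counting word frequencies could look like this in Python:

```python
import re
from collections import Counter

def train(corpus_path):
    """Count how often each word appears in a plain-text training corpus."""
    with open(corpus_path, encoding="utf-8") as f:
        text = f.read().lower()
    words = re.findall(r"[a-z']+", text)  # crude tokenization: runs of letters/apostrophes
    counts = Counter(words)
    total = sum(counts.values())          # T: total number of analysed words
    return counts, total

# counts, total = train("corpus.txt")    # placeholder path for the training texts
```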

It's reasonable to say that the probability of a certain segmentation depends on the probabilities of the words obtained by that segmentation. Considering the words that appear before the one being analysed would make the results even more accurate, but with the relatively small quantity of training data available, only the individual probability of each word was taken into consideration.

Another point that has to be taken into consideration is how to deal with words absent from the training data. No training database contains every known word, no matter how big it is. This issue was addressed with Laplacian smoothing, which prevents the probability of an unknown word from being equal to zero. Finally, the calculations should be normalized so that segmentations with different numbers of words can be compared.

The probability of a single word can be expressed by:

P(word) = (n + k) / (T + kV)

where n is the number of times the word appeared in the training data, T is the total number of analysed words, k is a constant and V is the estimated total number of words in the English vocabulary. In this problem, k = 1 and V = 2,000,000 were used.
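In code, the smoothed probability is a one-liner (reusing counts and total from the sketch above):

```python
K = 1              # smoothing constant k
VOCAB = 2_000_000  # V: estimated size of the English vocabulary

def word_probability(word, counts, total):
    """Laplace-smoothed probability of a single word: (n + k) / (T + kV)."""
    n = counts.get(word, 0)  # n: occurrences in the training data (0 if never seen)
    return (n + K) / (total + K * VOCAB)
```

With the sample size used later in this post (T = 2,697,372), an unseen word gets 1 / 4,697,372 ≈ 2.14e-07, which matches the NULL_PROB shown in the results below.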

The probability of a segmentation can be expressed by:

P(segmentation) = (P(p1) * P(p2) * ... * P(pn))^(1/n)

where p1, p2, ..., pn are the words obtained in a segmentation and the exponent 1/n is the normalization that makes segmentations with different numbers of words comparable.
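A short sketch of that calculation, done in log space to avoid numerical underflow (it reuses word_probability from above):

```python
import math

def segmentation_probability(words, counts, total):
    """Geometric mean of the individual word probabilities."""
    log_sum = sum(math.log(word_probability(w, counts, total)) for w in words)
    return math.exp(log_sum / len(words))  # n-th root of the product of the probabilities
```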

Knowing how the calculations should be done, it's still necessary to deal with hardware limitations. A string with n characters can be split in 2^(n-1) different ways, so the 21-character string "thisishowitshouldwork" alone has 2^20 = 1,048,576 candidate segmentations to evaluate.
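Just to make that explosion concrete, here is a small generator that enumerates every possible split; it is fine for short strings and hopeless for long ones:

```python
def all_segmentations(s):
    """Yield every one of the 2**(len(s) - 1) ways to split s into consecutive pieces."""
    if len(s) <= 1:
        yield [s] if s else []
        return
    for rest in all_segmentations(s[1:]):
        yield [s[0]] + rest                # cut the string right after the first character
        yield [s[0] + rest[0]] + rest[1:]  # or glue the first character onto the next piece

# sum(1 for _ in all_segmentations("thisishowitshouldwork"))  # 1,048,576 candidates
```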

To deal with this problem, the algorithm can be adjusted to always split the string into two pieces, f (the first part) and r (the rest). If the probability of f is X times bigger than the probability of r, the agent has likely found a good segmentation in f, and should then try to segment r as well.

Updating the segmentation probability:  

where X = 10 for strings shorter than 50 characters and X = 15 otherwise, to reduce the number of necessary calculations for longer strings.
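One possible reading of that heuristic in code, as an interpretation of the description above rather than the original implementation (it reuses word_probability and segmentation_probability from the earlier sketches):

```python
def split_factor(s):
    """X = 10 for strings shorter than 50 characters, 15 otherwise."""
    return 10 if len(s) < 50 else 15

def segment(s, counts, total):
    """Greedy recursive segmentation following the f / r heuristic described above."""
    if not s:
        return []
    X = split_factor(s)
    candidates = [[s]]  # fallback: keep the whole string as a single "word"
    for i in range(1, len(s)):
        f, r = s[:i], s[i:]
        if word_probability(f, counts, total) >= X * word_probability(r, counts, total):
            # f looks like a real word compared to the remainder, so recurse on r
            candidates.append([f] + segment(r, counts, total))
    return max(candidates,
               key=lambda words: segmentation_probability(words, counts, total))

# segment("thisishowitshouldwork", counts, total)
```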

Now that it's clear how to calculate the probability of each word and of the different segmentations of any string, the program is ready to be tested.

Starting with the reference string "thisishowitshouldwork", let's see how the program deals with its different segmentations.

Segmentation                     Probability
  t hisishowitshouldwork         1.06e-07
  th is is how it should work    1.57e-05
  thi si show it should work     1.96e-06
  this is how it should work     1.14e-04
Table 1 - Probabilities of different segmentations of the string "thisishowitshouldwork"

It's possible to observe that, as expected, the correct segmentation is the one with the highest probability.

Other results are listed below, where sample_size is the total number of analysed words, vocab is the estimate of the size of the vocabulary, NULL_PROB is the minimum probability, assigned to words that have never been seen before, and split_factor is the variable X from the previous equation. The number following each segmentation is its probability.

<~~ k = 1, sample_size = 2.697.372, vocab = 2000000, NULL_PROB = 2.14e-07 ~~>
Please type a sequence of words to be segmented (or 0 to stop)
idontknowwhatweareyellingabout
i dont know what wear eye ll ing about
2.84e-06
Took 0.008062 seconds to calculate with split factor = 10
thepasswordisnewspaper
the pass word is news paper
2.03e-05
Took 0.002943 seconds to calculate with split factor = 10
itstheendoftheworldasweknowit
its the end of the world as we know it
5.92e-05
Took 0.009799 seconds to calculate with split factor = 10
idontunderstandwhyitdidnotworkthere
i dont under st and why it did not work there
3.55e-05
Took 0.035966 seconds to calculate with split factor = 10
closingtimeopenallthedoorsandletyououtintotheworld
closing time open all the doors and let you out in to the world
6.89e-06
Took 0.096056 seconds to calculate with split factor = 15
asyoucanseeitstillsplitswordsitshouldnt
as you can see it still split sword sit shouldnt
2.14e-06
Took 0.03346 seconds to calculate with split factor = 10
pleasedonttellmewhattodo
please dont tell me what to do
2.07e-05
Took 0.004592 seconds to calculate with split factor = 10
whatdoyouthinkofthenewguy
what do you think of the new guy
3.63e-06
Took 0.004329 seconds to calculate with split factor = 10
ihavetosayididnotexpectthattohappen
i have to say i did not expect that to happen
9.44e-06
Took 0.01048 seconds to calculate with split factor = 10

With less than 2.7 million words analysed, the results show satisfying accuracy. It's possible to notice that some strings were split into unexpected words, like "split sword sit shouldnt" instead of "splits words it shouldnt" and "the pass word is news paper" instead of "the password is newspaper". The main reasons behind those errors are the small quantity of training data used and the choice to consider only the individual probability of each word, disregarding its predecessors. Those segmentations make no sense, but they consist entirely of valid words and were chosen because the combined probability of "split", "sword" and "sit" was higher than that of "splits", "words" and "it". It's even possible that the word "splits", for instance, didn't appear even once in the training data.

Other errors are the segmentations "wear eye ll ing" instead of "we are yelling" and "under st and" instead of "understand". These cases are a consequence of the quality of the training data, since "ll", "ing" and "st" shouldn't be considered words by the program, but were included among the 2.7 million words the agent learned from.

With bigger and better training data and more sophisticated algorithms to calculate the probabilities, the results should be even better.




