Data engineering over the Titanic disaster

Data Science Tutorials

In this article we are going to take part in a Data Science challenge proposed by Kaggle. The challenge consists of analyzing data about the Titanic's passengers and building predictions about their fate on the tragic night of the accident.

One of the prerequisites is having the R environment installed on your machine. I suggest you download RStudio, which is an IDE that greatly speeds things up and helps a lot, especially with variables and graphics. Another choice is to download only the basic environment (mirrors found here:

There is another prerequisite we must meet before we begin analyzing the data: the data itself. Each Kaggle challenge has its own page; on ours, click on 'Data' and download the files 'train.csv' and 'test.csv'. You'll probably be asked to log in before the download starts.

The file 'train.csv' will be used as the training set. Random forests are supervised learning algorithms, i.e. they need to be fed the expected output along with the input data, so we'll use the 'Survived' column inside that file as the output when training our model. After our prediction model is done, 'test.csv' will be used to test it.

Now roll up your sleeves and let's get to work!

Programming with R

Open the environment. Set the working folder path to where both csv files are and load them, just like in the code below.

# Set working folder path
setwd("~/Documents/R")

# Read both files as data frames
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test <- read.csv("test.csv", stringsAsFactors = FALSE)

If you're using RStudio, type 'View(train)' to see the data we just loaded. Otherwise you can use 'head(x, n)' to show the first n rows of data frame x.
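If you want to try head() before loading the real files, here's a self-contained sketch with a tiny stand-in data frame (it mirrors the first two rows of the real training set, with most columns omitted):

```r
# Tiny stand-in for the real training set (only a few columns)
train_demo <- data.frame(
  PassengerId = c(1, 2),
  Survived    = c(0, 1),
  Pclass      = c(3, 1),
  Name        = c("Braund, Mr. Owen Harris",
                  "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"),
  stringsAsFactors = FALSE
)

# head(x, n) shows the first n rows of x
head(train_demo, 1)
```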

Columns inside train.csv

Column       Description
PassengerId  passenger identifier
Survived     survived (1) or not (0)
Pclass       passenger economic status (1 = higher, 3 = lower)
Name         name
Sex          sex
Age          age
SibSp        number of siblings/spouses aboard
Parch        number of parents/children aboard
Ticket       ticket number
Fare         fare price
Cabin        cabin identifier
Embarked     port of embarkation (Cherbourg, Queenstown or Southampton)

Glancing at the table we can note something quite serious: the columns Age and Cabin have a lot of missing values. Thinking about it, both pieces of information seem very relevant: passengers staying near the collision area would have had less chance of survival, especially because the accident happened at night when most of them were asleep; on the other hand, women and children would have had better luck, since the captain issued an order to save them first.

Two other columns that are also incomplete (though to a much lesser extent) are Embarked and Fare. To find the holes, enter 'which(combi$Embarked == "")' and 'which($Fare))'. We are going to plug them now, figuratively speaking. First, estimate the missing fare as the average amount paid by other passengers with the same Embarked, Pclass and Age profile. For both passengers without an Embarked value, assign 'S', since most people embarked there. Before we start working on the data, we should first merge the test and train sets together so our changes are reproduced in both.

# Merge both sets
test$Survived <- NA
combi <- rbind(train, test)

# Fix the missing Fare
combi$Fare[1044] <- mean(combi[combi$Pclass == 3 & combi$Embarked == 'S' & combi$Age > 60, 'Fare'], na.rm = TRUE)

# Fix the missing Embarked values
combi$Embarked[c(62, 830)] <- "S"
combi$Embarked <- factor(combi$Embarked)
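The group-mean idea behind the Fare fix can be sketched on a toy data frame (the values below are invented for illustration):

```r
# Toy data: one fare is missing
toy <- data.frame(Pclass = c(3, 3, 3, 1),
                  Fare   = c(7.75, 8.05, NA, 71.28))

# Fill the hole with the mean fare of the matching Pclass group
hole <- which($Fare))
toy$Fare[hole] <- mean(toy$Fare[toy$Pclass == toy$Pclass[hole]], na.rm = TRUE)
# toy$Fare[3] is now (7.75 + 8.05) / 2 = 7.9
```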

Back to the table: now we see that all the other columns are complete, and that Name is quite interesting. "Why?", I hear you ask. Well, we can group people by their surname and see if we find something useful, or even compare it with the number of relatives aboard (maybe someone got left out accidentally). There is something more, though. Look closely at how the names are structured. Every single one has a title: Master, Mr, Miss etc. We can use the average age of each title group to infer the missing ages!

# Get surname and title from Name
aux <- strsplit(combi$Name, "(,\\s)|[.]")
# strsplit cuts one of the names into too many pieces; fix it with the line below.
aux[[514]] <- aux[[514]][-4]
aux <-, 'rbind', aux)
combi$Surname <- aux[, 1]
combi$Title <- aux[, 2]
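To see what that split produces, here is the same regular expression applied to a single name in the file's format:

```r
name <- "Braund, Mr. Owen Harris"

# Split at ", " (comma plus whitespace) or at a literal "."
parts <- strsplit(name, "(,\\s)|[.]")[[1]]
# parts[1] is the surname, parts[2] the title
parts  # "Braund" "Mr" " Owen Harris"
```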

# Show the titles in a table
table(combi$Title)

A lot of these titles are given to people with a similar socio-economic status. We shall merge some of them after we obtain the average passenger age by title; that should make building our prediction model easier.

# Find the average age by title and assign it to the passengers without one.
for (n in 1:nrow(combi)) {
  if ($Age[n])) {
    combi$Age[n] <- mean(combi$Age[combi$Title == combi$Title[n]], na.rm = TRUE)
  }
}

# Merge similar titles
combi$Title[combi$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
combi$Title[combi$Title %in% c('Capt', 'Don', 'Major', 'Sir', 'Jonkheer')] <- 'Sir'
combi$Title[combi$Title %in% c('Dona', 'Lady', 'the Countess')] <- 'Lady'

# Use as categories
combi$Title <- factor(combi$Title)

Now let's group people with their relatives. The premise is this: bigger families would have had more trouble staying together among the crowd. But wait: some surnames are fairly common, so how can we distinguish actual relatives from other people who just happen to share the surname? Never fear! Just append the family size to the surname; that should fix it.

# Group people by family
combi$FamilySize <- combi$SibSp + combi$Parch + 1
combi$FamilyID <- paste(combi$FamilySize, combi$Surname, sep="")

# Collapse small families into a single category
combi$FamilyID[combi$FamilySize <= 2] <- 'Small'
combi$FamilyID <- factor(combi$FamilyID)

All done! I believe we have enough features to build our prediction model. Begin by splitting combi back into the original train and test sets.

train <- combi[1:891,]
test <- combi[892:1309,]

You should install the package 'party' if you don't have it yet. This package has all the tools we need to build random forests.

install.packages('party')
library(party)
The moment we were all waiting for has finally arrived! Let's build our prediction model using the train set and then test it.

set.seed(456) # Used by the random number generator.
# Keep this number if you want to reproduce our model later!

# Build the prediction model.
fit <- cforest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FamilySize + FamilyID,
               data = train, controls = cforest_unbiased(ntree = 2000, mtry = 3))

# Predict using the test set.
Prediction <- predict(fit, test, OOB=TRUE, type = "response")

# Save to an output file
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "cforest.csv", row.names = FALSE)

We passed the training set to cforest and specified which variables the model should consider. We also passed the number of trees to be built (ntree) and how many variables to sample at each node of a tree (mtry). The last lines of code save the prediction to a file, which will be sent to Kaggle. Do it now.


Our submission scored approximately 80%. Not bad, considering how quick and easy our methods were.

[Figure: our best entry in the Data Science challenge leaderboard]

At the moment, the challenge accepts up to 10 submissions per day, so don't be afraid to investigate other hypotheses and play some more with the data. There's a lot left to be done! For example, we never considered the ticket numbers. Some passengers have the same ticket number but are from different families; probably friends who shared a cabin. Maybe they should be considered relatives too; after all, in the middle of a shipwreck, friends are family.
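As a starting point for the ticket idea, here is a sketch of counting how many passengers share each ticket number (the ticket strings below are just examples in the file's format):

```r
# Toy tickets: two passengers share "PC 17599"
tickets <- c("A/5 21171", "PC 17599", "PC 17599", "STON/O2. 3101282")

# How many passengers hold each ticket number?
shared <- table(tickets)
shared[shared > 1]  # tickets held by more than one passenger
```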

Another point is the location of the cabins relative to the impact region. Even though most passengers are missing this information, we could extrapolate the cabins we know to those passengers' relatives and friends. Inspecting the layout of the ship may also help in locating some passengers according to their class.
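One way to start on the cabin idea is to extract the deck letter, since cabin codes begin with it; a sketch (the cabin strings follow the file's format):

```r
cabins <- c("C85", "", "C123", "E46")

# The first character of a cabin code is the deck letter
deck <- substr(cabins, 1, 1)
deck[deck == ""] <- NA  # passengers with no cabin information
deck  # "C" NA "C" "E"
```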

See what more you can dig up, and don't forget to look at other Kaggle challenges. I'll meet you there, just let me crunch some more numbers into my calculator. Now, where did I put that thing? 
