# Using Tableau Prep’s new Python integration to predict Titanic survivors

Tableau Prep gains the capability to run Python scripts in version 2019.3, in beta as of this writing. This article will show how to use this capability to solve a classic machine learning problem.

Kaggle.com, a site focused on data science competitions and practical problem solving, provides a tutorial based on Titanic passenger survival analysis.

> **Competition Description**
>
> The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
>
> One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
>
> In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
>
> https://www.kaggle.com/c/titanic/overview/description

The most famous Titanic passengers, Kate and Leo, don’t seem to be on the passenger list. Still, we can use this exercise to figure out whether the movie’s ending was statistically predictable. We know a few things about them: she’s rich and a first-class passenger; he’s poor and a third-class passenger. She’s female, he’s male. They’re both relatively young, but not children. In data science these are called features, and part of the process is figuring out which features most influence the outcome.

The steps followed in this article closely mirror those in the Kaggle Titanic tutorial. Where possible I have accomplished the data-grooming steps in the Preppiest rather than the most Pythonic way. This article is focused on Prep and Python, not on data science, machine learning, or Python best practices.

The shoulders on which I stand in writing this article include Joshua Milligan, bashii-Iwahashi Tomohiro (thank you Google/Chrome for translating), and Tom.

You can find the finished Prep flow file, along with the Python files, on GitHub here. Assuming you already have Tableau Prep >= 2019.3.1 installed, follow the instructions on the TabPy GitHub site to set up TabPy.

## How does Prep work with Python?

The author of a Prep flow adds a script step, then configures the settings of that step to point to the TabPy server, open the appropriate .py file, and call the appropriate function within that file.

The Python code within that file defines the called function, which receives a pandas dataframe (if you’re unfamiliar, think of it as a simple spreadsheet in your computer’s memory) from Prep, does something with it, and returns a dataframe. If the return dataframe is different from the received dataframe, the author must write a second function called get_output_schema. This tells Prep what column names and datatypes to expect. The available datatypes are:

• prep_string()
• prep_int()
• prep_decimal()
• prep_bool()
• prep_date()
• prep_datetime()

For example, the Python file below receives a dataframe from Prep and uses the pandas groupby function to show the mean Survived score by passenger class. As you can see from the get_output_schema function, this .py file returns only two columns to Prep: an int called Pclass and a decimal called Survived. It appears that the classes are reversed, with the richest passengers, the people with the highest survival chances, in Pclass 3.

```python
import pandas as pd

def pclass_survival(df):
    train_df = df.query('TestData == 0')
    return train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

def get_output_schema():
    return pd.DataFrame({
        'Pclass': prep_int(),
        'Survived': prep_decimal(),
    })
```

Yeah, that trailing comma in get_output_schema bothers me too. It doesn’t seem to matter.

An essential part of debugging code is being able to print the value of a variable. We can do that here using the console we used to start TabPy. I’ll insert a print statement into the pclass_survival function above, then observe the output in the TabPy console.

```python
def pclass_survival(df):
    train_df = df.query('TestData == 0')
    print(train_df.tail())  # output for debugging
    return train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
```

The output of the Python print statement appears in the console.

If you have the console visible while you are working on your Prep flow, you may be alarmed to see Prep calling your Python scripts repeatedly. What Prep is actually doing is calling the get_output_schema function in your Python scripts to find out if anything has changed. Unless you make a change to your .py file, Prep won’t repeatedly run your main function. As a Tableau software engineer explained it to me:

> We cache intermediate results if nothing has changed in the scripts or in one of the previous steps. In the console you should be able to see that, as TabPy actually outputs what is being executed. The difference between execution and schema computation is the last couple of lines. In the first case, it will make a call to your execution method; in the second, it’ll call get_output_schema.
>
> A.G.

## Building the Titanic analysis Prep flow

Kaggle breaks the Titanic passenger list into test and train comma-separated value (csv) files. The training file includes a populated “Survived” column, whereas the testing file does not. Because Python scripts called by Prep can only consume and return a single dataframe, and we’ll need access to both datasets in our final prediction script, our first step in Prep is to union the test and train data. To enable Python to tell the difference, we’ll add a “TestData” flag column.
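For reference, the union that this first Prep step performs could be sketched in pandas like this (a sketch only, not part of the actual flow; the inline frames stand in for Kaggle’s train.csv and test.csv):

```python
import pandas as pd

# Stand-ins for Kaggle's train.csv (has Survived) and test.csv (does not)
train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
test = pd.DataFrame({'PassengerId': [3, 4]})

# Flag each dataset so a downstream script can tell them apart
train['TestData'] = 0
test['TestData'] = 1

# Union the two; the test rows get NaN in the Survived column
combined = pd.concat([train, test], ignore_index=True, sort=False)
print(combined['TestData'].tolist())  # [0, 0, 1, 1]
```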

To separate the datasets within each of our python scripts, we use the pandas “query” function to derive a training pandas dataframe, like this:

```python
def predict_survival(df):
    train_df = df.query('TestData == 0')
    ...
```

We will be using a few machine learning algorithms from the scikit-learn library (https://scikit-learn.org/stable/). These algorithms don’t accept strings, and the tutorial converts continuous values into discrete bands, so we have some work to do to transform this data set.

Age may be an important factor in predicting survival, but we have a problem: there are 86 rows in the training data with no age data. Let’s complete that data by filling those nulls with the average age for the passenger’s sex and class. We could do that with a Python script, but we can also do it in two steps with Prep. The two steps are labeled “average age by sex and pclass” and “clean age” in the finished Prep flow.
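For comparison, the same null-filling logic could be written in pandas with a groupby transform (a sketch over made-up rows, not the actual Prep steps):

```python
import pandas as pd

# Made-up rows: one passenger per (Sex, Pclass) group has a missing age
df = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male'],
    'Pclass': [1, 1, 3, 3],
    'Age': [30.0, None, 20.0, None],
})

# Fill each null with the mean age of that passenger's (Sex, Pclass) group
df['Age'] = df['Age'].fillna(df.groupby(['Sex', 'Pclass'])['Age'].transform('mean'))
print(df['Age'].tolist())  # [30.0, 30.0, 20.0, 20.0]
```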

We also need to transform several continuous columns into “buckets” or “bins”. I found this inconvenient in Python, because the binning code used in the tutorial first computes the bands with pd.cut …

```python
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
```

```
   AgeBand        Survived
0  (-0.08, 16.0]  0.550000
1  (16.0, 32.0]   0.337374
2  (32.0, 48.0]   0.412037
3  (48.0, 64.0]   0.434783
4  (64.0, 80.0]   0.090909
```

```python
for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
```

… and then manually hard-codes those bin thresholds into a loop. This isn’t viable for any sort of long-term, automated, maintainable solution.
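If you did want to stay in Python, a more maintainable alternative (my sketch, not the tutorial’s code) is to let pd.cut compute the equal-width bins and return integer codes directly, rather than hard-coding the thresholds:

```python
import pandas as pd

# Made-up ages; pd.cut derives four equal-width bins from the data's
# own min and max, and labels=False returns the bin index directly
ages = pd.Series([5, 20, 40, 55, 70])
age_band = pd.cut(ages, bins=4, labels=False)
print(age_band.tolist())  # [0, 0, 2, 3, 3]
```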

The Preppie way to do this uses two steps. First, an aggregate step adds a max_age column to the flow.

Then, a cleaning step adds the “AgeBands” calculated field below. The “/4”, “/4*2”, and “/4*3” operations divide the age range into four equal-width bands:

```
IF [AgeAugmented] <= ([MaxAge]/4) THEN 0
ELSEIF [AgeAugmented] > ([MaxAge]/4) AND [AgeAugmented] <= ([MaxAge]/4*2) THEN 1
ELSEIF [AgeAugmented] > ([MaxAge]/4*2) AND [AgeAugmented] <= ([MaxAge]/4*3) THEN 2
ELSEIF [AgeAugmented] > ([MaxAge]/4*3) THEN 3
END
```

On the chance that the passenger’s title is statistically significant, we need to group the various titles into five categories. In Python, this step looks like:

```python
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
```

… but in Prep, it’s a simple group and replace within a cleaning step.

The title survival.py script enables analysis of the survival rates by title, again using the Python groupby mean operation.

```python
import pandas as pd

def title_survival(df):
    train_df = df.query('TestData == 0')
    return train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean().sort_values(by='Survived', ascending=False)

def get_output_schema():
    return pd.DataFrame({
        'Title': prep_string(),
        'Survived': prep_decimal(),
    })
```

… and the result looks like this in Prep:

Optimistic news for Kate, but it’s not looking good for Leo. If Captain Smith had married them, both of their chances would have improved.

One common step in this sort of analysis is “feature engineering”, where you combine aspects of multiple features to improve the accuracy of the prediction. In this case we are going to create a new “FamilySize” column by combining the sibling/spouse (SibSp) and parent/child (Parch) columns. The Python function for this couldn’t be simpler:

```python
def family_size(df):
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    return df
```

This function will add the “FamilySize” column to the dataframe. Since the dataframe we are returning differs from the one we received, we need to describe it with a get_output_schema() function. Note that the FamilySize column has been added with an int datatype:

```python
def get_output_schema():
    return pd.DataFrame({
        'TestData': prep_int(),
        'AgeBands': prep_int(),
        'Embarked': prep_string(),
        'Pclass': prep_int(),
        'Survived': prep_int(),
        'Parch': prep_int(),
        'SexNum': prep_int(),
        'TitleNum': prep_int(),
        'Title': prep_string(),
        'Name': prep_string(),
        'SibSp': prep_int(),
        'Fare': prep_decimal(),
        'FamilySize': prep_int(),
        'Sex': prep_string(),
        'Age': prep_decimal(),
    })
```

## Evaluating algorithms

The author of the blog post on which I’m basing this article documents eight different algorithms. We’ll try three here: logistic regression, support vector machines, and random forest. In the logistic regression code, we’ll not only run the regression, but also output the feature correlations:

```python
import pandas as pd

def log_reg(df):
    from sklearn.linear_model import LogisticRegression
    logreg = LogisticRegression()

    train_df = df.query('TestData==0')
    test_df = df.query('TestData==1')

    # move the Survived column from the training data over to a separate dataset
    # because it's the dependent variable we need to pass to the algorithm
    X_train = train_df.drop("Survived", axis=1)
    Y_train = train_df["Survived"]

    # drop the Survived column from the test data since it's empty
    X_test = test_df.drop("Survived", axis=1).copy()

    # tell the logistic regression algorithm to build a model
    # based on the training data
    logreg.fit(X_train, Y_train)
    Y_pred = logreg.predict(X_test)  # predict survival on the test data

    # output the accuracy of the algorithm
    acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
    print(acc_log)
    lr_score = pd.DataFrame(data={'algorithm': ['Logistic Regression'], 'score': [acc_log]})

    # find the coefficient for each feature; coef_ is a 2-D array,
    # so take its first row
    coeff_df = pd.DataFrame(train_df.columns.delete(0))
    coeff_df.columns = ['Feature']
    coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
    coeff_df = coeff_df.sort_values(by='Correlation', ascending=False)
    print(coeff_df)

    return lr_score

def get_output_schema():
    return pd.DataFrame({
        'algorithm': prep_string(),
        'score': prep_decimal(),
    })
```

To step through the code above, first we divide the incoming dataframe into test and training dataframes, as discussed toward the top of this article. Then we move the dependent variable, “survived”, over to a separate dataframe because that’s what the algorithm needs. We then call the “fit” function, which tells the algorithm to build a prediction model based on the training data. Next, we call the “predict” function, telling the algorithm to predict survival based on the test data.

acc_log is an accuracy score. We’ll ask for one of those from each of the algorithms.

coeff_df is a dataframe we build containing a row for each feature and an estimate of how influential that feature is on the dependent variable, survival. Sex is most important, followed by wealth (Fare) and title.
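The support-vector-machine evaluation follows the same pattern as log_reg above. Here is a sketch of what that scoring step might look like (my sketch, not the original script; the train/test preparation is assumed to be identical to the logistic regression code):

```python
import pandas as pd
from sklearn.svm import SVC

def svm_score(X_train, Y_train):
    # fit a support vector classifier on the training data
    svc = SVC()
    svc.fit(X_train, Y_train)
    # score it the same way the logistic regression step is scored
    acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
    return pd.DataFrame(data={'algorithm': ['Support Vector Machines'],
                              'score': [acc_svc]})
```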

## Predicting Survival

Based on the algorithm evaluations, it makes sense to use random forest. Again, this article is more focused on Prep+Python than on pure data science, so this might not be the actual best practice.

```python
import pandas as pd

def predict_survival(df):
    from sklearn.ensemble import RandomForestClassifier

    train_df = df.query('TestData==0')
    test_df = df.query('TestData==1')

    X_train = train_df.drop("Survived", axis=1)
    Y_train = train_df["Survived"]

    X_test = test_df.drop("Survived", axis=1).copy()

    random_forest = RandomForestClassifier(n_estimators=100)
    random_forest.fit(X_train, Y_train)
    Y_pred = random_forest.predict(X_test)

    X_test['survive_pred'] = Y_pred
    return X_test

def get_output_schema():
    return pd.DataFrame({
        'EmbarkedNum': prep_int(),
        'Pclass': prep_int(),
        'FamilySize': prep_int(),
        'AgeBands': prep_int(),
        'SexNum': prep_int(),
        'TitleNum': prep_int(),
        'IsAlone': prep_int(),
        'Fare': prep_decimal(),
        'FareBin': prep_int(),
        'survive_pred': prep_decimal(),
    })
```

This code is structurally the same as the logistic regression code we stepped through before. Rather than calling the logistic regression library, it calls the random forest classifier. It calls the fit function, then the predict function, saving the result to a new column on the test dataframe. That test dataframe is then returned, and we can output it as the result of our flow:

## Summary

Tableau Prep can now make use of Python scripts. Getting it to work is pretty straightforward, with a few technical details that we have covered in this article.

Feedback I have received from customers so far is that it would be great to be able to call a function that has been published to TabPy, rather than depending on individual .py files. I think it would also be great to be able to pass and return multiple dataframes to a Python function. This is version 1.0 of this capability, so we’ll see how the team innovates in the future.