Using Tableau Prep’s new Python integration to predict Titanic survivors

In the latest release of Tableau Prep Builder (2019.3), you can now run Python scripts from within data prep flows. This article will show how to use this capability to solve a classic machine learning problem.

Kaggle.com, a site focused on data science competitions and practical problem solving, provides a tutorial based on Titanic passenger survival analysis:

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

https://www.kaggle.com/c/titanic/overview/description

I decided to accomplish the “data grooming” steps for this workflow in the Preppiest (rather than the most Pythonic) way wherever possible, because that seems to be what most Prep users would do in real life. The steps followed in this article closely mirror those in the Kaggle Titanic tutorial. This article is focused on Prep + Python, not on data science, machine learning, or Python best practices.

The most famous Titanic passengers, Kate and Leo, don’t seem to be on the passenger list. Still, we can use this exercise to figure out whether the movie’s ending was statistically predictable. We know a few things about them: she’s rich and a first-class passenger; he’s poor and a third-class passenger. She’s female, he’s male. They’re both relatively young, but not children. In data science these attributes are called features, and part of the process is figuring out which features have the most influence on the outcome.

You can find the finished Prep flow file, along with the Python files, on GitHub here. Assuming you already have Tableau Prep Builder 2019.3.1 or later installed, follow the instructions on the TabPy GitHub site to set up TabPy.

How does Prep work with Python?

The author of a Tableau Prep flow adds a script step, then configures the settings of that step to point to the TabPy server, open the appropriate .py file, and call the appropriate function within that file.

configuration of a Python script step

The Python code within that file defines the called function, which receives a pandas dataframe from Tableau Prep (think of a dataframe as a simple spreadsheet in your computer’s memory), does something with it, and returns a dataframe. If the returned dataframe has a different structure from the received one, the author must write a second function called get_output_schema, which tells Tableau Prep what column names and datatypes to expect. The available datatypes are:

  • prep_string()
  • prep_int()
  • prep_decimal()
  • prep_bool()
  • prep_date()
  • prep_datetime()

For example, the Python file below receives a dataframe from Tableau Prep and uses the pandas groupby function to show the mean Survived score by passenger class. As you can see from the get_output_schema function, this Python script returns only two columns to Prep: an int called Pclass and a decimal called Survived. Note that in this dataset Pclass 1 is first class, so the richest passengers, the people with the highest survival chances, appear at the top of the sorted output.

import pandas as pd

def pclass_survival(df):
    # keep only the training rows, which have a populated Survived column
    train_df = df.query('TestData == 0')
    return train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

def get_output_schema():
    return pd.DataFrame({
        'Pclass': prep_int(),
        'Survived': prep_decimal(),
    })
the output of pclass_survival as seen in Tableau Prep Builder

Yeah, that trailing comma in get_output_schema bothers me too. It doesn’t matter, though: Python allows trailing commas in dict literals.

An essential debugging technique is printing the value of a variable. We can do that here using the console we used to start TabPy. I’ll insert a print statement into the pclass_survival function above, then observe the output in the TabPy console.

def pclass_survival(df):
    train_df = df.query('TestData == 0')
    print(train_df.tail()) # output for debugging
    return train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
output of a Python print statement in the console

If you have the console visible while you are working on your Prep flow, you may be alarmed to see Prep calling your Python scripts repeatedly. What Prep is actually doing is calling the get_output_schema function in your scripts to find out whether anything has changed. Unless you change your .py file, the Tableau Prep flow won’t repeatedly run your main function. As Arthur Gyldenege, a Senior Manager of Product Management at Tableau, explained to me:

We cache intermediate results if nothing has changed in the scripts or in one of the previous steps.
In the console you should be able to see that as TabPy actually outputs what is being executed. The difference between execution and schema computation is the last couple of lines. In the first case, it will make a call to your execution method, in the second one, it’ll call get_output_schema.

Building the Titanic analysis Prep flow

Kaggle breaks the Titanic passenger list into test and train comma-separated value (CSV) files. The training file includes a populated “Survived” column, whereas the testing file does not. Because Python scripts called by Prep can only consume and return a single dataframe, and we’ll need access to both datasets in our final prediction Python script, our first step in Prep will be to union the test and train data. To let Python tell the two apart, we’ll add a “TestData” flag column: 0 for training rows, 1 for test rows.
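
For reference, here is a minimal pandas sketch of the same union-and-flag step, assuming the standard Kaggle file names train.csv and test.csv (the article does this in Prep, not in Python):

import pandas as pd

# read the two Kaggle files and flag each row's origin
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train['TestData'] = 0
test['TestData'] = 1

# union them into the single dataframe that a Prep-called script would receive
combined = pd.concat([train, test], ignore_index=True, sort=False)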

To separate the datasets within each of our Python scripts, we use the pandas “query” function to derive a training pandas dataframe, like this:

def predict_survival(df):
    train_df = df.query('TestData == 0')
    ...

We will be using a few machine learning algorithms imported from the scikit-learn library (https://scikit-learn.org/stable/). These algorithms don’t accept strings or null values, and the tutorial also converts the continuous columns into discrete bands, so we have some work to do to transform this data set.

Age may be an important factor in predicting survival, but we have a problem: 177 rows in the training data (and 86 more in the test data) have no age value. Let’s complete that data by filling those nulls with the average age for the passenger’s sex and class. We could do that with a Python script, but we can also do it in two steps with Prep. The two steps are labeled “average age by sex and pclass” and “clean age” in the finished Prep flow.
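
If you preferred to stay in Python, a minimal pandas sketch of the same imputation might look like this (the function name clean_age echoes the Prep step label and is otherwise my own):

def clean_age(df):
    # fill missing ages with the mean age of each sex / passenger-class group
    df['Age'] = df.groupby(['Sex', 'Pclass'])['Age'] \
                  .transform(lambda ages: ages.fillna(ages.mean()))
    return df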

We also need to transform several continuous columns into “buckets” or “bins.” I found this inconvenient in Python because the tutorial first computes the band edges with pd.cut and then hard-codes those thresholds to create the bins. Hard-coded thresholds aren’t viable for any sort of long-term, automated, maintainable solution.

train_df['AgeBand'] = pd.cut(train_df['Age'], 5)

	AgeBand	Survived
0	(-0.08, 16.0]	0.550000
1	(16.0, 32.0]	0.337374
2	(32.0, 48.0]	0.412037
3	(48.0, 64.0]	0.434783
4	(64.0, 80.0]	0.090909

for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4

The Tableau Prep way to do this uses two steps. First, an aggregate step adds a MaxAge column to the flow.

Then, a cleaning step adds the “AgeBands” calculated field below. The “/4”, “/4*2”, and “/4*3” operations divide the age range into four equal-width bands in an understandable way:

IF [AgeAugmented] <= ([MaxAge]/4) THEN 0
ELSEIF [AgeAugmented] > ([MaxAge]/4) AND [AgeAugmented] <= ([MaxAge]/4*2) THEN 1
ELSEIF [AgeAugmented] > ([MaxAge]/4*2) AND [AgeAugmented] <= ([MaxAge]/4*3) THEN 2
ELSEIF [AgeAugmented] > ([MaxAge]/4*3) THEN 3
END
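
For comparison, pandas can produce similar equal-width bands in one line. (A sketch: pd.cut bins between the observed minimum and maximum rather than between 0 and MaxAge, so the band edges differ slightly from the Prep calculation.)

# integer-labeled equal-width age bands, a rough pandas equivalent of the Prep calc
df['AgeBands'] = pd.cut(df['Age'], bins=4, labels=False)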

On the chance that a passenger’s title is statistically significant, we need to group the many titles into five categories. In Python, this step looks like:

for dataset in combine:
    # collapse uncommon titles into a single 'Rare' category
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    # normalize French and abbreviated titles to their common equivalents
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

… but in Prep, it’s a simple group and replace within a cleaning step.

The title_survival.py script enables analysis of the survival rates by title, again using the pandas groupby/mean operation.

def title_survival(df):
    train_df = df.query('TestData == 0')
    return train_df[["Title", "Survived"]].groupby(['Title'], as_index=False).mean().sort_values(by='Survived', ascending=False)

def get_output_schema():
    return pd.DataFrame({
        'Title': prep_string(),
        'Survived': prep_decimal(),
    })

… and the result looks like this in Prep:

Optimistic news for Kate, but it’s not looking good for Leo. If Captain Smith had married them, both of their chances would have improved.

One common step when doing this sort of analysis is “feature engineering,” where you combine aspects of multiple features to improve the accuracy of the prediction. In this case we are going to create a new column called “FamilySize” by combining the sibling/spouse (SibSp) and parent/child (Parch) columns. The Python function for this couldn’t be simpler:

def family_size(df):
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    return df

This function adds the “FamilySize” column to the dataframe. Since the returned dataframe differs from the one we received, we need to describe it with a get_output_schema() function. Note that the FamilySize column has been added with an int datatype:

def get_output_schema():
    return pd.DataFrame({
        'TestData': prep_int(),
        'AgeBands': prep_int(),
        'Embarked': prep_string(),
        'Pclass': prep_int(),
        'Survived': prep_int(),
        'Parch': prep_int(),
        'SexNum': prep_int(),
        'TitleNum': prep_int(),
        'Title': prep_string(),
        'Name': prep_string(),
        'SibSp': prep_int(),
        'Fare': prep_decimal(),
        'FamilySize': prep_int(),
        'Sex': prep_string(),
        'Age': prep_decimal(),
    })
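
One more wrinkle: the final schemas later in this article include an IsAlone column derived from FamilySize. The Kaggle tutorial derives that flag roughly like this (a sketch; in the Prep flow it could just as easily be a calculated field):

def is_alone(df):
    # a passenger with a family size of 1 is traveling alone
    df['IsAlone'] = 0
    df.loc[df['FamilySize'] == 1, 'IsAlone'] = 1
    return df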

Evaluating algorithms

The author of the blog post on which I’m basing this article documents eight different algorithms. We’ll try three here: logistic regression, support vector machines, and random forest. In the logistic regression code, we’ll not only compute an accuracy score, but output the feature correlations as well:

def log_reg(df):
    from sklearn.linear_model import LogisticRegression
    logreg = LogisticRegression()

    train_df = df.query('TestData==0')
    test_df = df.query('TestData==1')

    # move the survived column from the training data over to a separate dataset 
    # because it's the dependent variable we need to pass to the algorithm
    X_train = train_df.drop("Survived", axis=1)
    Y_train = train_df["Survived"]

    # drop the "survived" column from the test data since it's empty
    X_test  = test_df.drop("Survived", axis=1).copy()

    logreg.fit(X_train, Y_train) # tell the logistic regression algorithm to build 
        # a model based on the training data
    Y_pred = logreg.predict(X_test) # predict survival on the test data

    # output the accuracy of the algorithm
    acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
    print(acc_log)
    lr_score = pd.DataFrame(data={'algorithm':['Logistic Regression'],'score':[acc_log]})

    # find coefficient for each feature
    coeff_df = pd.DataFrame(train_df.columns.delete(0))
    coeff_df.columns = ['Feature']
    coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
    coeff_df = coeff_df.sort_values(by='Correlation', ascending=False) # assign back, or the sort has no effect
    print(coeff_df)

    return lr_score
    

def get_output_schema():
    return pd.DataFrame({
        'algorithm': prep_string(),
        'score': prep_decimal(),
    })

To step through the code above: first we divide the incoming dataframe into test and training dataframes, as discussed toward the top of this article. Then we move the dependent variable, “Survived”, into its own variable, because the algorithm needs it passed separately. We then call the “fit” function, which tells the algorithm to build a prediction model based on the training data. Next, we call the “predict” function, telling the algorithm to predict survival for the test data.

acc_log is the model’s accuracy score on the training data. We’ll ask for one of those from each of the algorithms.

accuracy scores for each algorithm
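
This article only shows code for logistic regression and random forest; as a sketch, the support vector machine scoring function would follow the same pattern (the function name svm_score and the default SVC parameters are my assumptions):

def svm_score(df):
    from sklearn.svm import SVC

    # train only on the rows that have a Survived label
    train_df = df.query('TestData==0')
    X_train = train_df.drop("Survived", axis=1)
    Y_train = train_df["Survived"]

    svc = SVC()
    svc.fit(X_train, Y_train)

    # training-data accuracy, the same metric as acc_log above
    acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
    return pd.DataFrame(data={'algorithm': ['Support Vector Machines'], 'score': [acc_svc]})

The accompanying get_output_schema would be identical to the one in the logistic regression script.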

coeff_df is a dataframe we build containing a row for each feature and an estimate of how influential each feature is on the dependent variable – survival. Sex is most important, followed by wealth (fare) and title.

Correlation of features to survival

Predicting Survival

Based on the evaluations of these few algorithms, it makes sense to use random forest. In real life a data scientist might evaluate more algorithms, and combinations of algorithms, to find the most accurate prediction.

def predict_survival(df):
    from sklearn.ensemble import RandomForestClassifier

    train_df = df.query('TestData==0')
    test_df = df.query('TestData==1')

    X_train = train_df.drop("Survived", axis=1)
    Y_train = train_df["Survived"]

    X_test  = test_df.drop("Survived", axis=1).copy()

    # build a forest of 100 decision trees from the training data
    random_forest = RandomForestClassifier(n_estimators=100)
    random_forest.fit(X_train, Y_train)
    Y_pred = random_forest.predict(X_test)

    # attach the predictions to the test rows and return them to Prep
    X_test['survive_pred'] = Y_pred
    return X_test

def get_output_schema():
    return pd.DataFrame({
        'EmbarkedNum': prep_int(),
        'Pclass': prep_int(),
        'FamilySize': prep_int(),
        'AgeBands': prep_int(),
        'SexNum': prep_int(),
        'TitleNum': prep_int(),
        'IsAlone': prep_int(),
        'Fare': prep_decimal(),
        'FareBin': prep_int(),
        'survive_pred': prep_decimal(),
    })

This code is nearly identical to the logistic regression code in the example above. Rather than using the logistic regression class, it uses the random forest classifier. It calls the fit function, then the predict function, saving the result to a new column on the test dataframe. That test dataframe is then returned, and we can output it as the result of our flow:

Summary of Titanic Learnings

Tableau Prep can now make use of Python scripts. Getting it to work is pretty straightforward, with a few technical details that we have covered in this article.

Feedback on this article is encouraged and always welcome. Thanks for reading, and for using Tableau!

Acknowledgments

The shoulders on which I stand in writing this article include Joshua Milligan, bashii-Iwahashi Tomohiro (thank you Google/Chrome for translating), and Tom.

See Also

Now that you’re a fan of Prep and Python, take a look at Fearghal’s post on how to export CSVs: https://gunningfortableau.com/2019/10/14/tableau-as-traditional-bi/

Author: Larry Clark

Larry has been a Tableau employee since October 2018. Prior to Tableau, Larry was a BI consultant for Catapult Systems. Prior to Catapult, Larry was in DB/BI technical sales at Microsoft for more than 20 years.
