Profiling with Decision Trees / Random Forest

Introduction

This project was completed as part of the MIT IDSS Data Science and Machine Learning course. It is particularly relevant to me given my experience in the education sector and previous analysis of churn models.

ExtraLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill or reskill. With a large number of leads being generated on a regular basis, one of the challenges ExtraLearn faces is identifying which leads are more likely to convert, so that it can allocate resources accordingly.

Data Source & Methodology

The data contains the different attributes of leads and their interaction details with ExtraLearn. Variables provided in the dataset include:

  • age
  • source of the first interaction with the website
  • whether the website profile was completed
  • number of visits to the website
  • time spent on the website
  • page views per visit
  • email / phone / website interaction with the potential customer
  • whether the lead was referred to ExtraLearn by someone

Most importantly, the dataset contains the outcome variable status - a flag indicating whether the lead was converted to a paid customer or not.

With this dataset, we will use Decision Tree and Random Forest models to:

  • analyze and build an ML model to help identify which leads are more likely to convert to paid customers
  • find the factors driving the lead conversion process
  • create a profile of the leads which are likely to convert

Data Preparation

Show the code
# Importing the basic libraries we will require for the project

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn import tree
from sklearn.metrics import make_scorer,mean_squared_error, r2_score, mean_absolute_error, recall_score
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# To get different metric scores
from sklearn.metrics import confusion_matrix,classification_report,roc_auc_score,precision_recall_curve,roc_curve,make_scorer

# Code to ignore warnings from function usage
import warnings
warnings.filterwarnings('ignore')

Loading the dataset

Show the code
# Read the ExtraaLearn leads dataset file
data=pd.read_csv("ExtraaLearn.csv")

# Copying data to another variable to avoid any changes to original data
same_data = data.copy()

# View first 5 rows
data.head()
##        ID  age current_occupation  ... educational_channels referral  status
## 0  EXT001   57         Unemployed  ...                   No       No       1
## 1  EXT002   56       Professional  ...                  Yes       No       0
## 2  EXT003   52       Professional  ...                   No       No       0
## 3  EXT004   53         Unemployed  ...                   No       No       1
## 4  EXT005   23            Student  ...                   No       No       0
## 
## [5 rows x 15 columns]
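
Since status is the outcome we will be modelling, it is worth checking its class balance up front; roughly 30% of leads convert, as the describe() output later confirms. A minimal sketch using the data frame loaded above:

# Proportion of converted (1) vs non-converted (0) leads
print(data['status'].value_counts(normalize=True))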

Data Preprocessing

Outlier detection and treatment

Show the code
# Defining the hist_box() function
def hist_box(data, col):
    f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(12, 6))
    # Adding a graph in each part
    sns.boxplot(data=data, x=col, ax=ax_box, showmeans=True)
    sns.histplot(data=data, x=col, kde=True, ax=ax_hist)
    plt.show()
Show the code
num_col = ['age','website_visits','time_spent_on_website','page_views_per_visit']

for column in num_col:
  hist_box(data,column)

Observations

  • The variables for number of website visits, time spent on the website, and page views per visit were skewed to the right
  • The distribution of age was mostly uniform, except for a small spike of records around 60 years old
  • The variable with the most values beyond the upper whisker (Q3 + 1.5 × IQR) was page views per visit
  • Looking at the highest five values, they do not seem very far apart, so it is plausible that these values reflect genuine behaviour
  • Given this, and the low rate of outlier values in the rest of the dataset, the data was left unchanged (see the quick check below)
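
As a numeric complement to the plots, the skewness of each variable and the number of values above the upper whisker (Q3 + 1.5 × IQR) can be checked directly. A minimal sketch, reusing data and num_col from the cells above:

# Skewness and count of values above the upper whisker for each numeric column
for column in num_col:
    q1, q3 = data[column].quantile([0.25, 0.75])
    upper_whisker = q3 + 1.5 * (q3 - q1)
    n_high = (data[column] > upper_whisker).sum()
    print(f"{column}: skew = {data[column].skew():.2f}, "
          f"values above upper whisker = {n_high}")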

Data preparation for modeling

Show the code
# We are removing the outcome variable from the feature set
# Also removing the variable ID as it is unique to each record and not required for analysis
X = data.drop(['status', 'ID'], axis = 1)

# And then we are extracting the outcome variable separately
Y = data['status']
Show the code
X = pd.get_dummies(X, drop_first = True)

Y.head()
## 0    1
## 1    0
## 2    0
## 3    1
## 4    0
## Name: status, dtype: int64
Show the code
# Splitting the data into train and test sets
X_train,X_test,y_train,y_test=train_test_split(X,Y,test_size=0.30,random_state=1,stratify=Y)
Show the code
# Checking the shape of the train and test data
print("Shape of Training set : ", X_train.shape)
## Shape of Training set :  (3228, 16)
print("Shape of test set : ", X_test.shape)
## Shape of test set :  (1384, 16)

Exploratory Data Analysis

Show the code
X_train.describe().T
##                         count        mean  ...       75%       max
## age                    3228.0   46.155204  ...    57.000    63.000
## website_visits         3228.0    3.642813  ...     5.000    29.000
## time_spent_on_website  3228.0  727.674721  ...  1363.750  2537.000
## page_views_per_visit   3228.0    3.083167  ...     3.778    18.434
## 
## [4 rows x 8 columns]
Show the code
X_test.describe().T
##                         count        mean  ...        75%       max
## age                    1384.0   46.308526  ...    57.0000    63.000
## website_visits         1384.0    3.389451  ...     5.0000    30.000
## time_spent_on_website  1384.0  715.466763  ...  1283.7500  2531.000
## page_views_per_visit   1384.0    2.893083  ...     3.7095    18.302
## 
## [4 rows x 8 columns]
Show the code
y_train.describe().T
## count    3228.000000
## mean        0.298637
## std         0.457731
## min         0.000000
## 25%         0.000000
## 50%         0.000000
## 75%         1.000000
## max         1.000000
## Name: status, dtype: float64
Show the code
y_test.describe().T
## count    1384.000000
## mean        0.298410
## std         0.457726
## min         0.000000
## 25%         0.000000
## 50%         0.000000
## 75%         1.000000
## max         1.000000
## Name: status, dtype: float64

Observations

  • Stratifying the split on the outcome variable has resulted in an even distribution of converted leads across the training and test sets (roughly 30% in each), as the quick check below confirms
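
Because stratify=Y was passed to train_test_split, the conversion rate should be almost identical in the two sets. A quick sketch to confirm:

# Conversion rate in the train and test sets (should match due to stratification)
print("Proportion converted (train):", round(y_train.mean(), 4))
print("Proportion converted (test) :", round(y_test.mean(), 4))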

Building a Decision Tree model

Show the code
# Function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# Function to compute MAPE
# Note: MAPE is undefined (inf) when the target contains zeros, as with the 0/1 status variable here
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100


# Function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    pred = model.predict(predictors)                  # Predict using the independent variables
    r2 = r2_score(target, pred)                       # To compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)    # To compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # To compute RMSE
    mae = mean_absolute_error(target, pred)           # To compute MAE
    mape = mape_score(target, pred)                   # To compute MAPE

    # Creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf
Show the code
# Creating metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8,5))

    sns.heatmap(cm, annot=True,  fmt='.2f', xticklabels=['Not Converted', 'Converted'], yticklabels=['Not Converted', 'Converted'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()
Show the code
# Decision Tree classifier (note: the variable name dt_regressor is kept for consistency with later cells)
dt_regressor = DecisionTreeClassifier(class_weight = 'balanced', random_state = 1, max_depth = 4)

# Fitting the model
dt_regressor.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=4, random_state=1)
Show the code

# Model Performance on the test data, i.e., prediction
dt_regressor_perf_test = model_performance_regression(dt_regressor, X_test, y_test)

dt_regressor_perf_test
##        RMSE       MAE  R-squared  Adj. R-squared  MAPE
## 0  0.426709  0.182081   0.130304        0.120125   inf

Observations

  • The initial Decision Tree has a very low adjusted R-squared of ~12% on the test set, suggesting that, judged on this regression-style metric, it is not very effective at predicting the outcome variable
Show the code
from sklearn import tree
features = list(X_train.columns)

# Building a regression tree with max_depth=4 for visualisation
dt_regressor_visualize = DecisionTreeRegressor(random_state = 1, max_depth = 4)

# Fitting the model
dt_regressor_visualize.fit(X_train, y_train)
DecisionTreeRegressor(max_depth=4, random_state=1)
Show the code

plt.figure(figsize = (10, 5))
tree.plot_tree(dt_regressor_visualize, feature_names = features,  filled = True, fontsize = 6, class_names = True)
plt.tight_layout()
plt.show()

Show the code
print(tree.export_text(dt_regressor_visualize, feature_names=X_train.columns.tolist(), show_weights=True))
## |--- first_interaction_Website <= 0.50
## |   |--- time_spent_on_website <= 419.50
## |   |   |--- age <= 18.50
## |   |   |   |--- website_visits <= 4.50
## |   |   |   |   |--- value: [0.00]
## |   |   |   |--- website_visits >  4.50
## |   |   |   |   |--- value: [0.50]
## |   |   |--- age >  18.50
## |   |   |   |--- age <= 24.50
## |   |   |   |   |--- value: [0.01]
## |   |   |   |--- age >  24.50
## |   |   |   |   |--- value: [0.00]
## |   |--- time_spent_on_website >  419.50
## |   |   |--- last_activity_Website Activity <= 0.50
## |   |   |   |--- referral_Yes <= 0.50
## |   |   |   |   |--- value: [0.16]
## |   |   |   |--- referral_Yes >  0.50
## |   |   |   |   |--- value: [0.56]
## |   |   |--- last_activity_Website Activity >  0.50
## |   |   |   |--- profile_completed_Medium <= 0.50
## |   |   |   |   |--- value: [0.34]
## |   |   |   |--- profile_completed_Medium >  0.50
## |   |   |   |   |--- value: [0.59]
## |--- first_interaction_Website >  0.50
## |   |--- time_spent_on_website <= 414.00
## |   |   |--- profile_completed_Medium <= 0.50
## |   |   |   |--- age <= 27.50
## |   |   |   |   |--- value: [0.11]
## |   |   |   |--- age >  27.50
## |   |   |   |   |--- value: [0.58]
## |   |   |--- profile_completed_Medium >  0.50
## |   |   |   |--- last_activity_Website Activity <= 0.50
## |   |   |   |   |--- value: [0.00]
## |   |   |   |--- last_activity_Website Activity >  0.50
## |   |   |   |   |--- value: [0.10]
## |   |--- time_spent_on_website >  414.00
## |   |   |--- last_activity_Phone Activity <= 0.50
## |   |   |   |--- age <= 25.00
## |   |   |   |   |--- value: [0.37]
## |   |   |   |--- age >  25.00
## |   |   |   |   |--- value: [0.83]
## |   |   |--- last_activity_Phone Activity >  0.50
## |   |   |   |--- profile_completed_Medium <= 0.50
## |   |   |   |   |--- value: [0.62]
## |   |   |   |--- profile_completed_Medium >  0.50
## |   |   |   |   |--- value: [0.12]
Show the code
# Importance of features in the tree building

feature_names = list(X_train.columns)
importances = dt_regressor.feature_importances_
indices = np.argsort(importances)

plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.tight_layout()
plt.show()

Observations

  • Root Node: First interaction via website <=0.50
  • Internal Nodes:
    • Time spent on website (~414-419.50 seconds)
    • Profile completed to medium level <=0.50
    • Age <= 18.50 years
    • Last activity via website <= 0.50
    • Last activity via phone <=0.50

Interpretations and Conclusions:

  • The decision tree starts with a split on the First interaction via website feature
  • If the lead's first interaction was not via the website, the tree next considers the time spent on the website (split thresholds of ~414-419.50)
  • At the third level, age, whether last activity was via website/phone and profile completed to medium level were included in the decision tree
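
To make these splits concrete, the path a single lead takes through the fitted classification tree can be traced with decision_path(). A sketch, assuming dt_regressor and X_test from the earlier cells; the lead in row 0 is just an illustrative example:

# Trace the decision path of one lead through the fitted classification tree
sample = X_test.iloc[[0]]                        # one example lead
node_path = dt_regressor.decision_path(sample)   # sparse matrix of visited nodes
leaf = dt_regressor.apply(sample)[0]             # id of the leaf this lead ends in

for node_id in node_path.indices:
    if node_id == leaf:
        print(f"Leaf {node_id}: predicted class = {dt_regressor.predict(sample)[0]}")
        break
    feat_idx = dt_regressor.tree_.feature[node_id]
    threshold = dt_regressor.tree_.threshold[node_id]
    value = sample.iloc[0, feat_idx]
    direction = "<=" if value <= threshold else ">"
    print(f"Node {node_id}: {X_test.columns[feat_idx]} = {value} {direction} {threshold:.2f}")
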
Show the code
# Checking performance on the training dataset

pred_train_dt = dt_regressor.predict(X_train)

metrics_score(y_train, pred_train_dt)
##               precision    recall  f1-score   support
## 
##            0       0.95      0.80      0.87      2264
##            1       0.66      0.89      0.76       964
## 
##     accuracy                           0.83      3228
##    macro avg       0.80      0.85      0.82      3228
## weighted avg       0.86      0.83      0.84      3228

Observation

  • For the converted class, recall = 0.89 and precision = 0.66, which indicates this decision tree prioritises catching positive outcomes (conversions) at the cost of more false positives
Show the code
# Checking performance on the testing data
y_pred_test_dt = dt_regressor.predict(X_test)

metrics_score(y_test, y_pred_test_dt)
##               precision    recall  f1-score   support
## 
##            0       0.93      0.80      0.86       971
##            1       0.64      0.87      0.74       413
## 
##     accuracy                           0.82      1384
##    macro avg       0.79      0.83      0.80      1384
## weighted avg       0.85      0.82      0.82      1384

Observations

  • The class-1 F1 score was 0.76 on the training set and 0.74 on the test set, which indicates the decision tree model is generalising well to the test data and is not overfitting.
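
The same comparison can be made explicit by computing the class-1 F1 score on both sets (a small sketch; f1_score is an extra import on top of those above):

# Class-1 F1 score on the training and test sets
from sklearn.metrics import f1_score
print("F1 (train):", round(f1_score(y_train, pred_train_dt), 3))
print("F1 (test) :", round(f1_score(y_test, y_pred_test_dt), 3))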

Tree pruning

Show the code
# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 7),
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [5, 10, 20, 25]
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
gridCV = GridSearchCV(dt_regressor, parameters, scoring = scorer, cv = 10)

# Fitting the grid search on the train data
gridCV = gridCV.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
dtree_estimator = gridCV.best_estimator_

# Fit the best estimator to the data
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(3),
                       min_samples_leaf=5, random_state=1)
Show the code
# Fitting the model
dtree_estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(3),
                       min_samples_leaf=5, random_state=1)
Show the code

# Model Performance on the test data, i.e., prediction
dtree_estimator_perf_test = model_performance_regression(dtree_estimator, X_test, y_test)

dtree_estimator_perf_test
##        RMSE       MAE  R-squared  Adj. R-squared  MAPE
## 0  0.471745  0.222543  -0.062961       -0.075403   inf

Observations

  • Adjusted R-squared has dropped below zero on the test set, which suggests that pruning this decision tree to a maximum depth of 3 with at least 5 samples per leaf should be avoided, at least judged on this metric.
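
It is also worth inspecting exactly what the grid search settled on; a small sketch reusing gridCV from the cell above:

# Parameters and cross-validated recall of the best pruned tree
print("Best parameters            :", gridCV.best_params_)
print("Best cross-validated recall:", round(gridCV.best_score_, 3))
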
Show the code

tree.plot_tree(dtree_estimator, feature_names = features, filled = True, fontsize = 8)
plt.tight_layout()
plt.show()

Show the code
# Checking performance on the training dataset
y_train_pred_dt = dtree_estimator.predict(X_train)

metrics_score(y_train, y_train_pred_dt)
##               precision    recall  f1-score   support
## 
##            0       0.95      0.75      0.84      2264
##            1       0.60      0.91      0.72       964
## 
##     accuracy                           0.79      3228
##    macro avg       0.78      0.83      0.78      3228
## weighted avg       0.85      0.79      0.80      3228

Show the code
# Checking performance on the test dataset
y_test_pred_dt = dtree_estimator.predict(X_test)

metrics_score(y_test, y_test_pred_dt)
##               precision    recall  f1-score   support
## 
##            0       0.94      0.73      0.82       971
##            1       0.58      0.89      0.70       413
## 
##     accuracy                           0.78      1384
##    macro avg       0.76      0.81      0.76      1384
## weighted avg       0.83      0.78      0.79      1384

Observations

  • The class-1 F1 score has also dropped with the pruned model, to 0.72 on the training set and 0.70 on the test set
Show the code
importances = dtree_estimator.feature_importances_

columns = X.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
sns.barplot(x = importance_df.Importance, y = importance_df.index).set(ylabel='')
plt.tight_layout()

Observations

  • The variable age is no longer rated as an important feature in the pruned model

Building a Random Forest model

Show the code
# Random Forest Classifier
rf_classifier = RandomForestClassifier(class_weight='balanced', random_state = 1)

# Fitting the model
rf_classifier.fit(X_train, y_train)
RandomForestClassifier(class_weight='balanced', random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Show the code

# Model Performance on the test data
rf_classifier_perf_test = model_performance_regression(rf_classifier, X_test, y_test)

rf_classifier_perf_test
##        RMSE       MAE  R-squared  Adj. R-squared  MAPE
## 0  0.368562  0.135838   0.351179        0.343585   inf

Observations

  • The adjusted R-squared on the test set is much higher for the Random Forest model (approx. 34%) than for the Decision Tree model (approx. 12%); see the side-by-side comparison below
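
Collecting the regression-style test metrics side by side makes this comparison easier to read; a sketch using the performance data frames computed above:

# Combine the test-set metric tables for the three models fitted so far
comparison = pd.concat(
    [dt_regressor_perf_test, dtree_estimator_perf_test, rf_classifier_perf_test],
    keys=['Decision Tree', 'Pruned Decision Tree', 'Random Forest']
)
print(comparison)
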
Show the code
importances = rf_classifier.feature_importances_

columns = X.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

sns.barplot(x=importance_df.Importance, y=importance_df.index).set(ylabel='')
plt.tight_layout()

Observations

  • The variables for time spent on website and first interaction via website are rated as most important (same as the Decision Tree model)
  • Variables for profile completed to medium level, page views per visit and age are also rated as important features (same as the Decision Tree model).
Show the code
# Checking performance on the training dataset

pred_train_rf = rf_classifier.predict(X_train)

metrics_score(y_train, pred_train_rf)
##               precision    recall  f1-score   support
## 
##            0       1.00      1.00      1.00      2264
##            1       1.00      1.00      1.00       964
## 
##     accuracy                           1.00      3228
##    macro avg       1.00      1.00      1.00      3228
## weighted avg       1.00      1.00      1.00      3228

Observations

  • The Random Forest model fits the training data perfectly (all metrics equal to 1.00), a clear sign of overfitting, so it may not generalise as well to unseen data (see the out-of-bag check below)
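
A quick way to gauge generalisation without touching the test set is the out-of-bag score. A sketch; note that oob_score=True is an addition and this refits a separate forest rather than reusing the model above:

# Out-of-bag accuracy as a generalisation check for the Random Forest
rf_oob = RandomForestClassifier(class_weight='balanced', oob_score=True, random_state=1)
rf_oob.fit(X_train, y_train)
print("OOB accuracy:", round(rf_oob.oob_score_, 3))
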
Show the code
# Checking performance on the testing data
y_pred_test_rf = rf_classifier.predict(X_test)

metrics_score(y_test, y_pred_test_rf)
##               precision    recall  f1-score   support
## 
##            0       0.89      0.92      0.91       971
##            1       0.80      0.72      0.76       413
## 
##     accuracy                           0.86      1384
##    macro avg       0.84      0.82      0.83      1384
## weighted avg       0.86      0.86      0.86      1384

Observations

  • The class-1 F1 score of 0.76 on the test set is similar to that of the previous Decision Tree model
  • Precision is now higher than recall, which indicates a more conservative model that is less likely to produce false positives; the sketch below shows how this trade-off can be shifted
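
If catching more conversions matters more than avoiding false positives, the default 0.5 decision threshold can be lowered using the predicted probabilities. A sketch (precision_score is an extra import, and the thresholds shown are illustrative):

# Trade precision for recall by lowering the decision threshold
from sklearn.metrics import precision_score
proba = rf_classifier.predict_proba(X_test)[:, 1]
for threshold in (0.5, 0.4, 0.3):
    preds = (proba >= threshold).astype(int)
    print(f"threshold = {threshold}: "
          f"recall = {recall_score(y_test, preds):.2f}, "
          f"precision = {precision_score(y_test, preds):.2f}")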

Tree pruning

Show the code
# Choose the type of classifier
# rf_estimator_tuned = RandomForestClassifier(class_weight='balanced',random_state = 1)
# 
# # Grid of parameters to choose from
# params_rf = {
#         "n_estimators": [100, 250, 500],
#         "min_samples_leaf": np.arange(1, 4, 1),
#         "max_features": [0.7, 0.9, 'auto'],
# }
# 
# 
# # Type of scoring used to compare parameter combinations - recall score for class 1
# scorer = metrics.make_scorer(recall_score, pos_label = 1)
# 
# # Run the grid search
# grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)
# 
# grid_obj = grid_obj.fit(X_train, y_train)
# 
# # Set the classifier to the best combination of parameters
# rf_estimator_tuned = grid_obj.best_estimator_
Show the code
# Fitting the model
# rf_estimator_tuned.fit(X_train, y_train)
# 
# # Model Performance on the test data, i.e., prediction
# rf_estimator_tuned_perf_test = model_performance_regression(rf_estimator_tuned, X_test, y_test)
# 
# rf_estimator_tuned_perf_test

Observations

  • The adjusted R-squared value decreased for the pruned Random Forest model (to ~28%)
Show the code
# Checking performance on the training data
# y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)
# 
# metrics_score(y_train, y_pred_train_rf_tuned)

Observations

  • The pruned Random Forest model still slightly overfits the training data, but not as severely as the original model
  • Recall is now higher than Precision
Show the code
# Checking performance on the test data
# y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)
# 
# metrics_score(y_test, y_pred_test_rf_tuned)

Observations

  • The F1 score is similar to that of the best-fitting Decision Tree model, with a higher adjusted R-squared value
Show the code
# Plotting feature importance
# importances = rf_estimator_tuned.feature_importances_
# 
# columns = X.columns
# 
# importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)
# 
# sns.barplot(x=importance_df.Importance, y=importance_df.index).set(ylabel='')
# plt.tight_layout()

Conclusions

  • ExtraLearn should invest in converting leads whose first interaction happens via its website
  • This could include offers targeted at new students who sign up via the website (e.g. special discounts) or who have started completing their profile
  • These offers should be aimed at prospective students who spend more time on the website and view a larger number of pages
  • There may also be value in providing telephone support for prospective leads aged over 25, as they may need more help to sign up than those who convert directly via the website
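
Finally, a simple way to translate these findings into a lead profile is to compare converted and non-converted leads on the key drivers directly. A sketch using the original (un-dummied) data frame; column names are as loaded from ExtraaLearn.csv:

# Average behaviour of converted (1) vs non-converted (0) leads on the key drivers
profile_cols = ['age', 'time_spent_on_website', 'website_visits', 'page_views_per_visit']
print(data.groupby('status')[profile_cols].mean().round(2))

# Conversion rate by channel of first interaction
print(pd.crosstab(data['first_interaction'], data['status'], normalize='index').round(2))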