This project was completed as part of the MIT IDSS Data Science and Machine Learning course. It has extra relevance for me given my experience in the education sector and previous analysis of churn models.
Extra Learn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the challenges Extra Learn faces is identifying which leads are more likely to convert, so that it can allocate resources accordingly.
Data Source & Methodology
The data contains the different attributes of leads and their interaction details with Extra Learn. Some examples of variables provided in the dataset include:
- age
- source of first interaction with the website
- whether the website profile was completed
- number of visits to the website
- time spent on the website
- page views per visit
- email / phone / website interaction with the potential customer
- whether the lead was referred to Extra Learn by someone
Most importantly, the dataset contains the outcome variable status - a flag indicating whether the lead was converted to a paid customer or not.
With this dataset, we will use Decision Tree and Random Forest models to:
- analyze and build an ML model to help identify which leads are more likely to convert to paid customers
- find the factors driving the lead conversion process
- create a profile of the leads which are likely to convert
Data Preparation
Show the code
# Importing the basic libraries we will require for the project

# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()

# Importing the Machine Learning models we require from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn import tree
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Importing the other functions we may require from Scikit-Learn
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer

# To get different metric scores
from sklearn.metrics import (
    make_scorer,
    mean_squared_error,
    r2_score,
    mean_absolute_error,
    recall_score,
    confusion_matrix,
    classification_report,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
)

# Code to ignore warnings from function usage
import warnings

warnings.filterwarnings('ignore')
Load in dataset
Show the code
# Read the ExtraaLearn dataset file
data = pd.read_csv("ExtraaLearn.csv")

# Copying data to another variable to avoid any changes to the original data
same_data = data.copy()

# View first 5 rows
data.head()
##        ID  age current_occupation  ... educational_channels referral status
## 0  EXT001   57         Unemployed  ...                   No       No      1
## 1  EXT002   56       Professional  ...                  Yes       No      0
## 2  EXT003   52       Professional  ...                   No       No      0
## 3  EXT004   53         Unemployed  ...                   No       No      1
## 4  EXT005   23            Student  ...                   No       No      0
##
## [5 rows x 15 columns]
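A quick structural check (not part of the original output) would confirm the column types and whether any values are missing before preprocessing:
# Inspect column types and non-null counts
data.info()

# Count missing values per column
print(data.isnull().sum())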
Data Preprocessing
Outlier detection and treatment
Show the code
# Defining the hist_box() function
def hist_box(data, col):
    f, (ax_box, ax_hist) = plt.subplots(
        2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(12, 6)
    )
    # Adding a graph in each part
    sns.boxplot(data=data, x=col, ax=ax_box, showmeans=True)
    sns.histplot(data=data, x=col, kde=True, ax=ax_hist)
    plt.show()
Show the code
num_col = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
for column in num_col:
    hist_box(data, column)
Observations
- The variables for number of website visits, time spent on website, and page views per visit are skewed to the right
- The distribution of age is mostly uniform, apart from a small spike of records around 60 years old
Observations
The variable with the most values beyond the upper IQR whisker is page views per visit
Looking at the five highest values, they are not far apart, so it is plausible that they reflect genuine behaviour
Given this, and the low rate of outliers across the rest of the dataset, the data was left unchanged
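These counts can be reproduced with the standard 1.5 × IQR whisker rule; a minimal sketch (the original outlier-counting code is not shown):
# Count values beyond the 1.5 * IQR whiskers for each numeric column
for column in num_col:
    q1, q3 = data[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    upper = q3 + 1.5 * iqr
    lower = q1 - 1.5 * iqr
    n_outliers = ((data[column] > upper) | (data[column] < lower)).sum()
    print(f"{column}: {n_outliers} values outside the IQR whiskers")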
Data preparation for modeling
Show the code
# We are removing the outcome variable from the feature set
# Also removing the variable ID as it is unique to each record and not required for analysis
X = data.drop(['status', 'ID'], axis=1)

# And then we are extracting the outcome variable separately
Y = data['status']
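The train and test shapes reported below have 16 feature columns, more than the 13 left after dropping ID and status, which suggests the categorical variables were one-hot encoded before the split. A minimal sketch of that presumed step:
# Presumed encoding step (not shown in the original code):
# one-hot encode the categorical features, dropping the first level of each
X = pd.get_dummies(X, drop_first=True)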
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
Show the code
# Checking the shape of the train and test data
print("Shape of Training set : ", X_train.shape)
## Shape of Training set :  (3228, 16)
print("Shape of test set : ", X_test.shape)
## Shape of test set :  (1384, 16)
y_train.describe().T
## count    3228.000000
## mean        0.298637
## std         0.457731
## min         0.000000
## 25%         0.000000
## 50%         0.000000
## 75%         1.000000
## max         1.000000
## Name: status, dtype: float64
Show the code
y_test.describe().T
## count    1384.000000
## mean        0.298410
## std         0.457726
## min         0.000000
## 25%         0.000000
## 50%         0.000000
## 75%         1.000000
## max         1.000000
## Name: status, dtype: float64
Observations
Splitting the X and Y datasets into training and test sets has produced an even class distribution: roughly 30% of leads converted in both sets, as expected given the stratified split
Building a Decision Tree model
Show the code
# Function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# Function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100


# Function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    pred = model.predict(predictors)  # Predict using the independent variables

    r2 = r2_score(target, pred)  # To compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # To compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # To compute RMSE
    mae = mean_absolute_error(target, pred)  # To compute MAE
    mape = mape_score(target, pred)  # To compute MAPE

    # Creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf
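The baseline model dt_regressor evaluated below is not constructed in the visible code; presumably it was a default DecisionTreeRegressor fit on the training data, along these lines:
# Presumed construction of the baseline decision tree (not shown in the original code)
dt_regressor = DecisionTreeRegressor(random_state=1)
dt_regressor.fit(X_train, y_train)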
# Model Performance on the test data, i.e., prediction
dt_regressor_perf_test = model_performance_regression(dt_regressor, X_test, y_test)
dt_regressor_perf_test
##        RMSE       MAE  R-squared  Adj. R-squared  MAPE
## 0  0.426709  0.182081   0.130304        0.120125   inf
Observations
The initial Decision Tree has a very low adjusted R-squared of ~12% on the test set and is likely not effective at predicting the outcome variable. (MAPE is infinite because the binary target contains zeros, so it is not informative here.)
Show the code
from sklearn import tree

features = list(X_train.columns)

# Building the model with max_depth=4
dt_regressor_visualize = DecisionTreeRegressor(random_state=1, max_depth=4)

# Fitting the model
dt_regressor_visualize.fit(X_train, y_train)
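The rendering step itself is not in the visible code; given the features list and the tree import above, it was presumably something like:
# Render the depth-limited tree (presumed visualization step)
plt.figure(figsize=(20, 10))
tree.plot_tree(dt_regressor_visualize, feature_names=features, filled=True, fontsize=9)
plt.show()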
The F1 score for the training dataset was 0.76 and for the test dataset was 0.74, which indicates the decision tree model is generalising well to the test dataset and is not overfitting the data.
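These F1 scores come from a classification-style evaluation whose code is not shown. A minimal sketch of the metrics_score() helper referenced in later cells, assuming it wraps scikit-learn's classification_report and a confusion-matrix heatmap (this is a reconstruction, not the original implementation):
# Hypothetical reconstruction of the metrics_score() helper used later in this post
def metrics_score(actual, predicted):
    # Print per-class precision, recall and F1
    print(classification_report(actual, predicted))
    # Plot the confusion matrix as an annotated heatmap
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='d',
                xticklabels=['Not Converted', 'Converted'],
                yticklabels=['Not Converted', 'Converted'])
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()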
Tree pruning
Show the code
# Grid of parameters to choose from
parameters = {
    'max_depth': np.arange(2, 7),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [5, 10, 20, 25],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label=1)

# Run the grid search
gridCV = GridSearchCV(dt_regressor, parameters, scoring=scorer, cv=10)

# Fitting the grid search on the train data
gridCV = gridCV.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
dtree_estimator = gridCV.best_estimator_

# Fit the best estimator to the data
dtree_estimator.fit(X_train, y_train)
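The winning parameter combination is not printed in the output above; it can be read off the fitted grid search:
# Inspect the best hyper-parameters and cross-validated score found by the grid search
print(gridCV.best_params_)
print(gridCV.best_score_)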
# Model Performance on the test data, i.e., prediction
dtree_estimator_perf_test = model_performance_regression(dtree_estimator, X_test, y_test)
dtree_estimator_perf_test
##        RMSE       MAE  R-squared  Adj. R-squared  MAPE
## 0  0.471745  0.222543  -0.062961       -0.075403   inf
Observations
Adjusted R-squared has dropped below zero, which indicates that pruning this decision tree (to a maximum depth of 3 with a minimum of 5 samples per leaf) hurts performance and should be avoided.
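The Random Forest evaluated below, rf_classifier, is not constructed in the visible code; presumably it was a default RandomForestClassifier fit on the training data:
# Presumed construction of the baseline Random Forest (not shown in the original code)
rf_classifier = RandomForestClassifier(random_state=1)
rf_classifier.fit(X_train, y_train)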
# Model Performance on the test data
rf_classifier_perf_test = model_performance_regression(rf_classifier, X_test, y_test)
rf_classifier_perf_test
##        RMSE       MAE  R-squared  Adj. R-squared  MAPE
## 0  0.368562  0.135838   0.351179        0.343585   inf
Observations
The adjusted R-squared is much higher for the Random Forest model (approx. 34%) than for the Decision Tree model (approx. 12%)
The F1 score of 0.76 is similar to the previous Decision Tree model's
The Precision score is higher than the Recall score, which indicates this is a more conservative model, less likely to produce false positives
Tree pruning
Show the code
# Choose the type of classifier
# rf_estimator_tuned = RandomForestClassifier(class_weight='balanced', random_state=1)
#
# Grid of parameters to choose from
# params_rf = {
#     "n_estimators": [100, 250, 500],
#     "min_samples_leaf": np.arange(1, 4, 1),
#     "max_features": [0.7, 0.9, 'auto'],
# }
#
# Type of scoring used to compare parameter combinations - recall score for class 1
# scorer = metrics.make_scorer(recall_score, pos_label=1)
#
# Run the grid search
# grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring=scorer, cv=5)
# grid_obj = grid_obj.fit(X_train, y_train)
#
# Set the classifier to the best combination of parameters
# rf_estimator_tuned = grid_obj.best_estimator_
Show the code
# Fitting the model
# rf_estimator_tuned.fit(X_train, y_train)
#
# Model Performance on the test data, i.e., prediction
# rf_estimator_tuned_perf_test = model_performance_regression(rf_estimator_tuned, X_test, y_test)
# rf_estimator_tuned_perf_test
Observations
The adjusted R-squared is lower for the pruned Random Forest model (~28%) than for the untuned model (~34%)
Show the code
# Checking performance on the training data
# y_pred_train_rf_tuned = rf_estimator_tuned.predict(X_train)
# metrics_score(y_train, y_pred_train_rf_tuned)
Observations
The pruned Random Forest model still overfits slightly, but not as severely as the original model
Recall is now higher than Precision
Show the code
# Checking performance on the test data
# y_pred_test_rf_tuned = rf_estimator_tuned.predict(X_test)
# metrics_score(y_test, y_pred_test_rf_tuned)
Observations
The F1 score is similar to the best-fitting Decision Tree model's, with a higher adjusted R-squared value
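Before turning to recommendations, the factors driving conversion can be read off the model's feature importances; a minimal sketch, assuming the tuned rf_estimator_tuned from the commented-out cells above has been fit:
# Rank the encoded features by importance in the tuned Random Forest
# (rf_estimator_tuned is assumed to be fitted, per the grid-search cells above)
importances = pd.DataFrame(
    {"feature": X_train.columns, "importance": rf_estimator_tuned.feature_importances_}
).sort_values("importance", ascending=False)
print(importances.head(10))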
Recommendations
Extra Learn should invest in converting leads obtained via its website
This could include offering incentives to sign up via the website (e.g. special discounts for students who register online or who have started completing their profile)
These offers should be targeted at prospective students who spend more time on the website and view many pages per visit
There may also be value in providing telephone support for prospective leads aged over 25, who may need more help to sign up than those converting directly through the website