Recommendation systems are among the most effective tools for suggesting products to consumers as they browse online. Providing personalized recommendations that are relevant to the user is what is most likely to keep them engaged and benefit the business.
Amazon, for example, is well known for the accuracy of the recommendations on its website. Its recommendation system analyzes and predicts customers' shopping preferences in order to offer them a list of recommended products, making the recommendation algorithm a key element in using AI to personalize the shopping experience.
Data Source & Methodology
The dataset contains user ids, product ids, and the rating each user gave to each product. This project builds a recommendation system that suggests products to customers based on their previous ratings of other products.
Data Preparation
Show the code
import numpy as np
import pandas as pd

# Python libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For implementing matrix factorization based recommendation systems
from surprise.prediction_algorithms.matrix_factorization import SVD

import utils as utils

# For implementing cross validation
from surprise.model_selection import KFold

import warnings
warnings.filterwarnings('ignore')

print('All packages imported successfully!')
## All packages imported successfully!
Load the dataset
Show the code
# Load the data from the csv file
data = pd.read_csv('./data/ratings_Electronics.csv',
                   names=['user_id', 'prod_id', 'rating', 'timestamp'])

# Drop the timestamp column and copy the data to a dataframe called df
data.drop(['timestamp'], axis=1, inplace=True)
df = data.copy()

df.head()
##           user_id     prod_id  rating
## 0   AKM1MP6P0OYPR  0132793040     5.0
## 1  A2CX7LUOHB2NDG  0321732944     5.0
## 2  A2NWSAGRHCP8N5  0439886341     1.0
## 3  A2WNBOD3WNDNKT  0439886341     3.0
## 4  A1GI0U4ZRJA8WN  0439886341     1.0
Create the in-scope dataset
As this dataset is very large (7,824,482 observations), it is not computationally feasible to build a model on a local computer. In addition, many users have rated only a few products and some products have been rated by very few users, so we can reduce the dataset under a few assumptions. We will consider the following as in scope:
users who have given at least 50 ratings
products that have at least 5 ratings (when shopping online, we generally prefer products that already have a reasonable number of ratings)
Show the code
# Get the column containing the users
users = df.user_id

# Create a dictionary from users to their number of ratings
ratings_count = dict()

for user in users:
    # If we already have the user, just add 1 to their rating count
    if user in ratings_count:
        ratings_count[user] += 1
    # Otherwise, set their rating count to 1
    else:
        ratings_count[user] = 1

# We want our users to have at least 50 ratings to be considered
RATINGS_CUTOFF = 50

remove_users = []
for user, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_users.append(user)

df = df.loc[~df.user_id.isin(remove_users)]

# Get the column containing the products
prods = df.prod_id

# Create a dictionary from products to their number of ratings
ratings_count = dict()

for prod in prods:
    # If we already have the product, just add 1 to its rating count
    if prod in ratings_count:
        ratings_count[prod] += 1
    # Otherwise, set its rating count to 1
    else:
        ratings_count[prod] = 1

# We want our products to have at least 5 ratings to be considered
RATINGS_CUTOFF = 5

remove_products = []
for prod, num_ratings in ratings_count.items():
    if num_ratings < RATINGS_CUTOFF:
        remove_products.append(prod)

df_final = df.loc[~df.prod_id.isin(remove_products)]

# Summary statistics of the 'rating' variable
df_final['rating'].describe()
## count    65290.000000
## mean         4.294808
## std          0.988915
## min          1.000000
## 25%          4.000000
## 50%          5.000000
## 75%          5.000000
## max          5.000000
## Name: rating, dtype: float64
The rating variable has a mean of 4.29, which indicates that users tend to rate products positively. This may be because users who are not satisfied with a product often choose not to rate it.
Show the code
# Number of total rows in the data and number of unique user ids and product ids
total_num = df_final.shape[0]
user_num = df_final['user_id'].nunique()
prod_num = df_final['prod_id'].nunique()
avg_ratings = total_num / user_num

print('Total number of rows: ' + str(total_num))
## Total number of rows: 65290
print('Number of unique user ids: ' + str(user_num))
## Number of unique user ids: 1540
print('Number of unique product ids: ' + str(prod_num))
## Number of unique product ids: 5689
print('Average ratings per user: ' + str(avg_ratings))
## Average ratings per user: 42.396103896103895
After limiting the data to users with at least 50 ratings and products with at least 5 ratings, there are 65,290 ratings available for analysis. There are 1,540 users and 5,689 products in this dataset, which means each user has provided ratings for, on average, 42.4 products.
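These counts also show how sparse the user-item matrix is, which matters for the matrix factorization model later on. A quick back-of-the-envelope check using the numbers printed above:

# Density of the user-item matrix implied by the counts above:
# observed ratings divided by all possible (user, product) pairs.
n_ratings = 65290
n_users = 1540
n_products = 5689

density = n_ratings / (n_users * n_products)
print(f"Matrix density: {density:.4%}")  # roughly 0.75%, i.e. about 99.25% of the matrix is empty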
Show the code
# Distribution of ratings
plt.figure(figsize=(12, 4))
sns.countplot(x="rating", data=df)
plt.tick_params(labelsize=10)
plt.title("Distribution of Ratings", fontsize=10)
plt.xlabel("Ratings", fontsize=10)
plt.ylabel("Number of Ratings", fontsize=10)
plt.show()
Run Models
Model 1: Rank Based Recommendation System
Show the code
# Calculate the average rating for each product
average_rating = df_final.groupby('prod_id').mean(numeric_only=True)['rating']

# Calculate the count of ratings for each product
n_rating = df_final.groupby('prod_id').count()['rating']

# Create a dataframe with the calculated average and count of ratings
sum_prod_df = pd.DataFrame(average_rating).set_axis(['average'], axis=1).join(
    pd.DataFrame(n_rating).set_axis(['count'], axis=1))

# Sort the dataframe by average rating in descending order
sum_prod_df_sort = sum_prod_df.sort_values(by='average', ascending=False).copy()

# See the first five records of the sorted dataset
sum_prod_df_sort.head()
##             average  count
## prod_id
## B00LGQ6HL8      5.0      5
## B003DZJQQI      5.0     14
## B005FDXF2C      5.0      7
## B00I6CVPVC      5.0      7
## B00B9KOCYA      5.0      8
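One practical refinement of this rank-based list is to require a minimum number of ratings before a product can be recommended, so that a handful of 5-star ratings cannot dominate the list. A minimal sketch built on the sum_prod_df table above (the helper name and the cutoff of 50 are illustrative, not part of the original analysis):

# Hypothetical helper: top-n products by average rating, restricted to
# products with a minimum number of ratings (uses sum_prod_df built above).
def top_n_products(summary_df, n=5, min_interactions=50):
    eligible = summary_df[summary_df['count'] >= min_interactions]
    return eligible.sort_values(by='average', ascending=False).head(n)

# Example: top 5 products among those rated at least 50 times
top_n_products(sum_prod_df, n=5, min_interactions=50)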
We have recommended the top 5 products using the rank-based (popularity) recommendation system. Now, let's build a recommendation system using collaborative filtering.
Model 2: Collaborative Filtering Recommendation System
We build similarity-based recommendation systems using cosine similarity, with a KNN algorithm used to find the users most similar (nearest neighbors) to a given user.
First we import the relevant packages from the surprise module:
Show the code
# Class used to parse a file containing ratings; data should be in the structure user ; item ; rating
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the rating data into train and test datasets
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation systems
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation systems
from surprise.prediction_algorithms.matrix_factorization import SVD

# For implementing K-Fold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation systems
from surprise import CoClustering
Below we load the rating data, currently a pandas DataFrame, into the surprise.dataset.DatasetAutoFolds format required by this library, using the Reader and Dataset classes.
Show the code
# Instantiating Reader with the expected rating scale
reader = Reader(rating_scale=(1, 5))

# Loading the rating dataset
data = Dataset.load_from_df(df_final[['user_id', 'prod_id', 'rating']], reader)

# Splitting the data into train and test datasets
trainset, testset = train_test_split(data, test_size=0.3, random_state=42)
Next we build the user-user similarity-based recommendation system:
Show the code
# Declaring the similarity options
sim_options = {'name': 'cosine',
               'user_based': True}

# Initialize the KNNBasic model using the sim_options declared and verbose=False
algo_knn_user = KNNBasic(sim_options=sim_options, verbose=False)

# Fit the model on the training data
algo_knn_user.fit(trainset)
## <surprise.prediction_algorithms.knns.KNNBasic object at 0x000001D926DC2E20>

# Compute precision@k, recall@k, and F_1 score using the precision_recall_at_k function
utils.precision_recall_at_k(algo_knn_user, testset)
## RMSE: 1.0250
## Precision: 0.86
## Recall: 0.783
## F_1 score: 0.82
The baseline user-user model has RMSE = 1.03. Recall = 0.78 indicates that, out of all the relevant products, 78% are recommended. Precision = 0.86 indicates that, out of all recommended products, 86% are relevant. An F1 score of 0.82 indicates that most recommended products are relevant and most relevant products are recommended.
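The metrics above come from utils.precision_recall_at_k, a helper in the project's utils module that is not reproduced in this section. For context, here is a minimal sketch of what such a helper typically looks like, following the standard surprise precision@k / recall@k recipe (the k=10 default and the 3.5 relevance threshold are assumptions, not taken from the original code):

# Sketch of a precision@k / recall@k helper in the spirit of utils.precision_recall_at_k
from collections import defaultdict
from surprise import accuracy

def precision_recall_at_k(algo, testset, k=10, threshold=3.5):
    # Predict ratings for the whole test set and report RMSE
    predictions = algo.test(testset)
    accuracy.rmse(predictions)

    # Group the (estimated rating, true rating) pairs by user
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, user_ratings in user_est_true.items():
        # Sort this user's predictions by estimated rating, highest first
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
        n_rec_k = sum(est >= threshold for est, _ in user_ratings[:k])
        n_rel_and_rec_k = sum((true_r >= threshold) and (est >= threshold)
                              for est, true_r in user_ratings[:k])
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel else 0

    # Average the per-user values and report precision, recall, and F1
    precision = round(sum(precisions.values()) / len(precisions), 3)
    recall = round(sum(recalls.values()) / len(recalls), 3)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0
    print('Precision:', precision)
    print('Recall:', recall)
    print('F_1 score:', round(f1, 3))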
Next we tune the hyperparameters of the KNNBasic algorithm, which include:
k (int) – The (max) number of neighbors to take into account for aggregation. Default is 40.
min_k (int) – The minimum number of neighbors to take into account for aggregation. If there are not enough neighbors, the prediction is set to the global mean of all ratings. Default is 1.
sim_options (dict) – A dictionary of options for the similarity measure (cosine, MSD, Pearson, Pearson baseline).
Show the code
# Setting up the parameter grid to tune the hyperparameters
param_grid = {'k': [20, 30, 40], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [True]}}

# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=1)

# Fitting the data
gs.fit(data)
## Computing the msd similarity matrix...
## Done computing similarity matrix.
## (the two lines above repeat for every parameter combination and fold)

# Best RMSE score
print(gs.best_score['rmse'])
## 0.9709899139890842

# Combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
## {'k': 40, 'min_k': 6, 'sim_options': {'name': 'cosine', 'user_based': True}}
Now we build the final model using the tuned hyperparameter values obtained from grid search cross-validation.
Show the code
# Using the optimal similarity measure for user-user based collaborative filtering
sim_options = {'name': 'cosine',
               'user_based': True}

# Creating an instance of KNNBasic with the optimal hyperparameter values
similarity_algo_optimized = KNNBasic(sim_options=sim_options, k=40, min_k=6, verbose=False)

# Training the algorithm on the train set
similarity_algo_optimized.fit(trainset)
## <surprise.prediction_algorithms.knns.KNNBasic object at 0x000001D926DB8250>

# Compute precision@k and recall@k with k=10
utils.precision_recall_at_k(similarity_algo_optimized, testset)
## RMSE: 0.9630
## Precision: 0.85
## Recall: 0.809
## F_1 score: 0.829
Recall has increased from 0.78 to 0.81 in the optimized user-user model, meaning more of the relevant products are recommended by the optimized model. The F1 score has also increased from 0.82 to 0.83, and RMSE has decreased from 1.03 to 0.96.
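Beyond aggregate metrics, a fitted surprise model can score any single user-product pair directly via predict(). A quick illustrative example (the ids below are placeholders, not values taken from the dataset):

# Predicting a single (user, product) rating with the tuned user-user model.
# The ids are placeholders: substitute any user_id / prod_id from df_final.
uid = 'SOME_USER_ID'
iid = 'SOME_PROD_ID'
pred = similarity_algo_optimized.predict(uid, iid)
print(round(pred.est, 2))  # estimated rating on the 1-5 scale
# pred.details['was_impossible'] is True when the model had to fall back,
# e.g. when fewer than min_k neighbours rated the product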
We can also find the users most similar to a given user (its nearest neighbors) with the KNNBasic algorithm. Below, we find the 5 most similar users to the user with inner id 0, based on the MSD similarity measure.
Show the code
# Using the msd similarity measure for user-user based collaborative filtering
sim_options = {'name': 'msd',
               'user_based': True}

# Creating an instance of KNNBasic with the optimal hyperparameter values
similarity_algo_optimized_msd = KNNBasic(sim_options=sim_options, k=40, min_k=6, verbose=False)

# Training the algorithm on the train set
similarity_algo_optimized_msd.fit(trainset)
## <surprise.prediction_algorithms.knns.KNNBasic object at 0x000001D923EA0B50>

# 0 is the inner id of the user described above
similarity_algo_optimized_msd.get_neighbors(0, k=5)
## [16, 42, 44, 54, 58]
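Note that get_neighbors() works with surprise's internal (inner) ids rather than the raw user_id strings. The trainset can translate between the two, as in this small sketch:

# Map the inner ids returned by get_neighbors() back to the original user_id values
inner_neighbors = similarity_algo_optimized_msd.get_neighbors(0, k=5)
raw_user = trainset.to_raw_uid(0)
raw_neighbors = [trainset.to_raw_uid(inner_id) for inner_id in inner_neighbors]
print('User:', raw_user)
print('Most similar users:', raw_neighbors)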
Now let us look at similarity-based collaborative filtering where the similarity is computed between items.
Show the code
# Declaring the similarity options
sim_options = {'name': 'cosine',
               'user_based': False}

# The KNN algorithm is used to find similar items; random_state=1
algo_knn_item = KNNBasic(sim_options=sim_options, verbose=False, random_state=1)

# Train the algorithm on the trainset
algo_knn_item.fit(trainset)
## <surprise.prediction_algorithms.knns.KNNBasic object at 0x000001D924D3BD00>

# Compute precision@k, recall@k, and F_1 score with k=10
utils.precision_recall_at_k(algo_knn_item, testset)
## RMSE: 1.0232
## Precision: 0.835
## Recall: 0.758
## F_1 score: 0.795
The baseline item-item model has RMSE = 1.02. Recall = 0.76 indicates that, out of all the relevant products, 76% are recommended. Precision = 0.84 indicates that, out of all recommended products, 84% are relevant. An F1 score of 0.80 indicates that most recommended products are relevant and most relevant products are recommended. The item-item baseline model has slightly weaker benchmarks than the user-user baseline model.
Show the code
# Setting up the parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ['msd', 'cosine'],
                              'user_based': [False]}}

# Performing 3-fold cross-validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=1)

# Fitting the data
gs.fit(data)
## Computing the msd similarity matrix...
## Done computing similarity matrix.
## (the two lines above repeat for every parameter combination and fold)

# Find the best RMSE score
print(gs.best_score['rmse'])
## 0.9748912606019764

# Find the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])
## {'k': 20, 'min_k': 6, 'sim_options': {'name': 'msd', 'user_based': False}}
Now let's build the final model using the tuned hyperparameter values obtained from grid search cross-validation.
Show the code
# Using the optimal similarity measure for item-item based collaborative filtering
# Creating an instance of KNNBasic with the optimal hyperparameter values
similarity_algo_optimized_item = KNNBasic(sim_options={'name': 'msd', 'user_based': False},
                                          k=30, min_k=9, verbose=False)

# Training the algorithm on the train set
similarity_algo_optimized_item.fit(trainset)
## <surprise.prediction_algorithms.knns.KNNBasic object at 0x000001D926D5B8E0>

# Compute precision@k, recall@k, F1 score, and RMSE
utils.precision_recall_at_k(similarity_algo_optimized_item, testset)
## RMSE: 0.9681
## Precision: 0.836
## Recall: 0.8
## F_1 score: 0.818
Similar to the user-user model, the item-item model with tuned hyperparameters has improved recall compared to the baseline model: recall has increased from 0.76 to 0.80, and the F1 score has increased from 0.80 to 0.82.
Model 3: Model-Based Collaborative Filtering - Matrix Factorization
Model-based collaborative filtering is a personalized recommendation approach in which recommendations are based only on a user's past behavior, without any additional information. Latent features learned from the rating matrix are used to generate recommendations for each user.
SVD (Singular Value Decomposition) is used to compute the latent features from the user-item matrix. Classical SVD cannot be applied directly when the user-item matrix has missing values, so the surprise SVD algorithm instead learns the latent factors from the observed ratings only, using stochastic gradient descent.
Show the code
# Using SVD matrix factorization with random_state=1
svd = SVD(random_state=1)

# Training the algorithm on the trainset
svd.fit(trainset)
## <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x000001D924D3B550>

# Compute precision@k, recall@k, F1 score, and RMSE
utils.precision_recall_at_k(svd, testset)
## RMSE: 0.8989
## Precision: 0.86
## Recall: 0.797
## F_1 score: 0.827
The baseline SVD model has RMSE = 0.90. Recall = 0.80 indicates that, out of all the relevant products, 80% are recommended. Precision = 0.86 indicates that, out of all recommended products, 86% are relevant. An F1 score of 0.83 indicates that most recommended products are relevant and most relevant products are recommended.
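To make the idea of latent features concrete, the fitted SVD model exposes its learned factors and biases, and a single estimate can be reconstructed by hand as mu + b_u + b_i + p_u · q_i. A small sketch using surprise's internal ids (the indices chosen are purely illustrative):

# Peeking at the learned latent structure of the fitted baseline SVD model.
# surprise stores user factors in svd.pu, item factors in svd.qi,
# and the biases in svd.bu / svd.bi.
inner_uid, inner_iid = 0, 0  # first user / item in the trainset (illustrative)
mu = trainset.global_mean
est = mu + svd.bu[inner_uid] + svd.bi[inner_iid] + np.dot(svd.pu[inner_uid], svd.qi[inner_iid])

print('User factor matrix shape:', svd.pu.shape)  # (n_users, n_factors), 100 factors by default
print('Item factor matrix shape:', svd.qi.shape)
print('Manually reconstructed estimate:', round(est, 3))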
Next we will try to improve the matrix factorization based recommendation system by tuning its hyperparameters. Below we tune only three of them:
n_epochs: The number of iterations of the SGD algorithm.
lr_all: The learning rate for all parameters.
reg_all: The regularization term for all parameters.
Show the code
# Set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# Performing 3-fold grid search cross-validation
gs_ = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=1)

# Fitting the data
gs_.fit(data)

# Best RMSE score
print(gs_.best_score['rmse'])
## 0.8982515149903835

# Combination of parameters that gave the best RMSE score
print(gs_.best_params['rmse'])
## {'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.4}
Now we build the final model using the tuned hyperparameter values obtained from the grid search cross-validation above.
Show the code
# Build the optimized SVD model with random_state=1
svd_optimized = SVD(n_epochs=20, lr_all=0.001, reg_all=0.2, random_state=1)

# Train the algorithm on the trainset
svd_optimized = svd_optimized.fit(trainset)

# Compute precision@k, recall@k, F1 score, and RMSE
utils.precision_recall_at_k(svd_optimized, testset)
## RMSE: 0.9277
## Precision: 0.854
## Recall: 0.813
## F_1 score: 0.833
With tuned hyperparameters, the SVD model achieved slightly higher recall and F1 score than the baseline model, although its RMSE is somewhat higher. Recall = 0.81 indicates that 81% of relevant products were recommended, and precision = 0.85 indicates that 85% of recommended products were relevant. The F1 score of 0.83 indicates that a high proportion of recommended products are relevant and a high proportion of relevant products are recommended.
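To turn these rating estimates into actual recommendations, a common pattern is to score every product a user has not yet rated and keep the products with the highest estimates. A minimal sketch using the tuned SVD model (the helper and the user id are illustrative, not part of the original analysis):

# Sketch: top-5 recommendations for one user from the tuned SVD model.
# 'SOME_USER_ID' is a placeholder; substitute any user_id from df_final.
def recommend_top_n(model, data_df, user_id, n=5):
    # Products the user has already rated are excluded from the candidates
    rated = set(data_df.loc[data_df['user_id'] == user_id, 'prod_id'])
    candidates = [p for p in data_df['prod_id'].unique() if p not in rated]
    # Estimate a rating for every product the user has not rated yet
    scored = [(prod, model.predict(user_id, prod).est) for prod in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:n]

recommend_top_n(svd_optimized, df_final, 'SOME_USER_ID', n=5)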
Conclusions
Overall, the SVD model with tuned hyperparameters achieved the best evaluation metrics, including an F1 score of 0.833. The SVD model represents both users and products in a low-dimensional latent space, so it accounts for latent factors, unlike the similarity-based models. It can also make predictions for users who do not have enough nearest neighbors with overlapping ratings, which the user-user and item-item similarity-based systems cannot.
The item-item similarity-based model predicted ratings that were closer to the actual ratings than those of the user-user similarity-based model and the SVD model. Next steps include combining different recommendation techniques into a more complex model, such as a hybrid recommendation system.
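As a pointer for that next step, a hybrid can be as simple as a weighted blend of two already-trained models. A minimal sketch (the 0.7 / 0.3 weights and the ids are placeholders that would need tuning):

# Illustrative sketch of a simple weighted hybrid, blending the tuned SVD and
# item-item models already fitted above.
def hybrid_estimate(user_id, prod_id, w_svd=0.7, w_item=0.3):
    svd_est = svd_optimized.predict(user_id, prod_id).est
    item_est = similarity_algo_optimized_item.predict(user_id, prod_id).est
    return w_svd * svd_est + w_item * item_est

# Example call with placeholder ids
hybrid_estimate('SOME_USER_ID', 'SOME_PROD_ID')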